Validate CSV data blazingly-fast using JSON Schema Validation (Draft 2020-12) (e.g. up to 780,031 rows/second[^1] using NYC's 311 schema generated by the schema command) & put invalid records into a separate file along with a detailed validation error report.

Supports several custom JSON Schema formats & keywords:
- currency - custom format with ISO-4217 validation
- dynamicEnum - custom keyword that supports enum validation against a CSV on the filesystem or a URL (http/https/ckan & dathere URL schemes supported)
- uniqueCombinedWith - custom keyword to validate uniqueness across multiple columns for composite key validation

If no JSON Schema file is provided, validates if a CSV conforms to the RFC 4180 standard and is UTF-8 encoded.
Table of Contents | Source: src/cmd/validate.rs | 📇🚀🌐📚🗄️
Description | Examples | Usage | Arguments | Validate Options | Fancy Regex Options | Options For Both Regex Engines | Email Validation Options | Common Options
Description ↩
Validates CSV data using two main modes:
This mode is invoked if a JSON Schema file (draft 2020-12) is provided.
The CSV data is validated against the JSON Schema. If the CSV data is valid, no output files are created and the command returns an exit code of 0.
If invalid records are found, they are put into an "invalid" file, with the rest of the records put into a "valid" file.
A "validation-errors.tsv" report is also created with the following columns:
- row_number: the row number of the invalid record
- field: the field name of the invalid field
- error: a validation error message detailing why the field is invalid
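For illustration, a hypothetical validation-errors.tsv might look like this (the row numbers, field names, and error messages below are invented examples, not actual qsv output):

```tsv
row_number	field	error
5	age	"seventeen" is not of type "integer"
12	zip_code	"1234" does not match "^[0-9]{5}$"
```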
It uses the JSON Schema Validation Specification (draft 2020-12) to validate the CSV. It validates the structure of the file, as well as the data types and domain/range of the fields. See https://json-schema.org/draft/2020-12/json-schema-validation.html
qsv supports a custom format - currency. This format will only accept a valid currency, defined as:
- ISO Currency Symbol (optional): This is the ISO 4217 three-character code or currency symbol (e.g. USD, EUR, JPY, $, €, ¥, etc.)
- Amount: This is the numerical value of the currency. More than 2 decimal places are allowed.
- Formats: Valid currency formats include: Standard: $1,000.00 or USD1000.00 Negative amounts: ($100.00) or -$100.00 Different styles: 1.000,00 (used in some countries for euros)
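To use the custom format in a schema, declare it with the standard JSON Schema "format" keyword. A minimal sketch (the "unit_price" property name is an illustrative assumption, not from the source):

```json
{
  "properties": {
    "unit_price": {
      "type": "string",
      "format": "currency"
    }
  }
}
```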
qsv also supports two custom keywords - `dynamicEnum` and `uniqueCombinedWith`.

dynamicEnum allows for dynamic validation against a reference CSV file.
It can be used to validate against a set of values unknown at the time of schema creation or
when the set of valid values is dynamic or too large to hardcode into the JSON Schema with enum.
The reference CSV file can be local or a URL (http/https, dathere & ckan schemes supported).
The "dynamicEnum" value has the form:
// qsvlite binary variant only supports URIs which can be files on the local filesystem
// or remote files (http and https schemes supported)
dynamicEnum = "URI|colname" where colname is the column name or column index (0-based)

// use data.csv from the current working directory; use the 1st column for validation
dynamicEnum = "data.csv"

// use data.csv in the /lookup_dir directory; use the column "Agency" for validation
dynamicEnum = "/lookup_dir/data.csv|Agency"

// get data.csv; use the 3rd column for validation (2 as the col index is 0-based)
dynamicEnum = "https://example.com/data.csv|2"

// on other qsv binary variants, dynamicEnum has expanded caching functionality
dynamicEnum = "[cache_name;cache_age]|URI|colname" where cache_name and cache_age are optional

// use data.csv from current working directory; cache it as data with a default
// cache age of 3600 seconds i.e. the cached data.csv expires after 1 hour
dynamicEnum = "data.csv"

// get data.csv; cache it as custom_name, cache age 600 seconds
dynamicEnum = "custom_name;600|https://example.com/data.csv"

// get data.csv; cache it as data, cache age 800 seconds
dynamicEnum = ";800|https://example.com/data.csv"

// get the top matching result for nyc_neighborhoods (signaled by trailing ?),
// cache it as nyc_neighborhood_data.csv (NOTE: cache name is required when using the CKAN scheme)
// with a default cache age of 3600 seconds
// be sure to set --ckan-api, otherwise it will default to datHere's CKAN (data.dathere.com)
dynamicEnum = "nyc_neighborhood_data|ckan://nyc_neighborhoods?"

// get CKAN resource with id 1234567, cache it as resname, 3600 secs cache age
// note that if the resource is a private resource, you'll need to set --ckan-token
dynamicEnum = "resname|ckan://1234567"

// same as above but with a cache age of 100 seconds; use the borough column for validation
dynamicEnum = "resname;100|ckan://1234567|borough"

// get us_states.csv from datHere lookup tables
dynamicEnum = "dathere://us_states.csv"
If colname is not specified, the first column of the CSV file is read and used for validation.
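Within a schema, dynamicEnum is attached to a field's subschema like any other keyword. A minimal sketch, assuming a hypothetical "agency" column validated against an illustrative local lookup file:

```json
{
  "properties": {
    "agency": {
      "type": "string",
      "dynamicEnum": "lookup/agencies.csv|Agency"
    }
  }
}
```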
uniqueCombinedWith allows you to validate that combinations of values across specified columns
are unique. It can be used with either column names or column indices (0-based). For example:
// Validate that combinations of name and email are unique
uniqueCombinedWith = ["name", "email"]

// Validate that combinations of columns at indices 1 and 2 are unique
uniqueCombinedWith = [1, 2]

// Validate that the combinations of named and indexed columns are unique
uniqueCombinedWith = ["name", 2]
When a duplicate combination is found, the validation will fail and the error message will indicate which columns had duplicate combinations (named columns first, then indexed columns). The invalid records will be written to the .invalid file, while valid records will be written to the .valid file.
uniqueCombinedWith complements the standard uniqueItems keyword, which can only validate
uniqueness across a single column.
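As a hedged sketch of how this might appear in a schema (placing the keyword at the top level alongside properties is an assumption here; check a schema generated by qsv schema for the exact layout):

```json
{
  "properties": {
    "name": { "type": "string" },
    "email": { "type": "string" }
  },
  "uniqueCombinedWith": ["name", "email"]
}
```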
You can create a JSON Schema file from a reference CSV file using the qsv schema command.
Once the schema is created, you can fine-tune it to your needs and use it to validate other CSV
files that have the same structure.
Be sure to select a "training" CSV file that is representative of the data you want to validate when creating a schema. The data types, domain/range and regular expressions inferred from the reference CSV file should be appropriate for the data you want to validate.
Typically, after creating a schema, you should edit it to fine-tune each field's inferred validation rules.
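For instance, you might tighten an inferred field by hand. A hypothetical edit using standard JSON Schema keywords (the field name and bounds are illustrative):

```json
{
  "properties": {
    "age": {
      "type": "integer",
      "minimum": 0,
      "maximum": 120
    }
  }
}
```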
For example, suppose we created a JSON Schema file called "reference.schema.json" using the schema command and want to validate "mydata.csv", which we know has validation errors. Running

qsv validate mydata.csv reference.schema.json

produces the output files:
- mydata.csv.valid
- mydata.csv.invalid
- mydata.csv.validation-errors.tsv
With an exit code of 1 to indicate a validation error.
If we validate another CSV file, "mydata2.csv", which we know is valid, there are no output files, and the exit code is 0.
If piped from stdin, the filenames will use stdin.csv as the base filename. For example:
cat mydata.csv | qsv validate reference.schema.json
- stdin.csv.valid
- stdin.csv.invalid
- stdin.csv.validation-errors.tsv
validate also has a schema subcommand to validate JSON Schema files themselves. E.g.
qsv validate schema myjsonschema.json
// ignore format validation
qsv validate schema --no-format-validation myjsonschema.json
If run without a JSON Schema file, the CSV is validated for RFC 4180 CSV standard compliance (see https://github.com/dathere/qsv#rfc-4180-csv-standard).
It also confirms if the CSV is UTF-8 encoded.
For both modes, returns exit code 0 when the CSV file is valid, and an exit code > 0 otherwise. If all records are valid, no output files are produced.
Examples ↩
Validate a CSV file. Use this to check if a CSV file is readable by qsv.
qsv validate data.csv

Validate a TSV file against a JSON Schema

qsv validate data.tsv schema.json

Validate multiple CSV files using various dialects against a JSON Schema

qsv validate data1.csv data2.tab data3.ssv schema.json

Validate all CSV files in a directory against a JSON Schema

qsv validate /path/to/csv_directory schema.json

Validate CSV files listed in a '.infile-list' file against a JSON Schema

qsv validate files.infile-list schema.json

For more examples, see the tests included in this file (denoted by '#[test]') or see
https://github.com/dathere/qsv/blob/master/tests/test_validate.rs.
Usage ↩
qsv validate schema [--no-format-validation] [<json-schema>]
qsv validate [options] [<input>...]
qsv validate [options] [<input>] <json-schema>
qsv validate --help

Arguments ↩
| Argument | Description |
|---|---|
<input> |
... Input CSV file(s) to validate. If not provided, will read from stdin. If input is a directory, all files in the directory will be validated. If the input is a file with a '.infile-list' extension, the file will be read as a list of input files. If the inputs are snappy-compressed files, they will be decompressed automatically. Extended Input Support is only available for RFC 4180 validation mode. |
<json-schema> |
JSON Schema file to validate against. If not provided, validate will run in RFC 4180 validation mode. The file can be a local file or a URL (http and https schemes supported). |
Validate Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑‑trim |
flag | Trim leading and trailing whitespace from fields before validating. | |
‑‑no‑format‑validation |
flag | Disable JSON Schema format validation. Ignores all JSON Schema "format" keywords (e.g. date, email, uri, currency, etc.). This is useful when you want to validate the structure of the CSV file without worrying about the data types and domain/range of the fields. | |
‑‑fail‑fast |
flag | Stops on first error. | |
‑‑valid |
string | Valid record output file suffix. | valid |
‑‑invalid |
string | Invalid record output file suffix. | invalid |
‑‑json |
flag | When validating without a JSON Schema, return the RFC 4180 check as a JSON file instead of a message. | |
‑‑pretty‑json |
flag | Same as --json, but pretty printed. | |
‑‑valid‑output |
string | Changes validation mode behavior so that if ALL rows are valid, they are passed to the output, the exit code is 1, and stderr is set to the number of valid rows. Setting this overrides the default behavior of creating a valid file only when there are invalid records. To send valid records to stdout, use - as the filename. |
|
‑j,‑‑jobs |
string | The number of jobs to run in parallel. When not set, the number of jobs is set to the number of CPUs detected. | |
‑b,‑‑batch |
string | The number of rows per batch to load into memory, before running in parallel. Automatically determined for CSV files with more than 50000 rows. Set to 0 to load all rows in one batch. Set to 1 to force batch optimization even for files with less than 50000 rows. | 50000 |
Fancy Regex Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑‑fancy‑regex |
flag | Use the fancy regex engine instead of the default regex engine for validation. The fancy engine supports advanced regex features such as lookaround and backreferences, but is not as performant as the default regex engine which guarantees linear-time matching, prevents DoS attacks, and is more efficient for simple patterns. | |
‑‑backtrack‑limit |
string | Set the approximate number of backtracking steps allowed. This is only used when --fancy-regex is set. | 1000000 |
Options For Both Regex Engines ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑‑size‑limit |
string | Set the approximate size limit, in megabytes, of a compiled regex. | 50 |
‑‑dfa‑size‑limit |
string | Set the approximate capacity, in megabytes, of the cache of transitions used by the engine's lazy Deterministic Finite Automaton (DFA). | 10 |
‑‑timeout |
string | Timeout for downloading json-schemas on URLs and for 'dynamicEnum' lookups on URLs. If 0, no timeout is used. | 30 |
‑‑cache‑dir |
string | The directory to use for caching downloaded dynamicEnum resources. If the directory does not exist, qsv will attempt to create it. If the QSV_CACHE_DIR envvar is set, it will be used instead. Not available on qsvlite. | ~/.qsv-cache |
‑‑ckan‑api |
string | The URL of the CKAN API to use for downloading dynamicEnum resources with the "ckan://" scheme. If the QSV_CKAN_API envvar is set, it will be used instead. Not available on qsvlite. | https://data.dathere.com/api/3/action |
‑‑ckan‑token |
string | The CKAN API token to use. Only required if downloading private resources. If the QSV_CKAN_TOKEN envvar is set, it will be used instead. Not available on qsvlite. |
Email Validation Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑‑email‑required‑tld |
flag | Require the email to have a valid Top-Level Domain (TLD) (e.g. .com, .org, .net, etc.). e.g. "john.doe@example" is VALID if this option is NOT set. | |
‑‑email‑display‑text |
flag | Allow display text in emails. e.g. "John Doe john.doe@example.com" is INVALID if this option is NOT set. | |
‑‑email‑min‑subdomains |
string | Minimum number of subdomains required in the email. e.g. "jdoe@example.com" is INVALID if this option is set to 3, but "jdoe@sub.example.com" is VALID. | 2 |
‑‑email‑domain‑literal |
flag | Allow domain literals in emails. e.g. "john.doe@[127.0.0.1]" is VALID if this option is set. |
Common Options ↩
| Option | Type | Description | Default |
|---|---|---|---|
‑h,‑‑help |
flag | Display this message | |
‑n,‑‑no‑headers |
flag | When set, the first row will not be interpreted as headers. It will be validated with the rest of the rows. Otherwise, the first row will always appear as the header row in the output. Note that this option is only valid when running in RFC 4180 validation mode as JSON Schema validation requires headers. | |
‑d,‑‑delimiter |
string | The field delimiter for reading CSV data. Must be a single character. | |
‑p,‑‑progressbar |
flag | Show progress bars. Not valid for stdin. | |
‑q,‑‑quiet |
flag | Do not display validation summary message. |