A high-performance Python implementation of the Universal Numerical Fingerprint (UNF) v6 specification, a format-agnostic standard for data fingerprinting.
dartfx-unf is a blazing-fast, memory-efficient calculator for UNF Version 6. It ensures that your data remains identifiable and consistent across different software versions, file formats, and operating systems by normalizing and hashing the underlying data values rather than the file itself.
Built on top of the Polars engine, it provides native support for massive datasets with a professional-grade CLI and a clean Python API.
This package was vibe-coded with Claude Opus 4.6 and Gemini 3 Flash.
- **Full Compliance**: Implements the complete UNF v6 spec (numeric, string, date/time, bit field, and boolean types).
- **Polars-Powered Speed**: Near-C performance using vectorized, Rust-based execution.
- **Out-of-Core Streaming**: Process multi-gigabyte files with constant memory overhead.
- **Multi-Format**: Native support for Parquet and CSV.
- **Structured Reporting**: Generates detailed JSON reports compliant with a built-in schema.
- **Dataset Hashing**: Combine fingerprints from multiple files into a single dataset-level hash.
We recommend using uv for fast environment management.
This package is not yet published on PyPI. Install from source using the steps below.
```shell
git clone https://github.com/DataArtifex/dartfx-unf.git
cd dartfx-unf

# Option 1: pip (editable install)
pip install -e .

# Option 2: uv (editable install)
uv pip install -e .
```

For development with uv, sync the project environment instead:

```shell
git clone https://github.com/DataArtifex/dartfx-unf.git
cd dartfx-unf
uv sync
```

Calculate fingerprints directly from your terminal:
```shell
# Basic JSON report
uv run dartfx-unf data.parquet

# Disable automatic date parsing for CSVs
uv run dartfx-unf --no-parse-date data.csv

# Quiet mode (just the hash)
uv run dartfx-unf --quiet data.parquet

# Detailed summary table
uv run dartfx-unf --verbose file1.csv file2.parquet
```

Integrate UNF calculation into your data pipelines:
```python
from dartfx.unf import unf_file

# Calculate and print the hash
report = unf_file("results.parquet")
print(f"UNF: {report.result.unf}")

# Export to validated JSON
json_report = report.to_json(validate=True)
```

Documentation: Complete documentation is available at http://dataartifex.org/docs/dartfx-unf
To meet the high-performance and "streaming" requirements of modern data science, dartfx-unf leverages Polars:
- Vectorized Expressions: Normalization steps map directly to efficient SIMD operations.
- Lazy Execution: Optimizes I/O and computation order.
- Memory Efficiency: Polars' streaming mode allows us to hash files that are larger than the available RAM.
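The constant-memory claim rests on a simple property: MD5, the digest UNF v6 is built on, can be updated incrementally, so normalized values can be hashed chunk by chunk as they stream off disk rather than all at once. A minimal stdlib-only sketch of the idea (illustrative; this is not dartfx-unf's internal API):

```python
import base64
import hashlib

def streamed_unf(chunks):
    """Hash an iterable of already-normalized, null-terminated byte
    chunks without materializing the whole column in memory."""
    md5 = hashlib.md5()
    for chunk in chunks:
        md5.update(chunk)  # constant memory: one chunk at a time
    return "UNF:6:" + base64.b64encode(md5.digest()).decode("ascii")

# Feeding the data in one piece or in many pieces yields the same hash.
one_shot = streamed_unf([b"+1.e+\n\x00+2.e+\n\x00"])
streamed = streamed_unf([b"+1.e+\n\x00", b"+2.e+\n\x00"])
assert one_shot == streamed
```

Polars' streaming engine supplies the chunks; the fingerprint is identical to what a single in-memory pass would produce.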
The UNF algorithm is format-agnostic and column-order invariant. It ensures that identical dataset values produce the same fingerprint regardless of how they are stored.
UNF can be used to calculate a fingerprint for a single value (a "vector" of one element). Identical values across different types (e.g., integer vs. float) or representations (e.g., date vs. string) result in consistent hashes when normalized according to the specification.
| Data Type | Value | Normalized Form (§Ia) | Resulting UNF |
|---|---|---|---|
| Numeric | `1` / `1.0` | `+1.e+` | `UNF:6:tv3XYCv524AfmlFyVOhuZg==` |
| Numeric | `0` / `-0.0` | `+0.e+` / `-0.e+` | See spec for sign details |
| String | `"A character String"` | `"A character String\n\0"` | `UNF:6:FYqU7uBl885eHMbpco1ooA==` |
| Date | `2014-01-13` | `"2014-01-13\n\0"` | `UNF:6:PH+jFA4u+yJSs1sIw64dyw==` |
| Boolean | `true` | `+1.e+` | `UNF:6:tv3XYCv524AfmlFyVOhuZg==` |
The UNF remains identical even if we swap the column order or change the storage format (e.g., from CSV to Parquet).
`dataset_v1.csv`:

```
id,name,sex,dob,income
1,Alice,F,2007-01-12,75000
2,Bob,M,1985-05-15,160000
3,Charlie,M,1992-08-20,50000
```

`dataset_v2.csv` (columns reordered):

```
id,income,dob,name,sex
1,75000,2007-01-12,Alice,F
2,160000,1985-05-15,Bob,M
3,50000,1992-08-20,Charlie,M
```

All variations yield the same fingerprint:
```shell
# Original CSV
$ uv run dartfx-unf --quiet dataset_v1.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# CSV with different column order
$ uv run dartfx-unf --quiet dataset_v2.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# Binary Parquet version of the same data
$ uv run dartfx-unf --quiet dataset_v1.parquet
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==
```

Note: If you change the row order, the UNF will change, as the sequence of observations is significant.
You can inspect the fingerprints of individual variables. This is useful for identifying which specific column changed between two versions of a dataset.
```shell
$ uv run dartfx-unf --verbose dataset_v1.csv
```

Output:

```
COLUMN   | TYPE    | UNF
--------------------------------------------------------------------------------
id       | numeric | UNF:6:AvELPR5QTaBbnq6S22Msow==
name     | string  | UNF:6:G3RHxSQPXELRGHIJ+FV6qA==
sex      | string  | UNF:6:VSDSXcRD7ShBmQqv1WR9EA==
dob      | date    | UNF:6:PH+jFA4u+yJSs1sIw64dyw==
income   | numeric | UNF:6:v/5E9kHI79TVvlGYinvxTQ==
```
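Per the UNF spec, the file-level fingerprint is derived from these per-column fingerprints rather than from the raw bytes: the column UNFs are sorted and then hashed as if they were themselves a character vector. The sketch below is a hedged illustration of that combination step under that reading of the spec, not dartfx-unf's actual code; details such as string truncation are omitted.

```python
import base64
import hashlib

def combine_unfs(column_unfs):
    """Combine per-column UNFs into a file-level UNF: sort the UNF
    strings, terminate each with \\n\\0, concatenate, and MD5-hash."""
    md5 = hashlib.md5()
    for unf in sorted(column_unfs):
        md5.update(unf.encode("ascii") + b"\n\x00")
    return "UNF:6:" + base64.b64encode(md5.digest()).decode("ascii")

cols = [
    "UNF:6:AvELPR5QTaBbnq6S22Msow==",   # id
    "UNF:6:G3RHxSQPXELRGHIJ+FV6qA==",   # name
]
# Column order does not matter, because the UNFs are sorted first --
# this is exactly why reordering columns leaves the file UNF unchanged.
assert combine_unfs(cols) == combine_unfs(list(reversed(cols)))
```

The `sorted()` call is what makes the fingerprint column-order invariant, while row order still matters because each column hash depends on the sequence of its values.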
You can customize the normalization parameters (digits of precision, hash bits, etc.). Changes to these parameters are automatically encoded in the resulting UNF header:
```shell
# Standard 7-digit precision (default)
$ uv run dartfx-unf --quiet dataset_v1.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# Higher 9-digit precision (note the N9 parameter in the header)
$ uv run dartfx-unf --quiet --digits 9 dataset_v1.csv
UNF:6:N9:NvK8CwEepCVQZdjiFCGf2A==
```

Contributions are welcome! Please see CONTRIBUTING.md for guidelines and the Implementation Roadmap for current progress.
This project is licensed under the MIT License. See LICENSE.txt for details.