dartfx-unf

A high-performance Python implementation of the Universal Numerical Fingerprint (UNF) v6 specification, a format-agnostic standard for data fingerprinting.

Overview

dartfx-unf is a fast, memory-efficient calculator for UNF version 6. It keeps your data identifiable and consistent across software versions, file formats, and operating systems by normalizing and hashing the underlying data values rather than the bytes of the file itself.

Built on top of the Polars engine, it handles large datasets and ships with both a command-line interface and a clean Python API.

This package was vibe coded with Claude Opus 4.6 and Gemini 3 Flash.

Key Features

✅ Full Compliance: Implements the complete UNF v6 spec (Numeric, String, Date/Time, Bit Fields, and Booleans).
🚀 Polars-Powered Speed: Near-C performance using vectorized Rust-based execution.
🧊 Out-of-Core Streaming: Process multi-gigabyte files with constant memory overhead.
📦 Multi-Format: Native support for Parquet and CSV.
📋 Structured Reporting: Generates detailed JSON reports compliant with a built-in schema.
🔗 Dataset Hashing: Combine fingerprints from multiple files into a single dataset-level hash.

Installation

We recommend using uv for fast environment management.

This package is not yet published on PyPI. Install from source using the steps below.

For Users

git clone https://github.com/DataArtifex/dartfx-unf.git
cd dartfx-unf

# Option 1: pip (editable install)
pip install -e .

# Option 2: uv (editable install)
uv pip install -e .

For Developers

git clone https://github.com/DataArtifex/dartfx-unf.git
cd dartfx-unf
uv sync

Quick Start

Command Line Interface

Calculate fingerprints directly from your terminal:

# Basic JSON report
uv run dartfx-unf data.parquet

# Disable automatic date parsing for CSVs
uv run dartfx-unf --no-parse-date data.csv

# Quiet mode (just the hash)
uv run dartfx-unf --quiet data.parquet


# Detailed summary table
uv run dartfx-unf --verbose file1.csv file2.parquet

Python API

Integrate UNF calculation into your data pipelines:

from dartfx.unf import unf_file

# Calculate and print the hash
report = unf_file("results.parquet")
print(f"UNF: {report.result.unf}")

# Export to validated JSON
json_report = report.to_json(validate=True)

📚 Documentation: Complete documentation is accessible at http://dataartifex.org/docs/dartfx-unf

Why Polars?

To meet the high-performance and "streaming" requirements of modern data science, dartfx-unf leverages Polars:

Vectorized Expressions: Normalization steps map directly to efficient SIMD operations.
Lazy Execution: Optimizes I/O and computation order.
Memory Efficiency: Polars' streaming mode allows us to hash files that are larger than the available RAM.
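
The memory-efficiency point rests on a general property of hash functions: they can be fed incrementally, so the full column never has to be in memory at once. The following is a minimal standard-library sketch of that principle only. It does not use Polars and is not the package's implementation; `batches`, `streaming_digest`, and `one_shot_digest` are hypothetical names introduced for illustration.

```python
import hashlib
from typing import Iterable, Iterator


def batches(values: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` values."""
    batch: list[str] = []
    for value in values:
        batch.append(value)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def streaming_digest(values: Iterable[str], batch_size: int = 2) -> str:
    """Hash values batch by batch; only one batch is held in memory at a time."""
    hasher = hashlib.sha256()
    for batch in batches(values, batch_size):
        # Each value is terminated with "\n\0", as in the normalized string forms.
        hasher.update(b"".join(v.encode("utf-8") + b"\n\x00" for v in batch))
    return hasher.hexdigest()


def one_shot_digest(values: Iterable[str]) -> str:
    """Reference version that materializes every value before hashing."""
    data = b"".join(v.encode("utf-8") + b"\n\x00" for v in values)
    return hashlib.sha256(data).hexdigest()


rows = ["Alice", "Bob", "Charlie"]
# Feeding the hash in batches gives the same digest as hashing everything at once.
print(streaming_digest(iter(rows)) == one_shot_digest(rows))
```

Because the digest depends only on the byte sequence, the batch size affects memory use but not the result.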

UNF in Practice

The UNF algorithm is format-agnostic and column-order invariant. It ensures that identical dataset values produce the same fingerprint regardless of how they are stored.

Example: Basic Atomic Values

UNF can be used to calculate a fingerprint for a single value (a "vector" of one element). Identical values expressed as different types (e.g., Integer vs Float) or representations (e.g., Date vs String) produce the same hash once they are normalized according to the specification.

| Data Type | Value | Normalized Form (§Ia) | Resulting UNF |
|---|---|---|---|
| Numeric | `1` / `1.0` | `+1.e+` | `UNF:6:tv3XYCv524AfmlFyVOhuZg==` |
| Numeric | `0` / `-0.0` | `+0.e+` / `-0.e+` | See spec for sign details |
| String | `"A character String"` | `"A character String\n\0"` | `UNF:6:FYqU7uBl885eHMbpco1ooA==` |
| Date | `2014-01-13` | `"2014-01-13\n\0"` | `UNF:6:PH+jFA4u+yJSs1sIw64dyw==` |
| Boolean | `true` | `+1.e+` | `UNF:6:tv3XYCv524AfmlFyVOhuZg==` |
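
To make the normalization step concrete, here is a minimal standard-library sketch of how a single value could be normalized and hashed. It reflects one reading of the UNF v6 rules (exponential notation without trailing zeros, a `"\n\0"` terminator, SHA-256 truncated to 128 bits, base64 output) and is not the package's own code. `normalize_number`, `normalize_value`, and `unf_vector` are hypothetical names, and missing values are deliberately left out.

```python
import base64
import hashlib
import math


def normalize_number(x, digits: int = 7) -> str:
    """Render a number in exponential form, e.g. 1 -> '+1.e+' and -300 -> '-3.e+2'."""
    x = float(x)  # integers and booleans are treated as numeric values
    if math.isnan(x):
        return "+nan"
    if math.isinf(x):
        return "+inf" if x > 0 else "-inf"
    mantissa, exponent = f"{x:+.{digits - 1}e}".split("e")
    sign, rest = mantissa[0], mantissa[1:]
    lead, _, frac = rest.partition(".")
    frac = frac.rstrip("0")               # drop trailing zeros of the mantissa
    exp_sign = exponent[0]
    exp_digits = exponent[1:].lstrip("0")  # drop leading zeros of the exponent
    return f"{sign}{lead}.{frac}e{exp_sign}{exp_digits}"


def normalize_value(value, digits: int = 7, max_chars: int = 128) -> bytes:
    """Strings are truncated and kept as-is; everything else is treated as a number."""
    if isinstance(value, str):
        text = value[:max_chars]
    else:
        text = normalize_number(value, digits)
    return text.encode("utf-8") + b"\n\x00"


def unf_vector(values, digits: int = 7) -> str:
    """Hash the concatenated normalized values and keep the first 128 bits."""
    payload = b"".join(normalize_value(v, digits) for v in values)
    digest = hashlib.sha256(payload).digest()[:16]
    return "UNF:6:" + base64.b64encode(digest).decode("ascii")


# Dates are passed as ISO strings here, matching the normalized form in the table.
for value in (1, 1.0, True, "A character String", "2014-01-13"):
    print(repr(value), "->", unf_vector([value]))
```

If the sketch matches the specification, the printed fingerprints can be compared against the table; `1`, `1.0`, and `True` share a fingerprint because they normalize to the same string.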

Example: Data & Format Invariance

The UNF remains identical even if we swap the column order or change the storage format (e.g., from CSV to Parquet).

dataset_v1.csv:

id,name,sex,dob,income
1,Alice,F,2007-01-12,75000
2,Bob,M,1985-05-15,160000
3,Charlie,M,1992-08-20,50000

dataset_v2.csv (Columns reordered):

id,income,dob,name,sex
1,75000,2007-01-12,Alice,F
2,160000,1985-05-15,Bob,M
3,50000,1992-08-20,Charlie,M

All variations yield the same fingerprint:

# Original CSV
$ uv run dartfx-unf --quiet dataset_v1.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# CSV with different column order
$ uv run dartfx-unf --quiet dataset_v2.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# Binary Parquet version of the same data
$ uv run dartfx-unf --quiet dataset_v1.parquet
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

💡 Note: If you change the row order, the UNF will change, as the sequence of observations is significant.
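
The two properties above (column order does not matter, row order does) can be sketched with the standard library. The sketch assumes that a dataset-level fingerprint is obtained by sorting the column fingerprints and hashing the sorted list as a vector of strings, which is one reading of the UNF v6 specification rather than a description of this package's code. `unf_of_strings` and `combine_column_unfs` are hypothetical helpers.

```python
import base64
import hashlib


def unf_of_strings(values) -> str:
    """Fingerprint a vector of strings: terminate each with "\\n\\0", hash, truncate."""
    payload = b"".join(v.encode("utf-8") + b"\n\x00" for v in values)
    digest = hashlib.sha256(payload).digest()[:16]
    return "UNF:6:" + base64.b64encode(digest).decode("ascii")


def combine_column_unfs(column_unfs) -> str:
    """Assumed combination rule: sort the column fingerprints, then hash them."""
    # Sorting first is what removes any dependence on column order.
    return unf_of_strings(sorted(column_unfs))


names = unf_of_strings(["Alice", "Bob", "Charlie"])
sexes = unf_of_strings(["F", "M", "M"])

# Swapping columns leaves the combined fingerprint unchanged.
print(combine_column_unfs([names, sexes]) == combine_column_unfs([sexes, names]))

# Swapping rows changes the column fingerprint, and therefore the dataset's.
print(names == unf_of_strings(["Bob", "Alice", "Charlie"]))
```

The first comparison prints `True` and the second prints `False`, since values within a column are hashed in sequence while column fingerprints are sorted before being combined.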

Example: Variable (Column) UNFs

You can inspect the fingerprints of individual variables. This is useful for identifying which specific column changed between two versions of a dataset.

$ uv run dartfx-unf --verbose dataset_v1.csv

Output:

COLUMN                         | TYPE         | UNF
--------------------------------------------------------------------------------
id                             | numeric      | UNF:6:AvELPR5QTaBbnq6S22Msow==
name                           | string       | UNF:6:G3RHxSQPXELRGHIJ+FV6qA==
sex                            | string       | UNF:6:VSDSXcRD7ShBmQqv1WR9EA==
dob                            | date         | UNF:6:PH+jFA4u+yJSs1sIw64dyw==
income                         | numeric      | UNF:6:v/5E9kHI79TVvlGYinvxTQ==
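
As a sketch of the comparison use case, the snippet below diffs two mappings of column name to UNF. It assumes the fingerprints have already been collected into plain dictionaries (for example by copying them from the verbose output); `changed_columns` is a hypothetical helper, not part of the package, and the second `income` fingerprint is a made-up placeholder.

```python
def changed_columns(old: dict[str, str], new: dict[str, str]) -> dict[str, str]:
    """Classify columns as 'added', 'removed', or 'changed' by comparing their UNFs."""
    report: dict[str, str] = {}
    for name in sorted(old.keys() | new.keys()):
        if name not in old:
            report[name] = "added"
        elif name not in new:
            report[name] = "removed"
        elif old[name] != new[name]:
            report[name] = "changed"
    return report


# Fingerprints copied from the verbose output above.
v1 = {
    "id": "UNF:6:AvELPR5QTaBbnq6S22Msow==",
    "name": "UNF:6:G3RHxSQPXELRGHIJ+FV6qA==",
    "income": "UNF:6:v/5E9kHI79TVvlGYinvxTQ==",
}
# Hypothetical later version: the income fingerprint is a placeholder, not a real UNF.
v2 = dict(v1, income="UNF:6:AAAAAAAAAAAAAAAAAAAAAA==")

print(changed_columns(v1, v2))
```

Columns whose fingerprints match are omitted from the result, so an empty dictionary means the two versions agree on every shared variable.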

Example: Choosing Precision

You can customize the normalization parameters (digits of precision, hash bits, etc.). Changes to these parameters are automatically encoded in the resulting UNF header:

# Standard 7-digit precision (Default)
$ uv run dartfx-unf --quiet dataset_v1.csv
UNF:6:/iH9nCE4fZqn1rBrrsOc7w==

# Higher 9-digit precision
$ uv run dartfx-unf --quiet --digits 9 dataset_v1.csv
UNF:6:N9:NvK8CwEepCVQZdjiFCGf2A==
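
The effect of the digits parameter on individual values can be sketched as follows. The header format for a non-default digit count (`UNF:6:N9:`) is taken from the output above; the normalization function is a simplified reading of the specification that handles finite numbers only. Both `normalize_number` and `unf_header` are hypothetical names, not the package's API.

```python
def normalize_number(x, digits: int = 7) -> str:
    """Render a finite number in exponential form with at most `digits` significant digits."""
    mantissa, exponent = f"{float(x):+.{digits - 1}e}".split("e")
    sign, rest = mantissa[0], mantissa[1:]
    lead, _, frac = rest.partition(".")
    frac = frac.rstrip("0")
    exp_digits = exponent[1:].lstrip("0")
    return f"{sign}{lead}.{frac}e{exponent[0]}{exp_digits}"


def unf_header(digits: int = 7) -> str:
    """Prefix of the fingerprint: the default precision is implicit, others are spelled out."""
    return "UNF:6:" if digits == 7 else f"UNF:6:N{digits}:"


for digits in (7, 9):
    print(unf_header(digits), normalize_number(1 / 3, digits), normalize_number(75000, digits))
```

A value such as `1/3` normalizes to a longer string at 9 digits than at 7, while a value like `75000` normalizes identically at both settings.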

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines and the Implementation Roadmap for current progress.

License

This project is licensed under the MIT License. See LICENSE.txt for details.

