db-diff is a Python CLI tool and library for comparing CSV, TSV, and JSON database dumps. Designed specifically for database workflows, it delivers precise, human-readable or machine-readable diffs, supports custom key columns, handles massive files efficiently with streaming mode, and offers flexible field selection.
- Compare CSV, TSV, and JSON files for differences
- Human-readable and machine-readable (JSON) output
- JSON output to file with filename and path support
- Optional timestamp in output filenames
- Detects added, removed, and changed rows and columns
- Supports custom key columns for row identity
- Field inclusion/exclusion for focused diffs
- Streaming mode for very large files (memory efficient)
- Can output to terminal or JSON file
- Automatic delimiter and encoding detection
- Python 3.6+ compatible
```
pip install git+https://github.com/datsom1/db-diff.git
```
To upgrade to the latest version:
```
pip install --upgrade --force-reinstall git+https://github.com/datsom1/db-diff.git
```
- Python 3.6 or higher
- Dependencies are automatically installed with pip
Suppose you have two CSV files:
one.csv
```
Id,name,age
1,Cleo,4
2,Pancakes,2
```
two.csv
```
Id,name,age
1,Cleo,5
3,Bailey,1
```
Compare them using:
```
db-diff one.csv two.csv --key=Id
```
Sample output:
```
1 rows changed, 1 rows added, 1 rows removed
1 rows changed
  Id: 1
    age: "4" => "5"
1 rows added
  Id: 3
  name: Bailey
  age: 1
1 rows removed
  Id: 2
  name: Pancakes
  age: 2
```
db-diff is a flexible CLI tool for comparing two data files (CSV, TSV, or JSON). It detects added, removed, and changed rows and columns, and can output results in a human-readable or machine-readable format.
Basic usage:
```
db-diff [OPTIONS] PREVIOUS CURRENT
```
- `PREVIOUS` and `CURRENT` are the file paths to the two files you want to compare.
- The tool auto-detects file format by extension, or you can specify with `--format`.
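If a file's extension does not match its contents, the format can be forced with `--format`; for example (the filenames here are only illustrative):
```
db-diff export_old.txt export_new.txt --format=csv
```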
Key features:
- Custom Key Column: Use `--key` to specify which column uniquely identifies rows.
- Output Formats: Choose between human-readable (`readable`), JSON (`json`), or save JSON to a file (`jsonfile`).
- Field Selection: Use `--fields` to compare only specific columns, or `--ignorefields` to exclude columns.
- Streaming Mode Options: Choose between different streaming modes for CSV/TSV files:
  - `--streaming=auto` (default): Automatically select the best mode based on file size and key ordering
  - `--streaming=normal`: Memory-efficient but slower processing
  - `--streaming=sortedkey`: Very fast streaming mode for files with sorted key columns
  - `--streaming=off`: Disable streaming regardless of file size
- Encoding: Specify file encoding with `--encoding`.
- Show Unchanged: Use `--showunchanged` to display unchanged fields for changed rows.
- List Fields: Use `--listfields` to print available columns and exit.
- Timing: Use `--time` to display how long the diff operation took.
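These options compose. For example, a hypothetical run that compares only two columns and writes the diff to a timestamped JSON file (all flags are documented in the options table below):
```
db-diff one.csv two.csv --key=Id --fields=Id,name --output=jsonfile --outfilename=diffs.json --timestamp
```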
See all options:
```
db-diff --help
```
You can use db-diff as a Python library for advanced or automated workflows. The library provides functions to load data, compare datasets, and render results.
```python
from db_diff import load_csv, load_json

# Load CSV file, using a specific column as the key
with open("one.csv", encoding="utf-8") as f:
    prev = load_csv(f, key="Id")

# Load JSON file, using a specific key
with open("two.json", encoding="utf-8") as f:
    curr = load_json(f, key="Id")
```
```python
from db_diff import compare

# Compare two datasets (dictionaries keyed by your chosen column)
diff = compare(prev, curr, show_unchanged=False)
```
- `show_unchanged`: If True, includes unchanged fields for changed rows.
- `fields`: Pass a set of field names to only compare those fields.
- `ignorefields`: Pass a set of field names to ignore during comparison.
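For example, assuming `compare` accepts the `fields` and `ignorefields` keyword arguments listed above (the column names are only placeholders):
```python
from db_diff import compare

# Compare only the "name" and "age" columns; other columns are ignored.
diff = compare(prev, curr, fields={"name", "age"})

# Or compare everything except an audit column.
diff = compare(prev, curr, ignorefields={"LastModifiedDate"})
```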
```python
# For normal streaming mode (memory efficient)
from db_diff import streaming_compare_csv
diff = streaming_compare_csv(
    "one.csv",
    "two.csv",
    key="Id",
    compare_columns={"Id", "name", "age"},
    encoding="utf-8",
    dialect="excel"
)
# For sorted key streaming mode (faster, but requires sorted keys)
from db_diff import streaming_compare_csv_sorted
diff = streaming_compare_csv_sorted(
    "one.csv",
    "two.csv",
    key="Id",
    compare_columns={"Id", "name", "age"},
    encoding="utf-8",
    dialect="excel"
)
# Check if a file's key column is sorted
from db_diff import is_key_column_sorted
is_sorted = is_key_column_sorted(
    "one.csv",
    key="Id",
    encoding="utf-8",
    dialect="excel"
)
```
The result of `compare` or `streaming_compare_csv` is a dictionary:
```python
{
    "added": [ ... ],            # List of added rows (dicts)
    "removed": [ ... ],          # List of removed rows (dicts)
    "changed": [                 # List of changed rows
        {
            "key": "row_id",
            "changes": {
                "field1": ["old", "new"],
                ...
            },
            "unchanged": { ... } # (optional) if show_unchanged=True
        },
        ...
    ],
    "columns_added": [ ... ],    # List of columns added
    "columns_removed": [ ... ]   # List of columns removed
}
```
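Given that structure, the result can be post-processed directly; a minimal sketch assuming only the keys shown above:
```python
# Walk the diff dictionary returned by compare()/streaming_compare_csv()
for row in diff["added"]:
    print("added:", row)
for row in diff["removed"]:
    print("removed:", row)
for change in diff["changed"]:
    print("changed:", change["key"])
    for field, (old, new) in change["changes"].items():
        print(f"  {field}: {old!r} -> {new!r}")
```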
Render the diff as human-readable text:
```python
from db_diff import human_text

print(human_text(diff, key="Id", current=curr))
```
A complete example:
```python
from db_diff import load_csv, compare, human_text

with open("one.csv") as f1, open("two.csv") as f2:
    prev = load_csv(f1, key="Id")
    curr = load_csv(f2, key="Id")
    diff = compare(prev, curr, show_unchanged=True)
    print(human_text(diff, key="Id", current=curr))
```
See all available options with:
```
db-diff --help
```
A summary of key options:
| Option | Description |
|---|---|
| `--key TEXT` | Column to use as a unique ID for each row (default: first column header) |
| `--output TEXT` | Output format: `readable`, `json`, or `jsonfile` (default: `readable`) |
| `--outfilename FILE` | File to write JSON output to (used with `--output=jsonfile`) |
| `--outfilepath DIR` | Directory to save the output file (used with `--output=jsonfile`) |
| `--timestamp` | Add timestamp to filename (YYYY-MM-DD_HH-MM-SS_diffs_.json) |
| `--fields TEXT` | Comma-separated list of fields to compare (all others ignored) |
| `--ignorefields TEXT` | Comma-separated list of fields to ignore during comparison |
| `--showunchanged` | Show all fields for changed records, not just changed fields |
| `--format TEXT` | Explicitly specify input format: `csv`, `tsv`, or `json` (default: auto-detect) |
| `--encoding TEXT` | Input file encoding (default: utf-8) |
| `--streaming=MODE` | Streaming mode for CSV/TSV files. Options: `auto` (default), `normal`, `sortedkey`, `off` |
| `--listfields` | List available fields/columns in the input files and exit |
| `--verbose` | Enable verbose output with detailed logging |
| `--samplesize NUM` | Number of rows to sample when checking if key columns are sorted (default: 1000) |
| `--version` | Show the version and exit |
| `-h, --help` | Show help message and exit |
Show unchanged fields for changed rows:
```
db-diff one.csv two.csv --key=Id --showunchanged
```
Output as JSON:
```
db-diff one.csv two.csv --key=Id --output=json
```
Save JSON output to a file:
```
db-diff one.csv two.csv --key=Id --output=jsonfile --outfilename=diffs.json
```
Save JSON output with timestamp in filename:
```
db-diff one.csv two.csv --key=Id --output=jsonfile --timestamp
```
Compare only specific fields:
```
db-diff one.csv two.csv --key=Id --fields=Id,name
```
Ignore specific fields:
```
db-diff one.csv two.csv --key=Id --ignorefields=LastModifiedDate
```
Streaming mode options for large files:
```
# Auto mode (default) - automatically selects the best streaming mode
db-diff large1.csv large2.csv --key=Id --streaming=auto

# Normal streaming mode - memory efficient but slower
db-diff large1.csv large2.csv --key=Id --streaming=normal

# Sorted key streaming mode - very fast for files with sorted keys
db-diff large1.csv large2.csv --key=Id --streaming=sortedkey

# Disable streaming regardless of file size
db-diff large1.csv large2.csv --key=Id --streaming=off

# Adjust the sample size for key column sorting detection
db-diff large1.csv large2.csv --key=Id --samplesize=5000
```
The streaming modes work with CSV/TSV files only. The auto mode will automatically detect if the key columns are sorted and select the most efficient streaming mode for your data.
Verbose output with detailed logging:
```
db-diff one.csv two.csv --key=Id --verbose
```
- Choose the appropriate streaming mode for your data (see the sketch after this list):
  - `auto` (default): Automatically selects the best mode based on file properties
  - `sortedkey`: Fastest streaming mode, but requires key columns to be sorted in both files
  - `normal`: Memory-efficient but slower, good for very large unsorted files
  - `off`: Use when you want to process the entire files in memory regardless of size
- The tool automatically detects if key columns are sorted and selects the most efficient mode
- For very large files, the sorted key streaming mode can be 2-5x faster than normal streaming
- Streaming modes read files in chunks, so they use significantly less memory than loading entire files
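From the library, the same choice can be scripted with `is_key_column_sorted`; a minimal sketch mirroring the calls documented above (file names and the column set are only illustrative):
```python
from db_diff import (
    is_key_column_sorted,
    streaming_compare_csv,
    streaming_compare_csv_sorted,
)

prev_path, curr_path = "large1.csv", "large2.csv"  # illustrative file names
columns = {"Id", "name", "age"}                    # illustrative column set

# Use the fast sorted-key mode only when both files have sorted key columns;
# otherwise fall back to the memory-efficient normal streaming mode.
if (is_key_column_sorted(prev_path, key="Id", encoding="utf-8", dialect="excel")
        and is_key_column_sorted(curr_path, key="Id", encoding="utf-8", dialect="excel")):
    diff = streaming_compare_csv_sorted(prev_path, curr_path, key="Id",
                                        compare_columns=columns,
                                        encoding="utf-8", dialect="excel")
else:
    diff = streaming_compare_csv(prev_path, curr_path, key="Id",
                                 compare_columns=columns,
                                 encoding="utf-8", dialect="excel")
```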
Author: Thomas Coyle
Repository: https://github.com/datsom1/db-diff