db-diff is a Python CLI tool and library for comparing CSV, TSV, and JSON database dumps. Designed specifically for database workflows, it delivers precise, human-readable or machine-readable diffs, supports custom key columns, handles massive files efficiently with streaming mode, and offers flexible field selection.
- Compare CSV, TSV, and JSON files for differences
- Human-readable and machine-readable (JSON) output
- JSON output to file with filename and path support
- Optional timestamp in output filenames
- Detects added, removed, and changed rows and columns
- Supports custom key columns for row identity
- Field inclusion/exclusion for focused diffs
- Streaming mode for very large files (memory efficient)
- Can output to terminal or JSON file
- Automatic delimiter and encoding detection
- Python 3.6+ compatible
```
pip install git+https://github.com/datsom1/db-diff.git
```
To upgrade to the latest version:
```
pip install --upgrade --force-reinstall git+https://github.com/datsom1/db-diff.git
```
- Python 3.6 or higher
- Dependencies are automatically installed with pip
Suppose you have two CSV files:
one.csv
```
Id,name,age
1,Cleo,4
2,Pancakes,2
```
two.csv
```
Id,name,age
1,Cleo,5
3,Bailey,1
```
Compare them using:
```
db-diff one.csv two.csv --key=Id
```
Sample output:
```
1 rows changed, 1 rows added, 1 rows removed
1 rows changed
  Id: 1
    age: "4" => "5"
1 rows added
  Id: 3
  name: Bailey
  age: 1
1 rows removed
  Id: 2
  name: Pancakes
  age: 2
```
db-diff is a flexible CLI tool for comparing two data files (CSV, TSV, or JSON). It detects added, removed, and changed rows and columns, and can output results in a human-readable or machine-readable format.
Basic usage:
```
db-diff [OPTIONS] PREVIOUS CURRENT
```
- `PREVIOUS` and `CURRENT` are the file paths to the two files you want to compare.
- The tool auto-detects file format by extension, or you can specify with `--format`.
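If a file's extension does not match its contents, the format can be forced with `--format`; for example (the filenames here are only illustrative):
```
db-diff export_old.txt export_new.txt --format=csv
```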
Key features:
- Custom Key Column: Use `--key` to specify which column uniquely identifies rows.
- Output Formats: Choose between human-readable (`readable`), JSON (`json`), or save JSON to a file (`jsonfile`).
- Field Selection: Use `--fields` to compare only specific columns, or `--ignorefields` to exclude columns.
- Streaming Mode Options: Choose between different streaming modes for CSV/TSV files:
  - `--streaming=auto` (default): Automatically select the best mode based on file size and key ordering
  - `--streaming=normal`: Memory-efficient but slower processing
  - `--streaming=sortedkey`: Very fast streaming mode for files with sorted key columns
  - `--streaming=off`: Disable streaming regardless of file size
- Encoding: Specify file encoding with `--encoding`.
- Show Unchanged: Use `--showunchanged` to display unchanged fields for changed rows.
- List Fields: Use `--listfields` to print available columns and exit.
- Timing: Use `--time` to display how long the diff operation took.
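These options compose. For example, a hypothetical run that compares only two columns and writes the diff to a timestamped JSON file (all flags are documented in the options table below):
```
db-diff one.csv two.csv --key=Id --fields=Id,name --output=jsonfile --outfilename=diffs.json --timestamp
```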
See all options:
```
db-diff --help
```
You can use db-diff as a Python library for advanced or automated workflows. The library provides functions to load data, compare datasets, and render results.
```python
from db_diff import load_csv, load_json

# Load CSV file, using a specific column as the key
with open("one.csv", encoding="utf-8") as f:
    prev = load_csv(f, key="Id")

# Load JSON file, using a specific key
with open("two.json", encoding="utf-8") as f:
    curr = load_json(f, key="Id")
```
```python
from db_diff import compare

# Compare two datasets (dictionaries keyed by your chosen column)
diff = compare(prev, curr, show_unchanged=False)
```
- `show_unchanged`: If True, includes unchanged fields for changed rows.
- `fields`: Pass a set of field names to only compare those fields.
- `ignorefields`: Pass a set of field names to ignore during comparison.
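For example, assuming `compare` accepts the `fields` and `ignorefields` keyword arguments listed above (the column names are only placeholders):
```python
from db_diff import compare

# Compare only the "name" and "age" columns; other columns are ignored.
diff = compare(prev, curr, fields={"name", "age"})

# Or compare everything except an audit column.
diff = compare(prev, curr, ignorefields={"LastModifiedDate"})
```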
```python
# For normal streaming mode (memory efficient)
from db_diff import streaming_compare_csv
diff = streaming_compare_csv(
    "one.csv",
    "two.csv",
    key="Id",
    compare_columns={"Id", "name", "age"},
    encoding="utf-8",
    dialect="excel"
)
# For sorted key streaming mode (faster, but requires sorted keys)
from db_diff import streaming_compare_csv_sorted
diff = streaming_compare_csv_sorted(
    "one.csv",
    "two.csv",
    key="Id",
    compare_columns={"Id", "name", "age"},
    encoding="utf-8",
    dialect="excel"
)
# Check if a file's key column is sorted
from db_diff import is_key_column_sorted
is_sorted = is_key_column_sorted(
    "one.csv",
    key="Id",
    encoding="utf-8",
    dialect="excel"
)
```
The result of `compare` or `streaming_compare_csv` is a dictionary:
```python
{
    "added": [ ... ],            # List of added rows (dicts)
    "removed": [ ... ],          # List of removed rows (dicts)
    "changed": [                 # List of changed rows
        {
            "key": "row_id",
            "changes": {
                "field1": ["old", "new"],
                ...
            },
            "unchanged": { ... } # (optional) if show_unchanged=True
        },
        ...
    ],
    "columns_added": [ ... ],    # List of columns added
    "columns_removed": [ ... ]   # List of columns removed
}
```
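Given that structure, the result can be post-processed directly; a minimal sketch assuming only the keys shown above:
```python
# Walk the diff dictionary returned by compare()/streaming_compare_csv()
for row in diff["added"]:
    print("added:", row)
for row in diff["removed"]:
    print("removed:", row)
for change in diff["changed"]:
    print("changed:", change["key"])
    for field, (old, new) in change["changes"].items():
        print(f"  {field}: {old!r} -> {new!r}")
```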
Render the diff as human-readable text:
```python
from db_diff import human_text

print(human_text(diff, key="Id", current=curr))
```
A complete example:
```python
from db_diff import load_csv, compare, human_text

with open("one.csv") as f1, open("two.csv") as f2:
    prev = load_csv(f1, key="Id")
    curr = load_csv(f2, key="Id")
    diff = compare(prev, curr, show_unchanged=True)
    print(human_text(diff, key="Id", current=curr))
```
See all available options with:
```
db-diff --help
```
A summary of key options:
| Option | Description |
|---|---|
| `--key TEXT` | Column to use as a unique ID for each row (default: first column header) |
| `--output TEXT` | Output format: `readable`, `json`, or `jsonfile` (default: `readable`) |
| `--outfilename FILE` | File to write JSON output to (used with `--output=jsonfile`) |
| `--outfilepath DIR` | Directory to save the output file (used with `--output=jsonfile`) |
| `--timestamp` | Add timestamp to filename (YYYY-MM-DD_HH-MM-SS_diffs_.json) |
| `--fields TEXT` | Comma-separated list of fields to compare (all others ignored) |
| `--ignorefields TEXT` | Comma-separated list of fields to ignore during comparison |
| `--showunchanged` | Show all fields for changed records, not just changed fields |
| `--format TEXT` | Explicitly specify input format: `csv`, `tsv`, or `json` (default: auto-detect) |
| `--encoding TEXT` | Input file encoding (default: utf-8) |
| `--streaming=MODE` | Streaming mode for CSV/TSV files. Options: `auto` (default), `normal`, `sortedkey`, `off` |
| `--listfields` | List available fields/columns in the input files and exit |
| `--verbose` | Enable verbose output with detailed logging |
| `--samplesize NUM` | Number of rows to sample when checking if key columns are sorted (default: 1000) |
| `--version` | Show the version and exit |
| `-h, --help` | Show help message and exit |
Show unchanged fields for changed rows:
```
db-diff one.csv two.csv --key=Id --showunchanged
```
Output as JSON:
```
db-diff one.csv two.csv --key=Id --output=json
```
Save JSON output to a file:
```
db-diff one.csv two.csv --key=Id --output=jsonfile --outfilename=diffs.json
```
Save JSON output with timestamp in filename:
```
db-diff one.csv two.csv --key=Id --output=jsonfile --timestamp
```
Compare only specific fields:
```
db-diff one.csv two.csv --key=Id --fields=Id,name
```
Ignore specific fields:
```
db-diff one.csv two.csv --key=Id --ignorefields=LastModifiedDate
```
Streaming mode options for large files:
```
# Auto mode (default) - automatically selects the best streaming mode
db-diff large1.csv large2.csv --key=Id --streaming=auto

# Normal streaming mode - memory efficient but slower
db-diff large1.csv large2.csv --key=Id --streaming=normal

# Sorted key streaming mode - very fast for files with sorted keys
db-diff large1.csv large2.csv --key=Id --streaming=sortedkey

# Disable streaming regardless of file size
db-diff large1.csv large2.csv --key=Id --streaming=off

# Adjust the sample size for key column sorting detection
db-diff large1.csv large2.csv --key=Id --samplesize=5000
```
The streaming modes work with CSV/TSV files only. The auto mode will automatically detect if the key columns are sorted and select the most efficient streaming mode for your data.
Verbose output with detailed logging:
```
db-diff one.csv two.csv --key=Id --verbose
```
- Choose the appropriate streaming mode for your data (see the sketch after this list):
  - `auto` (default): Automatically selects the best mode based on file properties
  - `sortedkey`: Fastest streaming mode, but requires key columns to be sorted in both files
  - `normal`: Memory-efficient but slower, good for very large unsorted files
  - `off`: Use when you want to process the entire files in memory regardless of size
- The tool automatically detects if key columns are sorted and selects the most efficient mode
- For very large files, the sorted key streaming mode can be 2-5x faster than normal streaming
- Streaming modes read files in chunks, so they use significantly less memory than loading entire files
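From the library, the same choice can be scripted with `is_key_column_sorted`; a minimal sketch mirroring the calls documented above (file names and the column set are only illustrative):
```python
from db_diff import (
    is_key_column_sorted,
    streaming_compare_csv,
    streaming_compare_csv_sorted,
)

prev_path, curr_path = "large1.csv", "large2.csv"  # illustrative file names
columns = {"Id", "name", "age"}                    # illustrative column set

# Use the fast sorted-key mode only when both files have sorted key columns;
# otherwise fall back to the memory-efficient normal streaming mode.
if (is_key_column_sorted(prev_path, key="Id", encoding="utf-8", dialect="excel")
        and is_key_column_sorted(curr_path, key="Id", encoding="utf-8", dialect="excel")):
    diff = streaming_compare_csv_sorted(prev_path, curr_path, key="Id",
                                        compare_columns=columns,
                                        encoding="utf-8", dialect="excel")
else:
    diff = streaming_compare_csv(prev_path, curr_path, key="Id",
                                 compare_columns=columns,
                                 encoding="utf-8", dialect="excel")
```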
Author: Thomas Coyle
Repository: https://github.com/datsom1/db-diff