ValidateX is a lightweight, extensible data quality validation framework for Python that helps ensure dataset accuracy, consistency, and reliability with automated reporting and quality scoring.


πŸš€ ValidateX

A powerful, extensible data quality validation framework for Python.

Badges (left to right): CI/CD Build Status, Code Coverage, Test Count, Latest PyPI Release, Supported Python Versions, MIT License, Code Style (black).

ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas and PySpark DataFrames. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.

πŸ–ΌοΈ Report Preview

ValidateX Report β€” Overview

Column Health Summary β€” per-column health with mini bar charts

Expectations Table β€” severity-tagged expectations with human-readable output


πŸ€” Why ValidateX?

| Feature | ValidateX | Great Expectations |
|---|---|---|
| Setup | pip install β†’ validate in 5 lines | Multi-step setup with contexts & stores |
| API | Fluent, chainable Python API | Heavy config system |
| Severity levels | βœ” (Critical, Warning, Info) | ❌ |
| Quality score | βœ” (Weighted 0–100) | ❌ |
| Auto-suggest expectations | βœ” | βœ” |
| Reports | Modern dark-theme HTML with mini charts | Basic data docs |
| Output data types | Clean native Python types | NumPy types leak into JSON |
| PySpark support | βœ” | βœ” |
| Polars support | Coming soon | βœ” |
| CI/CD-friendly CLI | βœ” | ❌ |
| Downloads | JSON / CSV / clipboard built into report | Separate export |
| Learning curve | Minutes | Hours to days |

ValidateX is not a replacement for Great Expectations β€” it's a focused alternative for teams that want production-grade data validation without the overhead.


🎯 Who Is This For?

  • Startup data teams β€” Ship data quality checks in minutes, not days
  • ML engineers β€” Validate feature stores and training data before model runs
  • CI/CD pipelines β€” Gate deployments on data quality with a single CLI command
  • Analytics teams β€” Catch data issues before they reach dashboards
  • dbt users β€” Lightweight validation alongside your transformation layer
  • Data platform teams β€” Monitor data quality across dozens of tables

✨ Features

| Feature | Description |
|---|---|
| 25+ Built-in Expectations | Column-level, table-level, and aggregate validations |
| Dual Engine Support | Pandas and PySpark execution engines |
| 🎯 Data Quality Score | Weighted score (0–100) based on the severity of each check |
| πŸ”΄πŸŸ‘πŸ”΅ Severity Levels | Critical / Warning / Info classification for every expectation |
| πŸ“Š Column Health Summary | At-a-glance per-column health with mini bar charts |
| Modern HTML Reports | Self-contained dark-theme reports with animations |
| πŸ“₯ Download Buttons | Export reports as JSON or CSV, or copy a summary to the clipboard |
| πŸ“ˆ Drift Detection | Track changes between validation runs |
| Data Profiling | Auto-analyse datasets and suggest expectations |
| YAML/JSON Config | Define expectations declaratively |
| CLI Interface | Run validations from the command line |
| Checkpoint System | Tie data sources and suites together |
| Extensible | Create custom expectations with the registry pattern |
| Clean Output | All values are native Python types β€” zero NumPy leakage |

πŸ“¦ Installation

```bash
# Basic install
pip install validatex

# With PySpark support
pip install "validatex[spark]"

# With database support
pip install "validatex[database]"

# Full install
pip install "validatex[all]"

# Development
pip install "validatex[dev]"
```

🏁 Quick Start

Python API

```python
import pandas as pd
import validatex as vx

# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})

# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)

# Validate
result = vx.validate(df, suite)

# Print summary (includes Quality Score)
print(result.summary())

# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")
```

CLI

```bash
# Initialize a project
validatex init

# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml

# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html

# Run checkpoint
validatex run --checkpoint checkpoint.yaml

# List available expectations
validatex list-expectations
```

πŸ€– Automate with CI/CD

ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.

Example: GitHub Actions

```yaml
name: Data Quality Validation
on: [push, pull_request]

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install ValidateX
        run: pip install validatex

      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html

      - name: Archive production artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html
```

🎯 Data Quality Score

ValidateX computes a weighted quality score (0–100) based on the severity of each expectation:

| Severity | Weight | Example Expectations |
|---|---|---|
| πŸ”΄ Critical | Γ—3 | Null checks, uniqueness, column existence, row count |
| 🟑 Warning | Γ—2 | Range checks, set membership, regex, type checks |
| πŸ”΅ Info | Γ—1 | Mean/stdev bounds, string lengths, distinct values |

Formula: Score = 100 Γ— (weighted_passed / weighted_total)

A critical failure impacts the score 3Γ— more than an info-level check. This gives decision-makers a single number to assess data health.

```python
result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")
```
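The weighting can be sketched in plain Python. This is an illustrative re-derivation of the documented formula, not ValidateX internals β€” `compute_quality_score()` above is the real API, and the helper below is hypothetical:

```python
# Illustrative sketch of the weighted-score formula (not ValidateX's code).
SEVERITY_WEIGHTS = {"critical": 3, "warning": 2, "info": 1}

def quality_score(results):
    """results: list of (severity, passed) tuples."""
    weighted_total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    weighted_passed = sum(SEVERITY_WEIGHTS[sev] for sev, ok in results if ok)
    # Score = 100 Γ— (weighted_passed / weighted_total)
    return 100.0 * weighted_passed / weighted_total if weighted_total else 100.0

checks = [("critical", True), ("critical", True), ("warning", False), ("info", True)]
print(round(quality_score(checks), 1))  # β†’ 77.8 (7 of 9 weighted points passed)
```

Note how the single failed warning costs 2 of the 9 weighted points, while a failed critical check would have cost 3.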

Custom Severity

Override the default severity on any expectation via meta:

```yaml
expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical   # Override default "info" β†’ "critical"
```

πŸ“Š Column Health Summary

The HTML report includes a Column Health Summary that aggregates all expectations per column:

| Column | Checks | Passed | Failed | Health | Null % | Unique % |
|---|---|---|---|---|---|---|
| user_id | 3 | 3 | 0 | 100% β–ˆβ–ˆβ–ˆ | 0.0% | 100.0% β–ˆβ–ˆβ–ˆ |
| email | 4 | 4 | 0 | 100% β–ˆβ–ˆβ–ˆ | 0.0% | 100.0% β–ˆβ–ˆβ–ˆ |
| status | 1 | 1 | 0 | 100% β–ˆβ–ˆβ–ˆ | β€” | β€” |

Each metric includes a mini CSS bar chart for instant visual scanning.

```python
for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")
```

πŸ“‹ Available Expectations

Column-Level (16)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_to_exist | πŸ”΄ Critical | Column exists in DataFrame |
| expect_column_to_not_be_null | πŸ”΄ Critical | No null values |
| expect_column_values_to_be_unique | πŸ”΄ Critical | All values unique |
| expect_column_values_to_be_between | 🟑 Warning | Values within range |
| expect_column_values_to_be_in_set | 🟑 Warning | Values in allowed set |
| expect_column_values_to_not_be_in_set | 🟑 Warning | Values not in forbidden set |
| expect_column_values_to_match_regex | 🟑 Warning | Values match regex pattern |
| expect_column_values_to_be_of_type | 🟑 Warning | Column dtype matches |
| expect_column_values_to_be_dateutil_parseable | 🟑 Warning | Values parseable as dates |
| expect_column_value_lengths_to_be_between | πŸ”΅ Info | String lengths within range |
| expect_column_max_to_be_between | πŸ”΅ Info | Column max within bounds |
| expect_column_min_to_be_between | πŸ”΅ Info | Column min within bounds |
| expect_column_mean_to_be_between | πŸ”΅ Info | Column mean within bounds |
| expect_column_stdev_to_be_between | πŸ”΅ Info | Column std dev within bounds |
| expect_column_distinct_values_to_be_in_set | πŸ”΅ Info | All distinct values in set |
| expect_column_proportion_of_unique_values_to_be_between | πŸ”΅ Info | Uniqueness ratio in range |

Table-Level (5)

| Expectation | Severity | Description |
|---|---|---|
| expect_table_row_count_to_equal | πŸ”΄ Critical | Exact row count |
| expect_table_row_count_to_be_between | πŸ”΄ Critical | Row count in range |
| expect_table_columns_to_match_ordered_list | πŸ”΄ Critical | Column order matches |
| expect_table_columns_to_match_set | πŸ”΄ Critical | Column names match (unordered) |
| expect_table_column_count_to_equal | πŸ”΄ Critical | Exact column count |

Aggregate / Cross-Column (4)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_pair_values_a_to_be_greater_than_b | 🟑 Warning | Column A > Column B |
| expect_column_pair_values_to_be_equal | 🟑 Warning | Two columns equal |
| expect_multicolumn_sum_to_equal | 🟑 Warning | Row-wise sum equals target |
| expect_compound_columns_to_be_unique | πŸ”΄ Critical | Compound key uniqueness |

πŸ“Š Data Profiling

```python
import pandas as pd
from validatex import DataProfiler

df = pd.read_csv("data.csv")
profiler = DataProfiler()

# Profile
profile = profiler.profile(df)
print(profile.summary())

# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")
```

πŸ”§ YAML Suite Configuration

```yaml
suite_name: my_data_quality
meta:
  description: "Quality checks for production data"

expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical

  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150

  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]
```
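Since suites can also be defined in JSON (see the YAML/JSON Config feature above), the same suite might look like this β€” the JSON schema is assumed here to mirror the YAML keys exactly:

```json
{
  "suite_name": "my_data_quality",
  "meta": {"description": "Quality checks for production data"},
  "expectations": [
    {
      "expectation_type": "expect_column_to_not_be_null",
      "column": "id",
      "meta": {"severity": "critical"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "column": "age",
      "kwargs": {"min_value": 0, "max_value": 150}
    }
  ]
}
```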

πŸ—οΈ Architecture

```text
validatex/
β”œβ”€β”€ core/
β”‚   β”œβ”€β”€ expectation.py     # Base class + registry
β”‚   β”œβ”€β”€ result.py          # ValidationResult, QualityScore, Severity, ColumnHealth
β”‚   β”œβ”€β”€ suite.py           # ExpectationSuite (fluent API)
β”‚   └── validator.py       # Validation orchestrator
β”œβ”€β”€ expectations/
β”‚   β”œβ”€β”€ column_expectations.py     # 16 column-level checks
β”‚   β”œβ”€β”€ table_expectations.py      # 5 table-level checks
β”‚   └── aggregate_expectations.py  # 4 cross-column checks
β”œβ”€β”€ datasources/
β”‚   β”œβ”€β”€ csv_source.py      # CSV files
β”‚   β”œβ”€β”€ parquet_source.py  # Parquet files
β”‚   β”œβ”€β”€ database_source.py # SQL databases (SQLAlchemy)
β”‚   └── dataframe_source.py # Direct DataFrames
β”œβ”€β”€ profiler/
β”‚   └── profiler.py        # Auto-profiling & suggestion engine
β”œβ”€β”€ reporting/
β”‚   β”œβ”€β”€ html_report.py     # Production HTML reports
β”‚   └── json_report.py     # JSON reports
β”œβ”€β”€ config/
β”‚   └── loader.py          # YAML/JSON config loading
└── cli/
    └── main.py            # CLI (validate, run, profile, init, list-expectations)
```

πŸ§ͺ Testing

```bash
# Run all tests (66 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html

# Unit tests only
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v
```

🀝 Creating Custom Expectations

```python
from dataclasses import dataclass, field
from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult

@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""

    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )

    def _validate_pandas(self, df) -> ExpectationResult:
        series = df[self.column].dropna()
        total = len(series)
        negative_mask = series <= 0
        unexpected_count = int(negative_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0

        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            unexpected_values=series[negative_mask].tolist()[:20],
        )
```

🧹 Clean Output

ValidateX converts all internal types to native Python before rendering. You'll never see np.int64(20) in reports or JSON β€” only clean 20.

```python
result = vx.validate(df, suite)
data = result.to_dict()

# Observed values are always clean:
# {'min': 20, 'max': 69}        ← NOT {'min': np.int64(20), ...}
# "Unique: 100/100 (100.0%)"    ← NOT "100 unique out of 100"
# "Distinct values: 3"          ← NOT "{'unique_values': 3}"
```
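One common way to achieve this β€” shown here as an illustrative sketch, not ValidateX's actual code β€” is to exploit the fact that NumPy scalars expose `.item()`, which returns the equivalent native Python value, and to recurse through containers:

```python
# Illustrative sanitizer: convert NumPy-style scalars to native Python types.
def to_native(value):
    # NumPy scalars (np.int64, np.float64, ...) all expose .item().
    if hasattr(value, "item") and not isinstance(value, (str, bytes)):
        return value.item()  # e.g. np.int64(20) -> 20
    if isinstance(value, dict):
        return {k: to_native(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(v) for v in value]
    return value
```

Duck-typing on `.item()` keeps the sanitizer free of a hard NumPy import, so it also works when the Pandas engine hands back plain Python values.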

πŸš€ Roadmap

  • 25+ built-in expectations (column, table, aggregate)
  • Pandas + PySpark dual-engine support
  • Severity modeling (Critical / Warning / Info)
  • Weighted data quality score (0–100)
  • Column health summary with mini charts
  • Modern HTML reports with dark theme
  • Download buttons (JSON, CSV, clipboard)
  • Drift detection foundation
  • Data profiler with auto-suggestion
  • CLI with validate, profile, run, init commands
  • YAML/JSON declarative configuration
  • Native Python type sanitization
  • Slack / Teams notifications on failure
  • GitHub Action template for CI/CD
  • Polars engine support
  • Baseline history tracking & trend charts
  • Anomaly detection expectations
  • Great Expectations suite import/migration
  • Web dashboard for multi-dataset monitoring
  • dbt integration plugin

Versioning

ValidateX follows Semantic Versioning.

  • MAJOR version for incompatible API changes
  • MINOR version for backwards-compatible new functionality
  • PATCH version for backwards-compatible bug fixes

πŸ“„ License

MIT License


Built with ❀️ by the ValidateX Team
If this project helps you, consider giving it a ⭐
