# ValidateX

A powerful, extensible data quality validation framework for Python.

Badges represent (from left to right): CI/CD Build Status, Code Coverage, Test Count, Latest PyPI Release, Supported Python Versions, License, and Code Style.
ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas and PySpark DataFrames. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.
- Report Preview
- Why ValidateX?
- Who Is This For?
- Features
- Installation
- Quick Start
- Automate with CI/CD
- Data Quality Score
- Available Expectations
- Roadmap
## Report Preview

- Column Health Summary with mini bar charts
- Severity-tagged Expectations with human-readable output
## Why ValidateX?

| Feature | ValidateX | Great Expectations |
|---|---|---|
| Setup | `pip install` → validate in 5 lines | Multi-step setup with contexts & stores |
| API | Fluent, chainable Python API | Heavy config system |
| Severity levels | ✅ (Critical, Warning, Info) | ❌ |
| Quality score | ✅ (Weighted 0–100) | ❌ |
| Auto-suggest expectations | ✅ | ❌ |
| Reports | Modern dark-theme HTML with minicharts | Basic data docs |
| Output Data Types | Clean native Python types | NumPy types leak into JSON |
| PySpark Support | ✅ | ✅ |
| Polars Support | Soon | ❌ |
| CI/CD-friendly CLI | ✅ | ❌ |
| Downloads | JSON / CSV / clipboard built into report | Separate export |
| Learning curve | Minutes | Hours to days |

ValidateX is not a replacement for Great Expectations – it's a focused alternative for teams that want production-grade data validation without the overhead.
## Who Is This For?

- Startup data teams: ship data quality checks in minutes, not days
- ML engineers: validate feature stores and training data before model runs
- CI/CD pipelines: gate deployments on data quality with a single CLI command
- Analytics teams: catch data issues before they reach dashboards
- dbt users: lightweight validation alongside your transformation layer
- Data platform teams: monitor data quality across dozens of tables
## Features

| Feature | Description |
|---|---|
| 25+ Built-in Expectations | Column-level, table-level, and aggregate validations |
| Dual Engine Support | Pandas and PySpark execution engines |
| Data Quality Score | Weighted score (0–100) based on severity of checks |
| 🔴🟡🔵 Severity Levels | Critical / Warning / Info classification for every expectation |
| Column Health Summary | At-a-glance per-column health with mini bar charts |
| Modern HTML Reports | Stunning, self-contained dark-theme reports with animations |
| Download Buttons | Export reports as JSON, CSV, or copy summary to clipboard |
| Drift Detection | Track changes between validation runs |
| Data Profiling | Auto-analyse datasets and suggest expectations |
| YAML/JSON Config | Define expectations declaratively |
| CLI Interface | Run validations from the command line |
| Checkpoint System | Tie data sources and suites together |
| Extensible | Create custom expectations with the registry pattern |
| Clean Output | All values are native Python types – zero NumPy leakage |
## Installation

```bash
# Basic install
pip install validatex

# With PySpark support
pip install "validatex[spark]"

# With database support
pip install "validatex[database]"

# Full install
pip install "validatex[all]"

# Development
pip install "validatex[dev]"
```

## Quick Start

```python
import pandas as pd
import validatex as vx

# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})

# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)

# Validate
result = vx.validate(df, suite)

# Print summary (includes Quality Score)
print(result.summary())

# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")
```

### CLI

```bash
# Initialize a project
validatex init

# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml

# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html

# Run checkpoint
validatex run --checkpoint checkpoint.yaml

# List available expectations
validatex list-expectations
```

## Automate with CI/CD

ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.
**Example: GitHub Actions**

```yaml
name: Data Quality Validation
on: [push, pull_request]

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install ValidateX
        run: pip install validatex

      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html

      - name: Archive production artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html
```

## Data Quality Score

ValidateX computes a weighted quality score (0–100) based on the severity of each expectation:
| Severity | Weight | Example Expectations |
|---|---|---|
| 🔴 Critical | ×3 | Null checks, uniqueness, column existence, row count |
| 🟡 Warning | ×2 | Range checks, set membership, regex, type checks |
| 🔵 Info | ×1 | Mean/stdev bounds, string lengths, distinct values |

Formula: `Score = 100 × (weighted_passed / weighted_total)`

A critical failure impacts the score 3× more than an info-level check. This gives decision-makers a single number to assess data health.
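The weighting scheme is easy to reproduce in plain Python. The sketch below is illustrative only – the `SEVERITY_WEIGHTS` mapping and the `(severity, passed)` tuples are assumptions for the example, not ValidateX's internal data structures:

```python
# Illustrative sketch of the weighted score formula above.
# The weights mirror the severity table: critical x3, warning x2, info x1.
SEVERITY_WEIGHTS = {"critical": 3, "warning": 2, "info": 1}

def quality_score(results):
    """results: iterable of (severity, passed) pairs."""
    weighted_total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    weighted_passed = sum(SEVERITY_WEIGHTS[sev] for sev, ok in results if ok)
    if weighted_total == 0:
        return 100.0
    return round(100 * weighted_passed / weighted_total, 1)

# One critical failure among four checks costs 3 of 7 weighted points:
checks = [("critical", False), ("warning", True), ("info", True), ("info", True)]
print(quality_score(checks))  # 57.1
```

Note how a single critical failure drags the score well below what a raw 3/4 pass rate (75%) would suggest.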
```python
result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")
```

Override the default severity on any expectation via `meta`:
```yaml
expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical  # Override default "info" → "critical"
```

## Column Health Summary

The HTML report includes a Column Health Summary that aggregates all expectations per column:
| Column | Checks | Passed | Failed | Health | Null % | Unique % |
|---|---|---|---|---|---|---|
| user_id | 3 | 3 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| … | 4 | 4 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| status | 1 | 1 | 0 | 100% ███ | — | — |

Each metric includes a mini CSS bar chart for instant visual scanning.
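The Null % and Unique % columns are standard pandas aggregates. A minimal sketch of those two metrics (the `column_metrics` helper is hypothetical, and measuring uniqueness over non-null values is an assumption, not necessarily how the report computes it):

```python
import pandas as pd

def column_metrics(df: pd.DataFrame, column: str) -> dict:
    """Compute Null % and Unique % as shown in the Column Health Summary."""
    s = df[column]
    n = len(s)
    null_pct = round(100 * float(s.isna().sum()) / n, 1) if n else 0.0
    # Assumption: Unique % is measured over non-null values only.
    non_null = s.dropna()
    k = len(non_null)
    unique_pct = round(100 * non_null.nunique() / k, 1) if k else 0.0
    return {"null_pct": null_pct, "unique_pct": unique_pct}

df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5]})
print(column_metrics(df, "user_id"))  # {'null_pct': 0.0, 'unique_pct': 100.0}
```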
```python
for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")
```

## Available Expectations

### Column Expectations

| Expectation | Severity | Description |
|---|---|---|
| `expect_column_to_exist` | 🔴 Critical | Column exists in DataFrame |
| `expect_column_to_not_be_null` | 🔴 Critical | No null values |
| `expect_column_values_to_be_unique` | 🔴 Critical | All values unique |
| `expect_column_values_to_be_between` | 🟡 Warning | Values within range |
| `expect_column_values_to_be_in_set` | 🟡 Warning | Values in allowed set |
| `expect_column_values_to_not_be_in_set` | 🟡 Warning | Values not in forbidden set |
| `expect_column_values_to_match_regex` | 🟡 Warning | Values match regex pattern |
| `expect_column_values_to_be_of_type` | 🟡 Warning | Column dtype matches |
| `expect_column_values_to_be_dateutil_parseable` | 🟡 Warning | Values parseable as dates |
| `expect_column_value_lengths_to_be_between` | 🔵 Info | String lengths within range |
| `expect_column_max_to_be_between` | 🔵 Info | Column max within bounds |
| `expect_column_min_to_be_between` | 🔵 Info | Column min within bounds |
| `expect_column_mean_to_be_between` | 🔵 Info | Column mean within bounds |
| `expect_column_stdev_to_be_between` | 🔵 Info | Column std dev within bounds |
| `expect_column_distinct_values_to_be_in_set` | 🔵 Info | All distinct values in set |
| `expect_column_proportion_of_unique_values_to_be_between` | 🔵 Info | Uniqueness ratio in range |
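To make the semantics concrete, a check like `expect_column_values_to_be_between` boils down to a boolean mask over the column. This is a sketch of the check's meaning in plain pandas, not ValidateX's implementation:

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 28, 42, 160])

# Semantics of expect_column_values_to_be_between(min_value=0, max_value=150):
in_range = ages.between(0, 150)   # inclusive on both ends
unexpected = ages[~in_range]      # the values a failure report would surface

print(bool(in_range.all()))   # False: one value is out of range
print(unexpected.tolist())    # [160]
```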
### Table Expectations

| Expectation | Severity | Description |
|---|---|---|
| `expect_table_row_count_to_equal` | 🔴 Critical | Exact row count |
| `expect_table_row_count_to_be_between` | 🔴 Critical | Row count in range |
| `expect_table_columns_to_match_ordered_list` | 🔴 Critical | Column order matches |
| `expect_table_columns_to_match_set` | 🔴 Critical | Column names match (unordered) |
| `expect_table_column_count_to_equal` | 🔴 Critical | Exact column count |
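Table-level checks compare DataFrame metadata rather than values. For instance, `expect_table_columns_to_match_set` reduces to an unordered set comparison (illustrative sketch only, not the library's code):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1], "name": ["Alice"], "age": [25]})

# Semantics of expect_table_columns_to_match_set (unordered comparison):
expected_columns = {"user_id", "name", "age"}
actual_columns = set(df.columns)

print(actual_columns == expected_columns)   # True
print(actual_columns - expected_columns)    # set() -> no unexpected columns
```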
### Aggregate Expectations

| Expectation | Severity | Description |
|---|---|---|
| `expect_column_pair_values_a_to_be_greater_than_b` | 🟡 Warning | Column A > Column B |
| `expect_column_pair_values_to_be_equal` | 🟡 Warning | Two columns equal |
| `expect_multicolumn_sum_to_equal` | 🟡 Warning | Row-wise sum equals target |
| `expect_compound_columns_to_be_unique` | 🔴 Critical | Compound key uniqueness |
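Aggregate checks operate across columns, row by row. `expect_multicolumn_sum_to_equal`, for example, can be pictured like this (a pandas sketch of the semantics; the column names and target are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "q1": [10, 20, 30],
    "q2": [40, 30, 20],
    "q3": [50, 50, 50],
})

# Semantics of expect_multicolumn_sum_to_equal(columns=["q1","q2","q3"], sum_total=100):
row_sums = df[["q1", "q2", "q3"]].sum(axis=1)

print(row_sums.tolist())              # [100, 100, 100]
print(bool((row_sums == 100).all()))  # True
```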
## Data Profiling

```python
import pandas as pd
from validatex import DataProfiler

df = pd.read_csv("data.csv")
profiler = DataProfiler()

# Profile
profile = profiler.profile(df)
print(profile.summary())

# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")
```

## YAML/JSON Configuration

```yaml
suite_name: my_data_quality
meta:
  description: "Quality checks for production data"
expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical
  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150
  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]
```

## Project Structure

```
validatex/
├── core/
│   ├── expectation.py              # Base class + registry
│   ├── result.py                   # ValidationResult, QualityScore, Severity, ColumnHealth
│   ├── suite.py                    # ExpectationSuite (fluent API)
│   └── validator.py                # Validation orchestrator
├── expectations/
│   ├── column_expectations.py      # 16 column-level checks
│   ├── table_expectations.py       # 5 table-level checks
│   └── aggregate_expectations.py   # 4 cross-column checks
├── datasources/
│   ├── csv_source.py               # CSV files
│   ├── parquet_source.py           # Parquet files
│   ├── database_source.py          # SQL databases (SQLAlchemy)
│   └── dataframe_source.py         # Direct DataFrames
├── profiler/
│   └── profiler.py                 # Auto-profiling & suggestion engine
├── reporting/
│   ├── html_report.py              # Production HTML reports
│   └── json_report.py              # JSON reports
├── config/
│   └── loader.py                   # YAML/JSON config loading
└── cli/
    └── main.py                     # CLI (validate, run, profile, init, list-expectations)
```
## Testing

```bash
# Run all tests (66 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html

# Unit tests only
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v
```

## Custom Expectations

```python
from dataclasses import dataclass, field

from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult

@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""

    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )

    def _validate_pandas(self, df) -> ExpectationResult:
        series = df[self.column].dropna()
        total = len(series)
        negative_mask = series <= 0
        unexpected_count = int(negative_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0
        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            unexpected_values=series[negative_mask].tolist()[:20],
        )
```

## Clean Output

ValidateX converts all internal types to native Python before rendering. You'll never see `np.int64(20)` in reports or JSON – only a clean `20`.
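One common way to implement this kind of sanitization is a recursive conversion that calls `.item()` on NumPy scalars. The `to_native` helper below is an illustrative sketch of the idea, not ValidateX's actual code:

```python
import numpy as np

def to_native(value):
    """Recursively convert NumPy scalars/arrays to native Python types."""
    if isinstance(value, np.generic):        # np.int64, np.float64, np.bool_, ...
        return value.item()
    if isinstance(value, np.ndarray):
        return [to_native(v) for v in value]
    if isinstance(value, dict):
        return {k: to_native(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(v) for v in value]
    return value

raw = {"min": np.int64(20), "max": np.int64(69), "values": np.array([1, 2])}
print(to_native(raw))  # {'min': 20, 'max': 69, 'values': [1, 2]}
```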
```python
result = vx.validate(df, suite)
data = result.to_dict()

# Observed values are always clean:
# {'min': 20, 'max': 69}       – NOT {'min': np.int64(20), ...}
# "Unique: 100/100 (100.0%)"   – NOT "100 unique out of 100"
# "Distinct values: 3"         – NOT "{'unique_values': 3}"
```

## Roadmap

Completed:

- 25+ built-in expectations (column, table, aggregate)
- Pandas + PySpark dual-engine support
- Severity modeling (Critical / Warning / Info)
- Weighted data quality score (0β100)
- Column health summary with mini charts
- Modern HTML reports with dark theme
- Download buttons (JSON, CSV, clipboard)
- Drift detection foundation
- Data profiler with auto-suggestion
- CLI with validate, profile, run, init commands
- YAML/JSON declarative configuration
- Native Python type sanitization
Planned:

- Slack / Teams notifications on failure
- GitHub Action template for CI/CD
- Polars engine support
- Baseline history tracking & trend charts
- Anomaly detection expectations
- Great Expectations suite import/migration
- Web dashboard for multi-dataset monitoring
- dbt integration plugin
## Versioning

ValidateX follows Semantic Versioning:

- MAJOR version for incompatible API changes
- MINOR version for backwards-compatible new functionality
- PATCH version for backwards-compatible bug fixes
## License

MIT License

Built with ❤️ by the ValidateX Team. If this project helps you, consider giving it a ⭐!