DataTypical

Scientific Data Significance Rankings with Shapley Explanations

Python 3.8+ | License: MIT

DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like), with Shapley value explanations revealing why instances matter and which ones create your dataset's structure.


Key Features

Three Significance Types: Archetypal, prototypical, and stereotypical, all computed simultaneously
Shapley Explanations: Feature-level attributions for why samples are significant
Formative Discovery: Distinguish samples that ARE significant from those that CREATE structure
Publication Visualizations: Dual-perspective scatter plots, heatmaps, and profile plots
Multi-Modal Support: Tabular data, text, and graph networks through unified API
Performance Optimized: Fast exploration mode and efficient Shapley computation


Quick Start

Installation (PyPI version)

pip install datatypical

Basic Usage

from datatypical import DataTypical
from datatypical_viz import significance_plot, heatmap, profile_plot
import pandas as pd

# Load your data
data = pd.read_csv('your_data.csv')

# Analyze with explanations
dt = DataTypical(shapley_mode=True)
results = dt.fit_transform(data)

# Three significance perspectives (0-1 normalized ranks)
print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])

# Visualize: which samples are critical vs replaceable?
significance_plot(results, significance='archetypal')

# Understand: which features drive significance?
heatmap(dt, results, significance='archetypal', top_n=20)

# Explain: why is this sample significant?
top_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, top_idx, significance='archetypal')

What DataTypical Does

Three Complementary Lenses

Lens           Finds                            Use Cases
Archetypal     Extreme, boundary samples        Edge case discovery, outlier detection, range understanding
Prototypical   Representative, central samples  Dataset summarization, cluster centers, typical examples
Stereotypical  Target-similar samples           Optimization, goal-oriented selection, phenotype matching

The Power: All three computed simultaneously—different perspectives reveal different insights.
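For example, the three perspectives can be compared directly on the results DataFrame from the Quick Start (a minimal sketch; top_k is an arbitrary choice):

# Compare the three lenses side by side on the Quick Start results
top_k = 10

top_archetypes  = results.nlargest(top_k, 'archetypal_rank')     # extreme / boundary samples
top_prototypes  = results.nlargest(top_k, 'prototypical_rank')   # representative samples
top_stereotypes = results.nlargest(top_k, 'stereotypical_rank')  # target-like samples

# Samples that score highly under more than one lens are often of special interest
overlap = set(top_archetypes.index) & set(top_prototypes.index)
print(f"Samples that are both extreme and representative: {sorted(overlap)}")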

Dual Perspective (with Shapley)

When shapley_mode=True, DataTypical reveals two views:

Actual Significance (*_rank): Samples that ARE significant
Formative Significance (*_shapley_rank): Samples that CREATE the structure

Four Quadrants:

     Formative High
          │
  Gap     │  Critical
  Fillers │  (irreplaceable)
──────────┼──────────────── Actual High
Redundant │ Replaceable
          │  (keep one)
     Formative Low

This distinction—between what IS significant vs what CREATES structure—is a genuinely novel contribution.
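As an illustration, the four quadrants can be recovered from the rank columns by simple thresholding. This is a sketch only: the quadrant_labels helper is not part of the DataTypical API, the 0.7/0.3 cut-offs are arbitrary, and the *_shapley_rank columns require shapley_mode=True.

import numpy as np

def quadrant_labels(results, significance='archetypal', hi=0.7, lo=0.3):
    """Label each sample by its quadrant for one significance type (illustrative)."""
    actual = results[f'{significance}_rank']
    formative = results[f'{significance}_shapley_rank']
    return np.select(
        [
            (actual >= hi) & (formative >= hi),  # significant AND structure-creating
            (actual >= hi) & (formative <= lo),  # significant but interchangeable
            (actual <= lo) & (formative >= hi),  # unremarkable but structurally needed
        ],
        ['critical', 'replaceable', 'gap_filler'],
        default='redundant',  # everything else, including mid-range samples
    )

results['archetypal_quadrant'] = quadrant_labels(results, 'archetypal')
print(results['archetypal_quadrant'].value_counts())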


Example: Drug Discovery

# Analyze compound library
dt = DataTypical(
    shapley_mode=True,
    stereotype_column='activity',  # Target property
    fast_mode=False
)
results = dt.fit_transform(compounds)

# Find critical compounds (high actual + high formative)
critical = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] > 0.8)
]
print(f"Found {len(critical)} critical compounds")

# Find redundant compounds (high actual + low formative)
redundant = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] < 0.3)
]
print(f"Found {len(redundant)} replaceable compounds")

# Understand alternative mechanisms
for idx in critical.index:
    profile_plot(dt, idx, significance='stereotypical')
    # Each shows different feature pattern → different mechanism

Discovery: Multiple structural pathways to high activity!


Performance

Speed Benchmarks

Dataset Size     Without Shapley   With Shapley
1,000 samples    ~5 seconds        ~5 minutes
10,000 samples   ~30 seconds       ~60 minutes

Optimization Strategy

Phase 1: Fast exploration (fast_mode=True, no Shapley)
↓ Identify interesting samples
Phase 2: Detailed analysis (shapley_mode=True, subset to interesting samples)
↓ Generate explanations and publication figures
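
A possible realisation of this two-phase workflow, assuming the results DataFrame preserves the index of the input data (the 0.8 threshold is an arbitrary choice):

# Phase 1: fast exploration over the full dataset, no Shapley explanations
dt_fast = DataTypical(fast_mode=True, shapley_mode=False)
overview = dt_fast.fit_transform(data)

# Keep only the samples worth explaining in detail
interesting = overview[
    (overview['archetypal_rank'] > 0.8) | (overview['prototypical_rank'] > 0.8)
]

# Phase 2: detailed analysis with explanations on the reduced subset
dt_full = DataTypical(shapley_mode=True, fast_mode=False, random_state=0)
detailed = dt_full.fit_transform(data.loc[interesting.index])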


Key Parameters

DataTypical(
    # Enable explanations and formative analysis
    shapley_mode=False,           # True for explanations
    
    # Speed vs accuracy
    fast_mode=True,               # False for publication quality
    
    # Significance types
    n_archetypes=8,               # Number of extreme corners
    n_prototypes=8,               # Number of representatives
    stereotype_column=None,       # Target column for stereotypical
    
    # Shapley optimization
    shapley_top_n=500,            # Limit explanations to top N
    shapley_n_permutations=100,   # Number of permutations (30 in fast_mode)
    
    # Reproducibility
    random_state=None,            # Set for reproducible results
    
    # Memory management
    max_memory_mb=8000            # Memory limit for operations
)
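
For example, a publication-quality run might combine these parameters as follows (values are illustrative, and 'activity' stands in for whatever target column your data provides):

# Publication-quality configuration: explanations on, fast mode off, fixed seed
dt = DataTypical(
    shapley_mode=True,
    fast_mode=False,
    n_archetypes=8,
    n_prototypes=8,
    stereotype_column='activity',   # hypothetical target column
    shapley_top_n=500,
    shapley_n_permutations=100,
    random_state=42,
)
results = dt.fit_transform(data)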

Visualization

Three Core Plots

from datatypical_viz import significance_plot, heatmap, profile_plot

# 1. Overview: Actual vs Formative scatter
significance_plot(results, significance='archetypal')

# 2. Feature patterns: Which features matter?
heatmap(dt, results, 
        significance='archetypal',
        order='actual',  # or 'formative'
        top_n=20)

# 3. Individual explanation: Why is this sample significant?
profile_plot(dt, sample_idx, 
             significance='archetypal',
             order='local')  # or 'global'

See VISUALIZATION_GUIDE.md for detailed interpretation.
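
To export figures for a manuscript, the standard Matplotlib interface can be used after each call, assuming the plotting functions draw on the current pyplot figure (adapt if they return Figure objects instead):

import matplotlib.pyplot as plt

# Assumes each viz function renders onto the current pyplot figure
significance_plot(results, significance='archetypal')
plt.savefig('archetypal_significance.png', dpi=300, bbox_inches='tight')

heatmap(dt, results, significance='archetypal', top_n=20)
plt.savefig('archetypal_heatmap.png', dpi=300, bbox_inches='tight')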


Multi-Modal Support

Tabular Data (Default)

df = pd.DataFrame(...)
dt = DataTypical()
results = dt.fit_transform(df)

Text Data (Auto-Detected)

texts = ["document 1", "document 2", ...]
dt = DataTypical()
results = dt.fit_transform(texts)

Graph Networks (Protein Interactions, Molecules)

node_features = pd.DataFrame(...)
edges = [(0, 1), (1, 2), ...]
dt = DataTypical()
results = dt.fit_transform(node_features, edges=edges)

Use Cases

Scientific Discovery

  • Alternative mechanisms: Formative instances reveal different pathways
  • Boundary definition: Which samples define system limits
  • Quality control: Distinguish novel variation from known patterns
  • Coverage analysis: Identify sampling gaps

Dataset Curation

  • Size reduction: Remove redundant samples while preserving diversity (see the sketch after this list)
  • Representative selection: Choose samples spanning full space
  • Redundancy detection: Find clusters of similar samples
  • Gap identification: Locate undersampled regions
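
A minimal size-reduction sketch, assuming shapley_mode=True (so the *_shapley_rank columns exist) and a shared index between results and data; the 0.5 thresholds are arbitrary:

# Keep samples that are significant or structure-creating under either lens;
# everything else is treated as redundant for curation purposes
keep_mask = (
    (results['archetypal_rank'] > 0.5) |
    (results['prototypical_rank'] > 0.5) |
    (results['archetypal_shapley_rank'] > 0.5) |
    (results['prototypical_shapley_rank'] > 0.5)
)
curated = data.loc[results.index[keep_mask]]
print(f"Curated {len(data)} samples down to {len(curated)}")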

Model Understanding

  • Feature importance: Global and local significance patterns
  • Individual explanations: Why specific samples matter
  • Pattern recognition: Discover multiple pathways to outcomes
  • Interpretability: Explanations in original feature space

Documentation

New Users: START_HERE.md

Visualization: VISUALIZATION_GUIDE.md

Advanced: see the Quick Links section below


Requirements

  • Python ≥ 3.8
  • NumPy ≥ 1.20
  • Pandas ≥ 1.3
  • SciPy ≥ 1.7
  • scikit-learn ≥ 1.0
  • Matplotlib ≥ 3.3
  • Seaborn ≥ 0.11
  • Numba ≥ 0.55 (for performance)

Citation

If you use DataTypical in your research, please cite:

@software{datatypical2026,
  author = {Barnard, Amanda S.},
  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
  year = {2026},
  url = {https://github.com/amaxiom/DataTypical},
  version = {0.7}
}

What Makes DataTypical Different

From Traditional Methods

Outlier Detection: Only finds extremes → DataTypical finds extremes AND explains why

Clustering: Groups samples, picks centroids → DataTypical finds representatives maximizing coverage

Feature Selection: Ranks features → DataTypical explains which features matter for which samples

PCA/t-SNE: Projects to low dimensions → DataTypical maintains interpretability in original space

The Novel Contribution

Formative instances are genuinely new. The distinction between samples that ARE significant vs samples that CREATE structure emerges from the Shapley mechanism and enables:

  • Redundancy detection even among significant samples
  • Finding structurally important but non-extreme samples
  • Understanding irreplaceable vs interchangeable samples
  • Quality control based on structural contribution

This dual perspective transforms instance significance from pure ranking into causal understanding.


Development Status

Current Version: 0.7

Recent Updates:

  • Simplified visualization API (removed mode confusion)
  • Always-global feature ordering in heatmaps
  • Cleaned output (only rank columns)
  • Publication-ready boxed heatmaps
  • Improved memory management

Stability: Production-ready for research use


License

MIT License - See LICENSE for details.

Copyright (c) 2025 Amanda S. Barnard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.


Support

For questions, bug reports, and feature requests, use the Report Issues and Discussions links under Quick Links below.

Acknowledgments

DataTypical builds on foundational work in:

  • Archetypal analysis (Cutler & Breiman, 1994)
  • Facility location optimization (Nemhauser et al., 1978)
  • Shapley value theory (Shapley, 1953)
  • PCHA optimization (Mørup & Hansen, 2012)

Special thanks to the scientific Python community.


Quick Links

Documentation
Quick Start
Examples
Visualization Guide
Report Issues
Discussions


Ready to explore your data?

pip install datatypical

Then see START_HERE.md for your first analysis!