safedata is built as a modular, defense-in-depth filtering stack for training data safety.
Every stage emits structured audit events (JSON Lines) containing the input ID, detector outputs, actions taken, and confidence values.
Alignment rationale: models trained on filtered data should have traceable provenance. If alignment behaviors degrade, maintainers need evidence for why a sample was included, redacted, or removed.
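A minimal sketch of such an audit event, assuming a hypothetical `emit_audit_event` helper and illustrative field names (the real schema may differ):

```python
import json
import io

def emit_audit_event(stream, input_id, detector_outputs, action, confidence):
    """Append one structured audit event as a JSON line (JSONL)."""
    event = {
        "input_id": input_id,
        "detectors": detector_outputs,  # e.g. {"toxicity": 0.91, "pii": 0.02}
        "action": action,               # "included" | "redacted" | "removed"
        "confidence": confidence,
    }
    stream.write(json.dumps(event, sort_keys=True) + "\n")

# Usage: any writable text stream works (file handle, StringIO, ...).
buf = io.StringIO()
emit_audit_event(buf, "doc-0042", {"toxicity": 0.91, "pii": 0.02}, "removed", 0.91)
record = json.loads(buf.getvalue())
```

One event per line keeps the log append-only and greppable, so a maintainer can reconstruct why any given sample was removed.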
Classifiers output probabilities with confidence-calibration hooks (temperature scaling and reliability diagnostics) rather than hard labels alone.
Alignment rationale: uncertainty-aware filtering reduces overconfident failure modes and supports policy-level risk budgets.
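Temperature scaling itself is a one-parameter transform: divide the logit by a learned temperature T before the sigmoid, so T > 1 softens overconfident probabilities. A minimal sketch (the `calibrate` name is illustrative, not the library's API):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def calibrate(logit: float, temperature: float) -> float:
    """Temperature scaling: divide the logit by T before squashing.
    T > 1 pulls overconfident probabilities back toward 0.5;
    T = 1 leaves the classifier unchanged."""
    return sigmoid(logit / temperature)

raw = sigmoid(4.0)         # overconfident raw probability
cal = calibrate(4.0, 2.5)  # softened probability after scaling
```

T is typically fit on a held-out validation set by minimizing negative log-likelihood; reliability diagrams then check that predicted confidence tracks empirical accuracy.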
Red-team attack generators (character swaps, homoglyph substitutions, label-flip/backdoor perturbations) are first-class modules.
Alignment rationale: safety controls must be evaluated under adaptive pressure, not only clean benchmarks.
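The two simplest perturbations named above can be sketched in a few lines; the homoglyph table here is a tiny illustrative subset (Latin-to-Cyrillic lookalikes), not the module's full mapping:

```python
# Latin -> Cyrillic confusables; visually identical, different code points.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}

def swap_chars(text: str, i: int) -> str:
    """Adjacent-character swap at position i (a classic typo perturbation)."""
    s = list(text)
    s[i], s[i + 1] = s[i + 1], s[i]
    return "".join(s)

def homoglyph_sub(text: str) -> str:
    """Replace every mapped character with its visual lookalike."""
    return "".join(HOMOGLYPHS.get(c, c) for c in text)
```

Running detectors on both the clean and perturbed variants measures how much filter recall degrades under an adversary who controls surface form but not semantics.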
Bias evaluation components measure demographic representation ratios and counterfactual sensitivity, with optional Fairlearn metrics.
Alignment rationale: data filtering itself can introduce representational harms; fairness telemetry should be always-on, not an afterthought.
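Both metrics are simple in skeleton form. A hedged sketch, where `score_fn` stands in for any detector and the term-swap pairs are illustrative:

```python
from collections import Counter

def representation_ratio(groups: list[str]) -> float:
    """Ratio of least- to most-represented group in the surviving data
    (1.0 = perfectly balanced; small values flag filtering skew)."""
    counts = Counter(groups)
    return min(counts.values()) / max(counts.values())

def counterfactual_delta(score_fn, text: str, swaps: list[tuple[str, str]]) -> float:
    """Largest change in detector score when demographic terms are swapped.
    A well-behaved detector should score counterfactual pairs similarly."""
    base = score_fn(text)
    return max(abs(score_fn(text.replace(a, b)) - base) for a, b in swaps)
```

A large counterfactual delta means the detector keys on the demographic term itself rather than the behavior being filtered.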
The FilterPipeline composes cheap deterministic checks (blocklists, dedup) with statistical detectors (toxicity/PII/poisoning). No single detector acts as a sole gatekeeper.
Alignment rationale: single-point safety controls are brittle and easier to evade. Layered controls reduce correlated failures.
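The cascade can be sketched as below; the two-detector agreement rule is an illustrative policy, not necessarily the shipped default:

```python
def run_pipeline(sample, cheap_checks, detectors, threshold=0.8):
    """Layered filtering cascade.
    1) Cheap deterministic checks (blocklists, dedup) run first.
    2) Statistical detectors run only on survivors.
    No single statistical detector removes a sample alone: removal needs
    a deterministic hit or at least two detectors over threshold."""
    for check in cheap_checks:
        if check(sample):
            return "removed"
    flagged = [name for name, score in detectors.items()
               if score(sample) >= threshold]
    if len(flagged) >= 2:
        return "removed"
    if flagged:
        return "review"  # single flag escalates rather than deletes
    return "included"
```

Ordering cheap checks first keeps per-sample cost low, while the agreement rule makes correlated detector failure (or evasion of one detector) insufficient to flip a decision.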
- classifiers/: detector primitives with explicit outputs and confidence.
- filters/: orchestration/cascade and thresholding policy.
- evaluation/: calibration, robustness, fairness, and safety metrics.
- attacks/: red-team perturbation tooling.
- utils/: data loaders, configuration, and auditable logging.
- Optional heavy dependencies keep local development lightweight but produce variable detector quality across environments.
- Heuristic fallback logic improves operability but should be replaced with validated models for high-risk deployments.
- Online adaptation is intentionally excluded to avoid silent policy drift.
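The optional-dependency trade-off above typically looks like a try/except import with a heuristic fallback. A sketch, where `heavy_toxicity_model` and the blocklist terms are placeholders:

```python
def load_toxicity_detector():
    """Return a model-backed scorer when the heavy optional dependency is
    installed, otherwise a keyword heuristic. The fallback keeps local
    development working but should not gate high-risk deployments."""
    try:
        import heavy_toxicity_model  # hypothetical optional dependency
        return heavy_toxicity_model.score
    except ImportError:
        blocklist = {"slur1", "slur2"}  # placeholder terms, not a real list
        return lambda text: 1.0 if set(text.lower().split()) & blocklist else 0.0
```

Logging which backend was selected into the audit stream makes the "variable detector quality across environments" caveat visible rather than silent.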