Architecture and Alignment Rationale

safedata is built as a modular, defense-in-depth filtering stack for training data safety.

Core design principles

1) Transparency

Every stage emits structured audit events (JSONL) containing the input ID, detector outputs, actions taken, and confidence values.

Alignment rationale: models trained on filtered data should have traceable provenance. If alignment behaviors degrade, maintainers need evidence for why a sample was included, redacted, or removed.
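
As a rough sketch, an audit record could be emitted like this; the field names and helper function are illustrative assumptions, not the library's actual logging schema.

```python
import json
import time

def emit_audit_event(log_path, input_id, detector_outputs, action, confidence):
    """Append one structured audit record per filtering decision (JSONL).

    Illustrative only: field names and the helper itself are assumptions,
    not the library's real logging API.
    """
    event = {
        "timestamp": time.time(),
        "input_id": input_id,
        "detector_outputs": detector_outputs,  # e.g. {"toxicity": 0.91, "pii": 0.02}
        "action": action,                      # "keep" | "redact" | "remove"
        "confidence": confidence,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

emit_audit_event(
    "audit.jsonl",
    input_id="sample-00042",
    detector_outputs={"toxicity": 0.91, "pii": 0.02},
    action="redact",
    confidence=0.88,
)
```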

2) Calibrated uncertainty

Classifiers output probabilities with confidence-calibration hooks (temperature scaling and reliability diagnostics) rather than hard labels alone.

Alignment rationale: uncertainty-aware filtering reduces overconfident failure modes and supports policy-level risk budgets.
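
A minimal, dependency-light sketch of temperature scaling: fit a single temperature on held-out logits by minimising negative log-likelihood, then divide future logits by it before thresholding. The function name and grid search are illustrative, not the library's calibration hook.

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Grid-search the temperature that minimises held-out NLL for a
    binary detector. Illustrative; an optimiser could replace the grid."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        p = 1.0 / (1.0 + np.exp(-logits / t))  # temperature-scaled sigmoid
        p = np.clip(p, 1e-7, 1 - 1e-7)
        nll = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t

# Overconfident, partly wrong logits push the fitted temperature above 1,
# which softens the probabilities the pipeline thresholds against.
t = fit_temperature(logits=[4.0, -3.5, 2.8, -4.1], labels=[1, 0, 0, 1])
print(f"fitted temperature: {t:.2f}")
```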

3) Adversarial mindset

Red-team attack generators (character swaps, homoglyph substitutions, label-flip/backdoor perturbations) are first-class modules.

Alignment rationale: safety controls must be evaluated under adaptive pressure, not only on clean benchmarks.
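
Two of the simpler perturbations, sketched below; the homoglyph table and function names are assumptions for illustration, not the attack modules' actual API.

```python
import random

# Latin -> Cyrillic lookalikes (a small illustrative subset).
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "c": "с"}

def homoglyph_attack(text, rate=0.3, seed=0):
    """Swap a fraction of substitutable characters for visually identical
    Unicode lookalikes, probing detectors that key on exact tokens."""
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )

def char_swap_attack(text, seed=0):
    """Transpose one adjacent character pair, a minimal typo-style edit."""
    rng = random.Random(seed)
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

print(homoglyph_attack("contact me for cheap accounts"))
print(char_swap_attack("contact me for cheap accounts"))
```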

4) Fairness by design

Bias evaluation components measure demographic representation ratios and counterfactual sensitivity, with optional Fairlearn metrics.

Alignment rationale: data filtering itself can introduce representational harms; fairness telemetry should be always-on, not an afterthought.
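
A rough sketch of a representation-ratio check: compare post-filter group frequencies against the pre-filter distribution, so ratios far from 1.0 flag disparate removal. The group labels and helper name are illustrative.

```python
from collections import Counter

def representation_ratios(kept, reference):
    """Ratio of each group's share after filtering to its share before.
    Values well below 1.0 indicate the group is being over-removed."""
    kept_counts, ref_counts = Counter(kept), Counter(reference)
    kept_total, ref_total = sum(kept_counts.values()), sum(ref_counts.values())
    return {
        g: (kept_counts[g] / kept_total) / (ref_counts[g] / ref_total)
        for g in ref_counts
    }

ratios = representation_ratios(
    kept=["group_a"] * 80 + ["group_b"] * 20,
    reference=["group_a"] * 70 + ["group_b"] * 30,
)
print(ratios)  # group_b ratio ~0.67 -> filtered out disproportionately
```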

5) Defense in depth

The FilterPipeline composes cheap deterministic checks (blocklists, deduplication) with statistical detectors (toxicity/PII/poisoning). No single detector acts as the sole gatekeeper.

Alignment rationale: single-point safety controls are brittle and easier to evade. Layered controls reduce correlated failures.
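
The composition idea, sketched without the real FilterPipeline API (which may differ): deterministic checks short-circuit first, and statistical detectors vote against a quorum rather than deciding alone.

```python
def run_pipeline(text, blocklist, seen_hashes, detectors, quorum=2):
    """detectors: list of (name, score_fn, threshold). Returns (action, audit).
    Illustrative sketch only; the thresholds and quorum rule stand in for
    the pipeline's actual thresholding policy."""
    # Stage 1: cheap deterministic checks (blocklist, exact dedup).
    if any(term in text.lower() for term in blocklist):
        return "remove", {"stage": "blocklist"}
    digest = hash(text)
    if digest in seen_hashes:
        return "remove", {"stage": "dedup"}
    seen_hashes.add(digest)
    # Stage 2: statistical detectors; act only when enough of them agree.
    audit, flags = {}, []
    for name, score_fn, threshold in detectors:
        score = score_fn(text)
        audit[name] = score
        if score >= threshold:
            flags.append(name)
    action = "remove" if len(flags) >= quorum else ("redact" if flags else "keep")
    return action, audit

action, audit = run_pipeline(
    "an ordinary training sentence",
    blocklist={"buy followers"},
    seen_hashes=set(),
    detectors=[("toxicity", lambda t: 0.10, 0.8), ("pii", lambda t: 0.05, 0.5)],
)
print(action, audit)  # keep {'toxicity': 0.1, 'pii': 0.05}
```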

System decomposition

  • classifiers/: detector primitives with explicit outputs and confidence.
  • filters/: orchestration/cascade and thresholding policy.
  • evaluation/: calibration, robustness, fairness, and safety metrics.
  • attacks/: red-team perturbation tooling.
  • utils/: data loaders, configuration, and auditable logging.

Tradeoffs

  • Keeping heavy dependencies optional makes local development lightweight, but detector quality varies across environments.
  • Heuristic fallback logic improves operability but should be replaced with validated models for high-risk deployments.
  • Online adaptation is intentionally excluded to avoid silent policy drift.