Service: secure-ai-gpu-integrity-watch.service
Binary: /usr/libexec/secure-ai/gpu-integrity-watch
Port: 8495 (loopback only)
Language: Go
Continuous GPU runtime integrity verification. Monitors the GPU hardware and driver stack to detect tampering, unexpected changes, or anomalies that could compromise model execution trust. Integrates with the runtime attestor and incident recorder for end-to-end GPU security.
GPU Integrity Watch runs as a daemon that periodically probes the GPU subsystem and scores results against a trusted baseline. If the score exceeds a configurable threshold, it triggers degradation actions and reports incidents.
+-----------+ +-----------+ +-----------+ +-------------------+
| Probes | --> | Scoring | --> | Actions | --> | Integrations |
| (6 types) | | (weighted | | (degrade, | | - incident-recorder|
| | | history) | | alert, | | - runtime-attestor |
| | | | | disable) | | |
+-----------+ +-----------+ +-----------+ +-------------------+
| Probe |
Type |
Default Weight |
What It Checks |
| Tensor Hash |
tensor_hash |
1.0 |
SHA-256 of model files vs baseline |
| Sentinel Inference |
sentinel_inference |
1.0 |
Known input/output pairs for behavioral consistency |
| Reference Drift |
reference_drift |
0.8 |
Multi-pass variance detection (corruption signature) |
| ECC Status |
ecc_status |
0.6 |
GPU memory error counters (nvidia-smi) |
| Driver Fingerprint |
driver_fingerprint |
1.0 |
GPU driver version + kernel module identity vs baseline |
| Device Allowlist |
device_allowlist |
0.8 |
GPU device nodes (/dev/dri/, /dev/nvidia) vs expected list |
| Composite Score |
Verdict |
| 0.0 - 0.3 |
healthy |
| 0.3 - 0.9 |
warning |
>= 0.9 or any probe fail |
critical |
The /v1/attest-state endpoint returns a GPUAttestState summary that the runtime attestor can include in the signed attestation bundle:
{
"timestamp": "2026-03-13T12:00:00Z",
"verdict": "healthy",
"composite_score": 0.0,
"probe_statuses": {"hash": "pass", "driver": "pass"},
"driver_version": "565.57.01",
"device_nodes": ["/dev/dri/card0", "/dev/dri/renderD128"],
"trend": 0.0
}
On warning or critical verdicts, GPU Integrity Watch automatically reports incidents to the incident-recorder service (http://127.0.0.1:8515). Incident classes are mapped from probe failures:
| Probe Failure |
Incident Class |
Severity |
| Tensor hash fail |
manifest_mismatch |
critical |
| ECC uncorrected errors |
integrity_violation |
critical |
| Driver fingerprint change |
integrity_violation |
high |
| Device allowlist fail |
integrity_violation |
high |
| Other anomalies |
model_behavior_anomaly |
high |
- Profile:
/etc/secure-ai/gpu-integrity/default-profile.yaml
- Baseline:
/var/lib/secure-ai/gpu-integrity/baseline.yaml
- Audit log:
/var/lib/secure-ai/logs/gpu-integrity-audit.jsonl
| Variable |
Default |
Description |
INTEGRITY_PROFILE |
profiles/default-profile.yaml |
Profile YAML path |
SERVICE_TOKEN |
(none) |
Bearer token for protected endpoints |
AUDIT_LOG |
(none) |
JSONL audit log path |
INCIDENT_RECORDER_URL |
(from profile) |
Override incident-recorder URL |
gpu-integrity-watch check # Run probes once, exit 0/1/2
gpu-integrity-watch watch # Continuous foreground monitoring
gpu-integrity-watch daemon # HTTP daemon + background monitoring
gpu-integrity-watch baseline # Capture baseline hashes
gpu-integrity-watch status # Query daemon status
| Action |
Type |
Trigger |
Effect |
| Alert |
alert |
warning |
Send webhook or log alert |
| Reload |
reload |
warning |
Signal inference server to reload model |
| Quarantine |
quarantine |
critical |
Move model files to quarantine directory |
| Fail Closed |
fail_closed |
critical |
Shut down inference server |
| Method |
Path |
Auth |
Description |
| GET |
/health |
No |
Liveness check |
| POST |
/v1/check |
No |
Trigger full probe cycle |
| GET |
/v1/status |
No |
Latest verdict, trend, probes, actions |
| GET |
/v1/history |
No |
Score history array |
| GET |
/v1/metrics |
No |
Counter metrics |
| GET |
/v1/attest-state |
No |
GPU attestation state for runtime-attestor |
| POST |
/v1/baseline |
Token |
Recapture baseline from model directory |
| POST |
/v1/reload |
Token |
Reload profile and baseline from disk |
| Mechanism |
Setting |
| Dynamic user |
DynamicUser=yes |
| Filesystem |
ProtectSystem=strict, ProtectHome=yes |
| Network |
RestrictAddressFamilies=AF_UNIX AF_INET, localhost only |
| Capabilities |
CapabilityBoundingSet= (empty) |
| Memory |
MemoryDenyWriteExecute=yes |
| Seccomp |
Custom seccomp-BPF profile |
| Landlock |
Read: /etc/secure-ai, /sys/class/drm, /sys/bus/pci/devices, /dev/dri; Write: /var/lib/secure-ai/logs, /var/lib/secure-ai/gpu-integrity |
81 tests covering:
- Tensor hash probes (5 tests)
- Sentinel inference/drift probes (3 tests)
- ECC status parsing (5 tests)
- Similarity computation (4 tests)
- Scoring engine (7 tests)
- Action execution (5 tests)
- Integration pipeline (2 tests)
- HTTP endpoints (10 tests)
- Token authentication (3 tests)
- Driver fingerprint probes (5 tests)
- Device allowlist probes (5 tests)
- Attestation state building (3 tests)
- Incident classification (4 tests)
- New probe integration (2 tests)
- Scoring with new weights (1 test)
cd services/gpu-integrity-watch && go test -v -race ./...