🛡️ Dual-Deployment Resilience Framework
🎯 Enterprise-Grade Availability Through Geographic Redundancy
📋 Document Owner: CEO | 📄 Version: 1.0 | 📅 Last Updated: 2026-02-10 (UTC)
🔄 Review Cycle: Quarterly | ⏰ Next Review: 2026-05-10
📌 Classification: Public
Riksdagsmonitor's business continuity framework demonstrates how geographic redundancy and automated failover directly enable operational resilience and service availability. Our dual-deployment strategy is both an operational necessity and a technical demonstration of enterprise-grade reliability principles.
This plan is designed to keep the riksdagsmonitor.com platform available during infrastructure disruptions through AWS multi-region deployment (primary) and GitHub Pages disaster recovery (standby). It targets 99.998% availability under normal operating conditions, with CloudFront origin failover typically completing in under 60 seconds and DNS/Route 53 failover (including health checks and DNS propagation) completing within approximately 15 minutes during full-region incidents.
— James Pether Sörling, CEO/Founder
Riksdagsmonitor provides public political transparency services requiring high availability but tolerating brief disruptions:
%%{
init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#1565C0',
'primaryTextColor': '#0d47a1',
'lineColor': '#1565C0',
'secondaryColor': '#4CAF50',
'tertiaryColor': '#FF9800'
}
}
}%%
graph TB
subgraph BIA["📊 Business Impact Analysis"]
FINANCIAL[💰 Financial Impact<br/>No direct revenue loss]
OPERATIONAL[⚙️ Operational Impact<br/>Service unavailable]
REPUTATIONAL[🤝 Reputational Impact<br/>Public trust in transparency]
CIVIC[🏛️ Civic Impact<br/>Democratic accountability]
end
subgraph RECOVERY["🔄 Recovery Requirements"]
RTO[⏰ RTO Target<br/>< 30 seconds origin failover<br/>< 15 minutes DNS failover]
RPO[💾 RPO Target<br/>< 15 minutes<br/>near-zero effective RPO (S3 replication lag)]
AVAILABILITY[📈 Availability Target<br/>99.998%<br/>≈10.5 minutes (~631 seconds) downtime/year]
end
subgraph DEPLOYMENT["🌍 Deployment Strategy"]
PRIMARY[☁️ AWS Primary<br/>CloudFront + S3 Multi-Region]
DR[📝 GitHub Pages DR<br/>Standby Deployment]
FAILOVER[🔄 Automatic Failover<br/>Route 53 Health Checks]
end
FINANCIAL --> RTO
OPERATIONAL --> RTO
REPUTATIONAL --> RPO
CIVIC --> AVAILABILITY
RTO --> PRIMARY
RPO --> PRIMARY
AVAILABILITY --> PRIMARY
PRIMARY --> FAILOVER
DR --> FAILOVER
style BIA fill:#1565C0
style RECOVERY fill:#FF9800
style DEPLOYMENT fill:#4CAF50
%%{
init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#1565C0',
'primaryTextColor': '#0d47a1',
'lineColor': '#1565C0',
'secondaryColor': '#4CAF50',
'tertiaryColor': '#FF9800'
}
}
}%%
graph TB
subgraph ROUTE53["🌐 Route 53 DNS"]
DNS[📡 DNS Service<br/>Health Checks Every 30s]
HEALTHCHECK[⚕️ Health Checker<br/>Tests CloudFront Endpoint]
end
subgraph PRIMARY["☁️ AWS Primary (Active)"]
CF[🌍 CloudFront CDN<br/>600+ PoPs<br/>Automatic Origin Failover]
S3_US[💾 S3 us-east-1<br/>Primary Origin<br/>Versioning Enabled]
S3_EU[💾 S3 eu-west-1<br/>Replica Origin<br/>Asynchronous Replication (<15 min RPO)]
CF -->|Primary| S3_US
CF -->|Failover on 5xx errors| S3_EU
S3_US -.->|Replication| S3_EU
end
subgraph DR["📝 GitHub Pages (Standby)"]
GH[📄 GitHub Pages<br/>Default branch (root)<br/>Automated Deployment]
end
USERS[👥 Users] -->|DNS Query| DNS
HEALTHCHECK -->|Monitor| CF
DNS -->|Healthy: Return CloudFront alias/hostname| USERS
DNS -.->|3 Failed Checks (~90s detection)<br/>+ DNS TTL/propagation (up to ~15 min total)| USERS
USERS -->|HTTPS/TLS 1.3| CF
USERS -.->|HTTPS/TLS 1.3 (DR)| GH
style ROUTE53 fill:#1565C0
style PRIMARY fill:#4CAF50
style DR fill:#FF9800
These are business continuity design objectives, not contractual guarantees. Availability figures are based on underlying cloud provider SLAs and documented reliability targets.
| Component | Provider SLA | Failover Mechanism | Target RTO | Target RPO | Notes |
|---|---|---|---|---|---|
| 🌍 CloudFront | 99.9% (AWS SLA) | Origin failover | < 30 seconds | ≈ 0 minutes | Cache may serve slightly stale content during failover |
| 💾 S3 us-east-1 | 99.99% (AWS SLA) | Multi-region replica | < 30 seconds | < 15 minutes | S3 cross-region replication typically completes within minutes; static content allows near-zero effective RPO |
| 💾 S3 eu-west-1 | 99.99% (AWS SLA) | Primary failback | < 30 seconds | < 15 minutes | Replication lag possible; static content minimizes data loss impact |
| 🌐 Route 53 | 100% (AWS SLA) | Health check failover (30s × 3 checks) | 15 minutes | ≈ 0 minutes | Includes health check detection (90s) + DNS TTL propagation (~14 min) |
| 📝 GitHub Pages | 99.9% (target; no formal SLA) | Route 53 automated DNS failover | 15 minutes | Up to last deployment | Static content served via Route 53 health-check based DNS failover; RPO = time since last successful GitHub Actions deploy |
| 🎯 Combined | Design target ≈ 99.998% | Automated multi-layer | < 30 seconds (objective) | < 15 minutes for static content (objective) | Theoretical calculation assuming largely independent failures |
Disclaimer: These are business continuity design objectives based on AWS published SLAs (CloudFront 99.9%, S3 99.99%, Route 53 100%) and GitHub public reliability targets. The combined 99.998% availability is a theoretical design target assuming largely independent failures. Actual end-to-end availability may be lower in practice. RPO values reflect S3 cross-region replication characteristics (typically < 15 minutes) and static content deployment timing; actual RPO may vary.
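The downtime budget behind the 99.998% design target can be sanity-checked without any AWS access; a plain awk calculation reproduces the ≈631 seconds (~10.5 minutes) per year quoted in the recovery requirements above:

```bash
# Annual downtime budget implied by the 99.998% availability design target.
awk 'BEGIN {
  target = 0.99998
  year_s = 365.25 * 24 * 3600            # seconds in an average year
  down_s = (1 - target) * year_s         # allowed downtime
  printf "Allowed downtime: %.0f s/year (~%.1f min)\n", down_s, down_s / 60
}'
```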
🔍 Detection:
- CloudFront origin monitoring detects HTTP 5xx errors from the us-east-1 origin
- Automatic failover triggered without manual intervention
🔄 Recovery Procedure:
- ⚡ CloudFront automatically routes to S3 eu-west-1 origin (< 30 seconds)
- 📊 Verify service availability via monitoring
- 📝 Log incident for post-event analysis
- ⏳ Monitor AWS status for us-east-1 restoration
- 🔙 Automatic failback when us-east-1 recovers
✅ Validation:
- Service availability confirmed via health checks
- User experience unaffected (transparent failover)
- Content served from eu-west-1 (identical to us-east-1)
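A minimal verification sketch for the automatic origin failover described above, assuming the AWS CLI and jq are available; `<DISTRIBUTION_ID>` is a placeholder for the CloudFront distribution serving riksdagsmonitor.com:

```bash
# Inspect the origin group members and the status codes that trigger failover.
aws cloudfront get-distribution-config --id <DISTRIBUTION_ID> \
  | jq '.DistributionConfig.OriginGroups.Items[]
        | {id: .Id,
           members: [.Members.Items[].OriginId],
           failover_status_codes: .FailoverCriteria.StatusCodes.Items}'
```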
🔍 Detection:
- Route 53 health checks fail for CloudFront endpoint
- Automated DNS failover to GitHub Pages after health check detection + DNS propagation (≈ 15 minutes total)
🔄 Recovery Procedure:
- ⚕️ Route 53 detects CloudFront health check failures (30s intervals × 3 failures = 90 seconds detection time)
- 🌐 DNS automatically updates riksdagsmonitor.com → GitHub Pages
- 📊 Verify GitHub Pages serving traffic
- 📧 Notify CEO of failover event
- ⏳ Monitor CloudFront status for restoration
- 🔙 Intentionally manual DNS failback after CloudFront recovery and stability confirmation
- Rationale: Failback is manual by design to avoid DNS flapping and ensure human verification before restoring CloudFront as primary
✅ Validation:
- GitHub Pages availability confirmed
- Users redirected via DNS (up to 15-minute TTL)
- Content identical (synchronized deployment)
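A minimal verification sketch for this DNS failover scenario, assuming dig, the AWS CLI, and jq are available; `<HEALTH_CHECK_ID>` is a placeholder for the Route 53 health check that monitors the CloudFront endpoint:

```bash
# See what riksdagsmonitor.com currently resolves to (CloudFront vs GitHub Pages addresses).
dig +short riksdagsmonitor.com

# Query the Route 53 health check status as seen from the AWS checker fleet.
aws route53 get-health-check-status --health-check-id <HEALTH_CHECK_ID> \
  | jq '.HealthCheckObservations[] | {region: .Region, status: .StatusReport.Status}'
```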
🔍 Detection:
- CloudFront cannot reach either S3 origin
- Route 53 health checks fail
🔄 Recovery Procedure:
- ⚡ CloudFront attempts origin failover (< 30 seconds)
- 🌐 Route 53 DNS failover to GitHub Pages (15 minutes)
- 📊 Verify GitHub Pages serving traffic
- 📧 CEO notification of major AWS outage
- ⏳ Monitor AWS status dashboard
- 🔙 DNS failback after AWS recovery
✅ Validation:
- Service restored via GitHub Pages
- Incident documented with AWS service disruption details
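To confirm which platform is actually serving traffic during or after a regional outage, response headers are usually sufficient: CloudFront responses typically include `via`, `x-cache`, and `x-amz-cf-pop` headers, while GitHub Pages typically answers with `server: GitHub.com`. A minimal check:

```bash
# Inspect response headers to identify the serving platform.
curl -sI https://riksdagsmonitor.com/ | grep -iE '^(server|via|x-cache|x-amz-cf-pop):'
```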
🔍 Detection:
- CloudTrail alerts for unauthorized API calls
- GuardDuty security findings
- Unexpected configuration changes
🔄 Recovery Procedure:
- 🔒 Immediate DNS failover to GitHub Pages (operator action: 2 minutes; client-visible cutover: up to DNS TTL propagation ~15 minutes)
- 🔐 Revoke all AWS IAM credentials and access keys
- 🔄 Update AWS IAM role trust policy for GitHub Actions OIDC provider to revoke compromised trust
- 📊 CloudTrail audit of unauthorized actions
- 🛡️ AWS Support engagement for forensics
- 🔧 Restore infrastructure from documented configuration and backups (future-state: Infrastructure-as-Code)
- ✅ Security validation before DNS failback
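A minimal sketch of the IAM credential revocation step above, assuming the AWS CLI is run from a trusted workstation; `<USER_NAME>` and `<ACCESS_KEY_ID>` are placeholders, and the GitHub Actions OIDC trust policy is updated separately (console or `aws iam update-assume-role-policy`):

```bash
# Enumerate IAM users and deactivate their long-lived access keys.
aws iam list-users --query 'Users[].UserName' --output text
aws iam list-access-keys --user-name <USER_NAME> \
  --query 'AccessKeyMetadata[].AccessKeyId' --output text
aws iam update-access-key --user-name <USER_NAME> \
  --access-key-id <ACCESS_KEY_ID> --status Inactive
```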
✅ Validation:
- Service operational on GitHub Pages
- All compromised credentials revoked
- Forensic analysis completed
- Infrastructure hardened before restoration
🔍 Detection:
- GitHub Pages deployment failure
- Health checks fail for GitHub Pages endpoint
🔄 Recovery Procedure:
- 📊 Verify GitHub status dashboard
- 🌐 If AWS is available, revert DNS to CloudFront immediately
- 📄 If both are unavailable, deploy to an alternative CDN (Cloudflare Pages, Netlify)
- 📦 Build static site from Git main branch
- 🌐 Update DNS to alternative CDN
- 🔙 Restore to primary after AWS/GitHub recovery
✅ Validation:
- Alternative deployment confirmed operational
- DNS propagation verified
- Incident escalated to GitHub Support
👨💼 CEO (James Pether Sörling) - Business Continuity Coordinator
- 🔑 Authority: Full decision-making power for continuity actions
- 🎯 Responsibilities: Strategic decisions, stakeholder communication, recovery coordination
- 📞 Contact: Primary mobile, backup email, monitoring alerts
- 🛠️ Tools: AWS Console, GitHub CLI, Route 53 DNS management, CloudWatch
🔧 Technical Recovery (CEO as Technical Lead)
- 🎯 Responsibilities: AWS infrastructure, GitHub Pages, DNS failover, health check monitoring
- 🛠️ Tools: AWS Console, AWS CLI, GitHub Actions, Route 53, CloudWatch
- 📞 Escalation Paths: AWS Enterprise Support, GitHub Enterprise Support
| 👤 Role | 📞 Primary Contact | 🔄 Backup Method | ⏰ Response Time |
|---|---|---|---|
| 👨💼 CEO/Coordinator | 📱 Mobile phone | 📧 Email + SMS | < 15 minutes |
| ☁️ AWS Support | 🌐 Enterprise Portal | 📞 Phone support | < 15 minutes |
| 📝 GitHub Support | 🌐 Enterprise Portal | | < 1 hour |
| 🌐 Route 53 Operations | ☁️ AWS Console | 📱 Mobile app | < 5 minutes |
| 📊 Monitoring Alerts | 📧 Email + 📱 SMS | 💬 Chat/IM | Real-time |
- 📊 Assess Situation: Determine scope via CloudWatch, Route 53 health checks
- 🔍 Identify Failure Point: AWS infrastructure, DNS, GitHub Pages
- 🚀 Activate Recovery: Automatic (CloudFront failover) or manual (DNS update)
- 📢 Log Incident: Document detection time, symptoms, actions taken
- 📧 Stakeholder Notification: CEO notification via monitoring alerts
%%{
init: {
'theme': 'base',
'themeVariables': {
'primaryColor': '#1565C0',
'primaryTextColor': '#0d47a1',
'lineColor': '#1565C0',
'secondaryColor': '#4CAF50',
'tertiaryColor': '#FF9800'
}
}
}%%
graph TD
INCIDENT[🚨 Service Disruption Detected] --> CHECK_CF{CloudFront<br/>Accessible?}
CHECK_CF -->|No| MANUAL_DNS[🌐 Manual DNS Failover<br/>to GitHub Pages<br/>RTO: 2 minutes]
CHECK_CF -->|Yes| CHECK_S3{S3 Origins<br/>Accessible?}
CHECK_S3 -->|us-east-1 No| AUTO_FAILOVER[⚡ Automatic Origin Failover<br/>to eu-west-1<br/>RTO: < 30 seconds]
CHECK_S3 -->|Both No| ROUTE53_FAILOVER[⚕️ Route 53 Health Check<br/>DNS Failover<br/>RTO: 15 minutes]
CHECK_S3 -->|Yes| CHECK_HEALTH{Health Check<br/>Passing?}
CHECK_HEALTH -->|No| INVESTIGATE[🔍 Investigate Root Cause<br/>Application Error?<br/>Configuration Issue?]
CHECK_HEALTH -->|Yes| FALSE_ALARM[✅ False Alarm<br/>Monitor and Document]
MANUAL_DNS --> VERIFY[✅ Verify Service Restored]
AUTO_FAILOVER --> VERIFY
ROUTE53_FAILOVER --> VERIFY
INVESTIGATE --> VERIFY
VERIFY --> DOCUMENT[📝 Incident Documentation<br/>Post-Event Analysis]
style INCIDENT fill:#FF9800
style MANUAL_DNS fill:#1565C0
style AUTO_FAILOVER fill:#4CAF50
style ROUTE53_FAILOVER fill:#1565C0
style VERIFY fill:#4CAF50
| Test Type | Frequency | Scope | Success Criteria |
|---|---|---|---|
| ⚡ Origin Failover Test | Quarterly | CloudFront → S3 eu-west-1 | Failover < 30 seconds, no data loss |
| 🌐 DNS Failover Test | Semi-Annual | Route 53 → GitHub Pages | Failover within 15 minutes, content identical |
| 🔙 Failback Test | Quarterly | Return to primary infrastructure | Clean restoration, no errors |
| 📊 Monitoring Alert Test | Monthly | CloudWatch, Route 53 health checks | Alerts delivered within 5 minutes |
| 📋 Recovery Runbook Test | Quarterly | Execute documented procedures | All steps executable, documentation accurate |
| 🔐 Security Incident Drill | Annual | AWS account compromise scenario | Credentials revoked, service restored on DR |
Quarterly Origin Failover Test:
- 🔧 Temporarily deny CloudFront access to S3 us-east-1 via bucket policy (add a temporary Deny statement for the CloudFront Origin Access Identity; see the policy sketch after this list)
- ⏱️ Measure CloudFront automatic failover time to eu-west-1
- ✅ Verify content served from eu-west-1 origin
- 🔙 Remove the temporary Deny from us-east-1 bucket policy and confirm failback to primary origin
- 📝 Document results and improvements
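A hedged sketch of the temporary Deny statement referenced in step 1 of this test; `<BUCKET>` and `<OAI_ID>` are placeholders, and in practice the statement is added alongside the bucket's existing Allow statements rather than replacing the whole policy:

```bash
# Temporarily block the CloudFront Origin Access Identity from reading the primary bucket.
aws s3api put-bucket-policy --bucket <BUCKET> --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "TempDenyCloudFrontOAI",
    "Effect": "Deny",
    "Principal": {
      "AWS": "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity <OAI_ID>"
    },
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::<BUCKET>/*"
  }]
}'
```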
Semi-Annual DNS Failover Test:
- 🔧 Update the Route 53 health check to force a failure (one approach is sketched after this list)
- ⏱️ Measure DNS propagation time
- ✅ Verify GitHub Pages serving traffic
- 🔙 Restore Route 53 health check
- 📝 Document results and TTL impact
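One way to force the health check failure in step 1 without touching the CloudFront endpoint itself is to invert the health check temporarily; `<HEALTH_CHECK_ID>` is a placeholder:

```bash
# Invert the health check so it reports unhealthy, watch DNS cut over, then restore it.
aws route53 update-health-check --health-check-id <HEALTH_CHECK_ID> --inverted
watch -n 30 dig +short riksdagsmonitor.com
aws route53 update-health-check --health-check-id <HEALTH_CHECK_ID> --no-inverted
```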
| Metric | Target | Current Status | Trend |
|---|---|---|---|
| 🎯 Availability | 99.998% | 99.999% (YTD) | ✅ Exceeding |
| ⚡ Origin Failover RTO | < 30 seconds | 18 seconds (last test) | ✅ On track |
| 🌐 DNS Failover RTO | 15 minutes | 14 minutes (last test) | ✅ On track |
| 💾 Data Synchronization | < 15 min RPO (near-zero effective) | ≈0 seconds (real-time) | ✅ On track |
| 🧪 BCP Testing | Quarterly | Last tested 2026-02 | ✅ Current |
| 📊 Monitoring Coverage | 100% | 100% (all endpoints) | ✅ Complete |
Note: The "Current Status" values in this table are illustrative planning examples. Actual operational metrics are monitored via AWS CloudWatch, Route 53 health check logs, and GitHub Pages status, and documented in operational runbooks.
Because the CEO/Founder is the sole employee, a traditional business continuity team is not possible. Riksdagsmonitor instead implements automated infrastructure resilience combined with comprehensive documentation:
Capabilities:
- Cloud Infrastructure Expertise: AWS Solutions Architect, 15+ years experience
- Automated Failover: CloudFront origin failover, Route 53 health checks (no manual intervention)
- Documentation: All procedures documented in ISMS for continuity
- Monitoring: CloudWatch alarms, Route 53 health checks, automated notifications
- Supplier Relationships: AWS Enterprise Support, GitHub Enterprise Support
| Control Type | Implementation | Effectiveness |
|---|---|---|
| 🤖 Automated Failover | CloudFront origin failover (< 30s), Route 53 DNS failover (15 min) | Eliminates manual recovery for primary scenarios |
| 📚 Documentation | Complete runbooks in BCPPlan.md, ARCHITECTURE.md, SECURITY_ARCHITECTURE.md | Enables recovery by any technical professional |
| 🔄 Infrastructure-as-Code (Planned) | AWS static site and DNS infrastructure to be codified in Terraform/CloudFormation (see FUTURE_SECURITY_ARCHITECTURE.md) | Future-state: fully reproducible infrastructure from version-controlled IaC |
| 📊 Comprehensive Monitoring | CloudWatch, Route 53 health checks, automated alerts | Real-time detection and notification |
| 💾 Geographic Redundancy | Multi-region S3 (us-east-1 + eu-west-1), GitHub Pages standby | No single point of failure |
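A minimal sketch for spot-checking the geographic redundancy control above, assuming the AWS CLI; `<PRIMARY_BUCKET>` and `<REPLICA_BUCKET>` are placeholders for the us-east-1 and eu-west-1 origin buckets:

```bash
# Confirm cross-region replication is enabled and spot-check one object in both regions.
aws s3api get-bucket-replication --bucket <PRIMARY_BUCKET> \
  | jq '.ReplicationConfiguration.Rules[] | {id: .ID, status: .Status, dest: .Destination.Bucket}'
aws s3api head-object --bucket <PRIMARY_BUCKET> --key index.html --query 'LastModified'
aws s3api head-object --bucket <REPLICA_BUCKET> --key index.html --query 'LastModified'
```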
- 🏗️ ARCHITECTURE.md - System architecture and AWS infrastructure design
- 🔐 SECURITY_ARCHITECTURE.md - Security controls and AWS security architecture
- 🚀 FUTURE_SECURITY_ARCHITECTURE.md - Security roadmap and planned enhancements
- 🎯 THREAT_MODEL.md - STRIDE threat analysis and risk assessment
- ⚙️ WORKFLOWS.md - CI/CD workflows and deployment automation
- 📖 README.md - Project overview and quick start guide
ℹ️ Alignment notice: WORKFLOWS.md, FUTURE_SECURITY_ARCHITECTURE.md and THREAT_MODEL.md are pending update to fully align with the dual-deployment continuity model and current primary hosting described in this BCPPlan. If there is any conflict regarding the current hosting/deployment architecture, this BCPPlan is the authoritative source.
This section provides detailed, step-by-step incident response playbooks for the three highest-probability incident scenarios for Riksdagsmonitor. All playbooks follow the PICERL framework: Preparation, Identification, Containment, Eradication, Recovery, Lessons Learned.
Playbook ID: IR-PB-001
Version: 1.0
Owner: James Pether Sörling, CEO
Last Reviewed: 2026-02-25
This playbook activates when any of the following are detected:
| Signal | Detection Method | Severity Indicator |
|---|---|---|
| Unexpected content changes in production | GitHub Actions diff in deploy log | HIGH if unauthorized |
| Unauthorized Git commits to main branch | GitHub audit log alert | CRITICAL |
| Branch protection bypass detected | GitHub security event | CRITICAL |
| Anomalous content detected by user report | User email to security@hack23.com | HIGH |
| SLSA attestation failure | GitHub Actions security job | HIGH |
| Unexpected language content injection | HTMLHint content validation | MEDIUM |
| Severity | Criteria | Response Time | Escalation |
|---|---|---|---|
| P1 - Critical | Unauthorized content in production, branch protection bypass, SLSA attestation failure | 15 minutes to containment | Immediate personal notification to CEO |
| P2 - High | Suspected tampering unconfirmed, anomalous content flagged | 1 hour to investigation | Alert within 30 minutes |
| P3 - Medium | Minor unexpected changes, validation warnings | 4 hours to resolution | Standard ISMS notification |
PHASE 1: DETECT (0-15 minutes for P1)
- Receive Alert — GitHub Actions notification, user report, or automated monitoring
- Verify Authenticity — Confirm alert is genuine (not false positive)
- Check GitHub Actions run logs for the deploy job
- Verify SHA-256 hashes in build metadata
- Review Git commit history on main branch
- Classify Severity — Apply classification matrix above
- Document Start Time — Record incident start timestamp in UTC
- Open Incident Record — Create GitHub Issue with label `security-incident`
PHASE 2: TRIAGE (15-30 minutes for P1)
- Scope Assessment — Which files are affected? (index.html, all 14 language variants, news articles?)
- Impact Assessment — Is tampered content currently visible to users?
- Source Identification — Review GitHub audit log for:
```
Settings > Security > Audit log
Filter: Action = "repo.create_actions_secret" or "git.push" or "protected_branch"
```
- Blast Radius — Determine if compromise is isolated or widespread
PHASE 3: CONTAIN (30-60 minutes for P1)
- Immediate Rollback — Revert to last known good commit:
```bash
git log --oneline -20              # Identify last known good commit
git revert HEAD...<last-good-sha>  # Revert to good state
git push origin main               # Trigger redeploy
```
- Block Malicious User (if external) — Via GitHub repository settings
- Revoke Compromised Credentials — If credentials were used:
- Rotate all GitHub Secrets immediately
- Revoke compromised PATs
- Regenerate Amazon Bedrock API keys
- Enable Temporary Maintenance Mode — If content integrity cannot be confirmed:
- Temporarily set CloudFront to return 503 for affected paths
- Display maintenance page with explanation
PHASE 4: ERADICATE (1-4 hours)
- Root Cause Analysis — Determine exact attack vector:
- Social engineering?
- Compromised credentials?
- Supply chain attack via dependency?
- GitHub Actions workflow injection?
- Remove Malicious Content — Clean all affected files
- Verify Clean State — SHA-256 comparison against last known good
- Patch Vulnerability — Fix the root cause (update dependency, revoke credential, harden workflow)
PHASE 5: RECOVER (4-24 hours)
- Restore Service — Deploy verified clean content
- Verify Integrity — Automated integrity checks pass
- Monitor Closely — Increased monitoring for 72 hours post-incident
- Stakeholder Communication — Post transparent incident report (see template below)
PHASE 6: POST-INCIDENT (Within 72 hours)
- Lessons Learned Meeting — Document findings
- Update Controls — Implement additional preventive measures
- Update Threat Model — If new attack vector discovered
- NIS2 Assessment — Determine if ENISA notification required
Subject: [Riksdagsmonitor] Security Incident Report - Content Integrity
Incident: Potential content tampering detected
Date/Time: [UTC timestamp]
Severity: [P1/P2/P3]
Status: [Investigating / Contained / Resolved]
Summary:
We detected [brief description]. Our investigation found [findings].
Actions Taken:
1. [Action taken]
2. [Action taken]
Impact:
Content was [not affected / affected for X minutes] between [time] and [time].
Preventive Measures:
[Measures implemented to prevent recurrence]
Contact: security@hack23.com
```bash
# Step 1: Identify good commit
git log --oneline --graph --all | head -30

# Step 2: Verify content of last known good commit
git show <good-sha>:index.html | sha256sum

# Step 3: Create revert commit (preserves history)
git revert --no-commit <bad-sha>..HEAD
git commit -m "security: revert content tampering incident [IR-PB-001]"

# Step 4: Push and trigger redeploy
git push origin main

# Step 5: Verify production content
curl -s https://riksdagsmonitor.com/ | sha256sum
```
- GitHub Actions run logs (download and archive)
- GitHub Audit Log export for incident timeframe
- Git commit history with diff
- SHA-256 hashes of affected and clean files
- CloudFront access logs for incident timeframe
- SLSA attestation records
- Sigstore transparency log entries
- Browser screenshots of tampered content (if visible)
- User reports with timestamps
- Credential access logs from GitHub
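A minimal sketch for archiving GitHub Actions run logs as evidence with the GitHub CLI; `<RUN_ID>` is a placeholder and the `evidence/` directory layout is illustrative:

```bash
# Archive the suspect workflow run log and hash it so later tampering is detectable.
mkdir -p evidence
gh run list --repo Hack23/riksdagsmonitor --limit 10
gh run view <RUN_ID> --repo Hack23/riksdagsmonitor --log > evidence/actions-run-<RUN_ID>.log
sha256sum evidence/actions-run-<RUN_ID>.log >> evidence/SHA256SUMS
```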
Playbook ID: IR-PB-002
Version: 1.0
Owner: James Pether Sörling, CEO
Last Reviewed: 2026-02-25
| Signal | Detection Method | Severity |
|---|---|---|
| GitHub Actions MCP job failure | Workflow notification email | HIGH |
| riksdag.se API returning 5xx errors | Pipeline error log | HIGH |
| API timeout after 30s | MCP client timeout log | MEDIUM |
| Data staleness alert (>48h) | Automated staleness checker | MEDIUM |
| Amazon Bedrock API unavailable | GitHub Actions job failure | HIGH |
| Zero articles generated for 3+ days | Manual monitoring check | HIGH |
| CIA platform export unavailable | Dashboard shows stale data | MEDIUM |
| Severity | Criteria | Response Time |
|---|---|---|
| P1 - Critical | Complete MCP pipeline down, 0 data updates for 24+ hours | 1 hour |
| P2 - High | Partial data failure, degraded content generation, 12-24 hour gap | 4 hours |
| P3 - Medium | Single source unavailable, minor staleness, pipeline flaky | 24 hours |
PHASE 1: DETECT AND VERIFY
- Confirm Outage — Check GitHub Actions run history:
- Navigate to Actions tab
- Filter by workflow: `news-generation.yml`
- Check last 5 runs for failure pattern
- Identify Scope — Determine which component is failing:
- Riksdag API unavailable?
- Amazon Bedrock rate limited or unavailable?
- riksdag-regering-mcp server issue?
- Network egress blocked by harden-runner?
- Check External Status Pages:
- https://www.riksdagen.se/sv/kontakt/ (Riksdag IT contact)
- https://status.aws.amazon.com/ (Amazon Bedrock status)
- https://www.githubstatus.com/ (GitHub Actions status)
- Classify Severity and start incident timer
PHASE 2: TRIAGE
- Check Cached Data Availability — Verify the `cia-data/` directory has recent data
- Determine User Impact — Are dashboards showing stale data? How stale?
- Estimate Recovery Time — Is this an external outage (wait) or internal issue (fix)?
PHASE 3: CONTAIN / GRACEFUL DEGRADATION
- Activate Stale Data Banner — If data is more than 48 hours old:
  - Edit `index.html` to show data freshness warning
  - Deploy immediately
- Use Cached Data — Pipeline automatically falls back to `cia-data/` cache
- Disable Failed Pipeline — If pipeline is producing errors, temporarily disable cron:
```yaml
# Temporarily comment out schedule trigger in workflow YAML
# on:
#   schedule:
#     - cron: '0 1 * * *'
```
- Document Outage Start — Record in incident log
PHASE 4: INVESTIGATE AND RESTORE
- External Outage: Wait for provider recovery, monitor status pages
- Internal Issue - API Change:
- Review Riksdag API changelog
- Update MCP server configuration
- Test with `npm run test:mcp`
- Internal Issue - Credential:
- Verify Amazon Bedrock API key in GitHub Secrets
- Rotate key if expired or compromised
- Internal Issue - Rate Limiting:
- Implement exponential backoff
- Reduce fetch frequency temporarily
- Check Riksdag API terms of service
PHASE 5: RESTORE SERVICE
- Re-enable Pipeline — Restore cron schedule in workflow
- Run Manual Trigger — `workflow_dispatch` to verify pipeline works
- Verify Output — Confirm articles generate successfully in all 14 languages
- Remove Stale Banner — Update HTML once fresh data available
- Verify Dashboards — Confirm CIA data dashboards show current data
PHASE 6: POST-INCIDENT
- Document Root Cause — In incident GitHub Issue
- Add Monitoring — Alert if no successful pipeline run in 36 hours (see the check sketched below)
- Update Runbooks — If new failure mode discovered
- Resilience Improvement — Implement recommendation from this incident
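A minimal sketch of the 36-hour staleness check mentioned above, using the GitHub CLI; GNU date is assumed, and the alert mechanism (non-zero exit) is illustrative:

```bash
# Exit non-zero if the last successful news-generation run is older than 36 hours.
last_ok=$(gh run list --repo Hack23/riksdagsmonitor --workflow news-generation.yml \
  --status success --limit 1 --json updatedAt --jq '.[0].updatedAt')
age_h=$(( ( $(date -u +%s) - $(date -u -d "$last_ok" +%s) ) / 3600 ))
echo "Last successful run: $last_ok (${age_h}h ago)"
[ "$age_h" -lt 36 ]
```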
- MCP server responding to tool discovery
- Riksdag API returning valid JSON
- Amazon Bedrock API responding within 30s
- News generation pipeline completes without error
- 14 language articles successfully generated
- SHA-256 integrity check passes
- Git commit and PR created successfully
- CIA data dashboards showing fresh data
- Stale data banners removed from all 14 language pages
- GitHub Actions workflow next scheduled run confirmed
Subject: [Riksdagsmonitor] Service Notification - Data Pipeline Status
Status: [Investigating / Degraded / Restored]
Affected: Automated news generation and/or data dashboard updates
Date: [UTC date]
Current Status:
The automated data pipeline is [description].
Content published before [timestamp] remains accurate.
Expected Resolution:
[ETA or "Awaiting external provider recovery"]
Data Freshness:
Most recent data: [timestamp]
Best available data is displayed with staleness indicator.
Updates: Follow https://github.com/Hack23/riksdagsmonitor/issues
Playbook ID: IR-PB-003
Version: 1.0
Owner: James Pether Sörling, CEO
Last Reviewed: 2026-02-25
| Signal | Detection Method | Severity |
|---|---|---|
| Anomalous political content in generated articles | Human review gate | CRITICAL |
| SHA-256 hash mismatch for CIA data export | Integrity check in pipeline | HIGH |
| JSON schema validation failure from unexpected fields | Data validation log | HIGH |
| Statistics that contradict known parliamentary data | Quality scoring below threshold | HIGH |
| Dramatic unexpected change in voting statistics | Anomaly detection | HIGH |
| LLM output contains factually incorrect political claims | Human review | MEDIUM |
| Unexpected HTML injection in article content | HTMLHint detection | MEDIUM |
| Severity | Criteria | Response Time |
|---|---|---|
| P1 - Critical | Confirmed false political information published and live | 15 minutes to takedown |
| P2 - High | Suspected data poisoning, anomalous content caught by review | 1 hour investigation |
| P3 - Medium | Data anomaly detected, not yet published | 4 hours analysis |
PHASE 1: DETECT
- Initial Detection — Via human review gate, quality scoring, or user report
- Preserve Evidence — Before any changes:
- Screenshot anomalous content
- Download and archive current `cia-data/` directory
- Export GitHub Actions run log
- Record all timestamps in UTC
- Initial Assessment — Is this:
- LLM hallucination (most likely)?
- Corrupted source data from Riksdag API?
- Malicious injection into CIA platform export?
- Supply chain compromise in MCP server?
PHASE 2: TRIAGE
- Trace to Source — Identify where anomalous data entered:
```bash
# Check raw API response data
cat cia-data/raw-export.json | jq '.["votingStats"]'

# Compare with previous good data
git diff HEAD~1 -- cia-data/

# Check MCP tool call logs in GitHub Actions
# Navigate: Actions > run-id > news-generation > step-logs
```
- Scope Assessment — How much content is affected?
- Published vs Pending — Is anomalous content live or only in pipeline?
PHASE 3: CONTAIN
- If Content Is Live — Immediate quarantine:
```bash
# Revert to last clean version
git revert HEAD --no-commit
git commit -m "security: quarantine poisoned content [IR-PB-003]"
git push origin main
```
- Pause Pipeline — Disable automated news generation until source validated:
- Comment out cron schedule in workflow YAML
- Push change to temporarily halt pipeline
- Quarantine Data Files — Move suspicious data to quarantine directory:
```bash
mkdir -p cia-data/quarantine/$(date +%Y%m%d)
cp cia-data/*.json cia-data/quarantine/$(date +%Y%m%d)/
```
- Update Cache — Restore from last verified clean data backup (Git history)
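A minimal sketch of restoring the data directory from a verified clean commit without rewriting history; `<GOOD_SHA>` is a placeholder for that commit:

```bash
# Restore cia-data/ from the last verified clean commit and redeploy.
git checkout <GOOD_SHA> -- cia-data/
git commit -m "security: restore cia-data from verified clean state [IR-PB-003]"
git push origin main
```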
PHASE 4: VALIDATE SOURCE DATA
- Cross-Reference with Riksdag.se — Manually verify key statistics:
- Party seat counts at https://riksdagen.se
- Recent vote outcomes
- Member information
- Verify CIA Platform Data — Check CIA platform directly:
- Access https://cia.hack23.com and compare key figures
- Check CIA platform's own data integrity logs
- Re-fetch Clean Data — Trigger fresh MCP data fetch after source verified:
```bash
npm run fetch:cia-data   # Fetch fresh data
npm run validate:data    # Run validation suite
```
- Schema Comparison — Verify data structure matches expected schema:
```bash
npm run validate:schema -- --input cia-data/export.json
```
PHASE 5: ERADICATE
- Remove All Poisoned Content — From production and Git history if needed
- Re-validate All Published Articles — Check recent articles against source data
- Update Quality Filters — Add detection rules for the anomaly type seen
- Enhance LLM Guardrails — Add explicit factual verification prompts
PHASE 6: RECOVER
- Re-enable Pipeline — Restore cron schedule after validation
- Generate Fresh Articles — Replace any quarantined content
- Issue Correction — If incorrect information was public, issue transparent correction
- Enhanced Monitoring — Increase review frequency for 30 days
PHASE 7: POST-INCIDENT
- Root Cause Report — Document in incident GitHub Issue
- Control Enhancement — Implement additional preventive measures
- Threat Model Update — Update THREAT_MODEL.md with new attack vector
- Communication — If users were exposed to false information, issue public statement
## Data Poisoning Incident RCA - [DATE]
**Incident ID:** IR-PB-003-[YYYYMMDD]
**Severity:** [P1/P2/P3]
**Detection Time:** [UTC]
**Containment Time:** [UTC]
**Resolution Time:** [UTC]
### Timeline
| Time (UTC) | Event |
|------------|-------|
| [time] | Anomaly first detected by [method] |
| [time] | [Action taken] |
### Root Cause
[Describe the root cause: LLM hallucination / API corruption / supply chain]
### Attack Vector (if malicious)
[Describe how attacker introduced false data]
### Impact Assessment
- Content affected: [list of files/articles]
- Time live: [duration if published]
- User exposure: [estimated unique users who may have seen false content]
### Remediation Steps Taken
1. [Step taken]
2. [Step taken]
### Preventive Measures Implemented
1. [Control enhancement]
2. [Control enhancement]
### Lessons Learned
[Key takeaways for future incident prevention]
| Measure | Implementation | Status |
|---|---|---|
| Human review gate for all AI-generated content | Mandatory PR review before merge | Active |
| Quality score threshold (0.8/1.0) | LLM self-evaluation before translation | Active |
| SHA-256 integrity hashing | Every article and data file | Active |
| JSON schema validation | Multi-stage data validation pipeline | Active |
| Anomaly detection for statistical outliers | Numeric range validation | Active |
| Source data cross-reference | Manual spot-check quarterly | Planned |
| LLM output factual verification | Citation requirement in prompts | Planned 2027 |
| Automated fact-checking against Riksdag.se | Selenium scraper validation | Planned 2028 |
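A minimal sketch of the statistical-outlier idea above: Riksdag seat counts must always sum to 349, so any export violating this invariant is quarantined. The `.parties[].seats` field layout is an assumed shape for illustration, not the actual export schema:

```bash
# Invariant check: party seats in the exported data must sum to the Riksdag's 349 seats.
total=$(jq '[.parties[].seats] | add' cia-data/export.json)
if [ "$total" -ne 349 ]; then
  echo "ANOMALY: party seats sum to $total (expected 349), quarantine per IR-PB-003" >&2
  exit 1
fi
```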
| Playbook | ID | P1 Response | P2 Response | Primary Action | Evidence |
|---|---|---|---|---|---|
| Content Tampering | IR-PB-001 | 15 min contain | 1 hr contain | `git revert` + credential rotation | GitHub audit + SHA-256 |
| MCP Outage | IR-PB-002 | 1 hr restore | 4 hr restore | Graceful degrade + pipeline fix | Actions logs + status pages |
| Data Poisoning | IR-PB-003 | 15 min takedown | 1 hr quarantine | Quarantine + source validation | Data diff + cross-ref |
📋 Document Control:
✅ Approved by: James Pether Sörling, CEO
📤 Distribution: Public
🏷️ Classification: Public
📅 Effective Date: 2026-02-25
⏰ Next Review: 2026-05-25
🎯 Framework Compliance: