Security & Reliability Fixes Applied

Date: 2025-11-13
Status: ✅ All 10 critical fixes completed


Summary

This document outlines the launch-blocking security and reliability issues that were identified and fixed in the Sentinel codebase.

Fixes Applied

✅ Fix 1: Removed .env from Git & Created Secure Templates

Problem: The .env file containing secrets was committed to git, and default passwords (wildfire123, admin123) were exposed in the repo history.

Fixed:

  • Removed .env from git tracking
  • Created .env.production.template with generation instructions
  • Updated .gitignore to block all .env* files
  • Created SECURITY_CLEANUP.md with instructions to purge git history

Action Required:

  • Run git history cleanup (see SECURITY_CLEANUP.md)
  • Rotate all compromised secrets
  • Create .env.production from template with real secrets

Files:

  • .env.production.template (new)
  • .gitignore (updated)
  • SECURITY_CLEANUP.md (new)

✅ Fix 2 & 3: Fixed Hardcoded Secrets & CORS Wildcards

Problem:

  • Config files had hardcoded default passwords
  • ALLOWED_HOSTS = ["*"] permitted any host
  • SECRET_KEY default was "your-secret-key-here"

Fixed:

  • config.py: Added _get_required_env(), which fails fast in production if a required secret is missing (see the sketch after this list)
  • database.py: Removed the hardcoded DB URL; the app fails if it is not set in production
  • CORS ALLOWED_ORIGINS and ALLOWED_HOSTS are now env-driven and strict by default
  • Production mode requires NODE_ENV=production and all secrets set
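
A minimal sketch of the fail-fast pattern, assuming config.py looks roughly like this (the function name matches the item above; the body and messages are illustrative):

# Illustrative sketch of the fail-fast env lookup in config.py.
import os
import sys

IS_PRODUCTION = os.getenv("NODE_ENV") == "production"

def _get_required_env(name: str, dev_default: str | None = None) -> str:
    value = os.getenv(name)
    if value:
        return value
    if IS_PRODUCTION:
        # Better to crash at boot than to run with a missing secret.
        print(f"FATAL: Required environment variable {name} is not set", file=sys.stderr)
        sys.exit(1)
    print(f"WARNING: Using dev default for {name}. Set this in production!", file=sys.stderr)
    return dev_default or ""

SECRET_KEY = _get_required_env("SECRET_KEY", dev_default="dev-only-secret")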

Files Changed:

  • apps/apigw/app/config.py
  • apps/apigw/app/database.py
  • infra/docker/docker-compose.prod.yml

Production Behavior:

  • Missing SECRET_KEY, DATABASE_URL, or other required vars → app exits with error
  • ALLOWED_HOSTS with wildcard * → deployment script rejects it
  • Development mode still allows defaults (with warnings)

✅ Fix 4, 5, 6: Added Auth, Rate Limiting, and Real Health Checks

Problem:

  • No authentication on any endpoint (admin, missions, and detections were all public)
  • No rate limiting (vulnerable to DoS and brute-force attacks)
  • Health check always returned "healthy" with a static timestamp

Fixed:

Authentication:

  • Created apps/apigw/app/auth.py with JWT middleware
  • Dev mode: allows unauthenticated access (with a warning header)
  • Production mode: requires Bearer <token> for all endpoints except /health, /readiness, /metrics, and /docs
  • Added get_current_user() and require_permission() dependencies for route-level auth (sketch below)
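
A hedged sketch of the route-level dependencies in auth.py using PyJWT (the helper names come from the list above; the claims, error handling, and config wiring are assumptions):

# Illustrative sketch -- see apps/apigw/app/auth.py for the real code.
import os

import jwt  # PyJWT
from fastapi import Depends, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

SECRET_KEY = os.environ.get("SECRET_KEY", "")  # config.py owns this in the real app
bearer = HTTPBearer(auto_error=False)

async def get_current_user(
    creds: HTTPAuthorizationCredentials | None = Depends(bearer),
) -> dict:
    if creds is None:
        raise HTTPException(status_code=401, detail="Missing bearer token")
    try:
        return jwt.decode(creds.credentials, SECRET_KEY, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

def require_permission(permissions: list[str]):
    # Returns a dependency that also checks a "permissions" claim (assumed name).
    async def checker(user: dict = Depends(get_current_user)) -> dict:
        if not set(permissions) <= set(user.get("permissions", [])):
            raise HTTPException(status_code=403, detail="Insufficient permissions")
        return user
    return checker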

Rate Limiting:

  • In-memory rate limiter (100 req/60s per IP by default; sketch below)
  • Returns 429 Too Many Requests with a Retry-After header
  • Adds X-RateLimit-* headers to responses
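
The limiter is roughly a sliding window per client IP; a minimal sketch (the window constants mirror the defaults above, the rest is illustrative):

# Illustrative in-memory limiter: per-process only (see Next Steps for Redis).
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 100
_hits: dict[str, deque[float]] = defaultdict(deque)

def check_rate_limit(client_ip: str) -> tuple[bool, int]:
    """Return (allowed, remaining) for a request from client_ip."""
    now = time.monotonic()
    window = _hits[client_ip]
    # Evict timestamps that fell out of the 60-second window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return False, 0  # caller responds 429 with Retry-After
    window.append(now)
    return True, MAX_REQUESTS - len(window)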

Health Checks:

  • /health → Liveness (is the process alive?)
  • /readiness → Readiness (can it serve traffic? checks DB, Redis, MQTT; sketch below)
  • Returns 503 if dependencies are unhealthy
  • Real timestamp (not hardcoded)
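
A sketch of the /readiness shape, assuming an async SQLAlchemy engine and a redis.asyncio client (wiring and variable names are assumptions; the MQTT check is omitted for brevity):

# Illustrative readiness probe: checks real dependencies, 503 on failure.
import os
from datetime import datetime, timezone

import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

app = FastAPI()
engine = create_async_engine(os.environ["DATABASE_URL"])
redis_client = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))

@app.get("/readiness")
async def readiness():
    checks = {}
    try:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception as exc:
        checks["database"] = f"error: {exc}"
    try:
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception as exc:
        checks["redis"] = f"error: {exc}"
    healthy = all(v == "ok" for v in checks.values())
    # 503 tells the load balancer to stop routing traffic here.
    return JSONResponse(
        status_code=200 if healthy else 503,
        content={"checks": checks, "timestamp": datetime.now(timezone.utc).isoformat()},
    )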

Files Changed:

  • apps/apigw/app/auth.py (new)
  • apps/apigw/app/main.py (integrated middleware + health checks)
  • apps/apigw/requirements.txt (added PyJWT, redis[async])

Usage:

# Protect a route with a permission check
from app.auth import get_current_user, require_permission
from fastapi import Depends

@app.post("/missions")
async def create_mission(
    # FastAPI resolves the dependency first; a 401/403 is raised before the
    # handler runs if the token is missing or lacks the required permission.
    user: dict = Depends(require_permission(["mission:create"]))
):
    ...

✅ Fix 7: Fixed Next.js Console Dockerfile

Problem:

  • Dockerfile ran npm install --only=production before build
  • Next.js needs devDependencies to build
  • Single-stage build bundled unnecessary dependencies in final image

Fixed:

  • Multi-stage build:
    1. deps: Install all deps
    2. builder: Build app
    3. runner: Copy only production artifacts
  • Enabled output: 'standalone' in next.config.js
  • Final image: ~50% smaller, no devDependencies

Files Changed:

  • apps/console/Dockerfile
  • apps/console/next.config.js

✅ Fix 8: Added Python CI with Tests

Problem:

  • CI only tested Node/TypeScript
  • FastAPI changes never tested
  • Integration tests existed but weren't run

Fixed:

  • Added python job to .github/workflows/ci.yml
  • Spins up Postgres + Redis services
  • Runs:
    • ruff (linting)
    • mypy (type checking)
    • pytest with coverage
  • Uploads coverage to Codecov

Files Changed:

  • .github/workflows/ci.yml

✅ Fix 9 & 10: Fixed Database Migrations & Deployment Script

Problem (Migrations):

  • Base.metadata.create_all() ran on every API startup
  • In Kubernetes with multiple replicas → race conditions, schema corruption

Fixed:

  • Removed auto-migration from main.py startup (sketch below)
  • Added a comment: run alembic upgrade head separately before deploying
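
A minimal sketch of the change in main.py (illustrative; the real startup does more):

# Illustrative sketch: startup no longer touches the schema.
from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Previously Base.metadata.create_all() ran here on every boot, racing
    # against other replicas. Schema changes are now applied out-of-band:
    # run `alembic upgrade head` once, before rolling out new pods.
    yield

app = FastAPI(lifespan=lifespan)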

Problem (Deployment Script):

  • Expected env vars but did no validation
  • S3 backup, DB dump, and AWS ECR steps assumed credentials existed
  • Rollback logic broke on the first deployment (no previous revision)
  • .env.production was loaded twice

Fixed:

  • validate_environment():
    • Loads .env.production or exits
    • Checks all required vars
    • Rejects insecure passwords (wildfire123, admin123)
    • Rejects wildcard in ALLOWED_HOSTS
  • build_and_push_images():
    • Validates AWS_ACCOUNT_ID is set
    • Checks ECR login success
  • rollback_deployment():
    • Checks deployment exists before rollback
    • Handles first deployment gracefully
    • Returns error code if rollback fails

Files Changed:

  • apps/apigw/app/main.py
  • scripts/deploy-production.sh

Production Deployment Checklist

Before deploying to production:

1. Secrets

  • Copy .env.production.template to .env.production
  • Generate all secrets: openssl rand -hex 32
  • Set ALLOWED_ORIGINS to real domains (no wildcards)
  • Set ALLOWED_HOSTS to real hostnames (no *)
  • Set NODE_ENV=production

2. Git History

  • Follow SECURITY_CLEANUP.md to purge .env from history
  • Rotate any secrets that were in git

3. Dependencies

  • Run pnpm install (Node packages)
  • Run pip install -r apps/apigw/requirements.txt (Python packages)

4. Database

  • Run migrations separately: cd apps/apigw && alembic upgrade head
  • Do NOT rely on auto-migration at startup

5. Infrastructure

  • Configure K8s secrets for postgresql-credentials, redis-credentials, jwt-secrets, etc.
  • Set up AWS Secrets Manager or Vault (recommended)
  • Update load balancer health checks to use /readiness

6. CI/CD

  • Ensure all tests pass: pnpm test and pytest
  • Enable branch protection rules
  • Enable secret scanning (GitHub Advanced Security)

7. Deployment Script

  • Set AWS_ACCOUNT_ID in environment
  • Verify AWS credentials: aws sts get-caller-identity
  • Run: ./scripts/deploy-production.sh

Testing the Fixes

Local Development

  1. Start services:

    make docker-dev
  2. API should start with warnings (dev mode allows defaults):

    WARNING: Using dev default for DATABASE_URL. Set this in production!
    WARNING: Using default JWT secret in development mode
    
  3. Health checks:

    curl http://localhost:8000/health       # Always returns 200
    curl http://localhost:8000/readiness    # Returns 503 if DB/Redis down
  4. Auth in dev mode:

    # Works without token (dev mode)
    curl http://localhost:8000/api/v1/telemetry
    
    # Response includes warning header:
    # X-Auth-Warning: No authentication in dev mode
  5. Rate limiting:

    # Send 101 requests rapidly
    for i in {1..101}; do curl http://localhost:8000/api/v1/telemetry; done
    
    # 101st request:
    # HTTP 429 Too Many Requests
    # Retry-After: 60

Production Mode

  1. Set environment:

    export NODE_ENV=production
    export DATABASE_URL=postgresql://...
    export SECRET_KEY=$(openssl rand -hex 32)
    export ALLOWED_ORIGINS=https://console.example.com
    export ALLOWED_HOSTS=api.example.com
  2. API should fail if secrets missing:

    unset SECRET_KEY
    uvicorn app.main:app
    
    # Output:
    # FATAL: Required environment variable SECRET_KEY is not set
    # exit code 1
  3. Auth required:

    # Without token → 401
    curl http://localhost:8000/api/v1/missions
    
    # With token → 200
    curl -H "Authorization: Bearer <token>" http://localhost:8000/api/v1/missions

Monitoring Post-Deploy

Key Metrics to Watch

  1. Health checks: /readiness returning 200
  2. Error rate: Should not spike after deploy
  3. Auth failures: Monitor 401/403 responses (should be low after migration)
  4. Rate limit hits: Monitor 429 responses

Alerts to Configure

  1. Critical:

    • /readiness failing for >2 minutes
    • Error rate >1%
    • Database connection failures
  2. Warning:

    • Rate limit hit rate >10 req/min
    • Auth failure rate >5%

Rollback Plan

If deployment fails:

  1. Automatic: Script will attempt rollback if ROLLBACK_ON_FAILURE=true

  2. Manual:

    kubectl rollout undo deployment/wildfire-api-gateway -n wildfire-ops
    kubectl rollout undo deployment/wildfire-console -n wildfire-ops
  3. If first deployment fails: No previous revision exists

    • Delete failed resources manually
    • Fix issues
    • Re-deploy

Next Steps (Optional Hardening)

While all critical issues are fixed, consider these additional improvements:

  1. Replace the in-memory rate limiter with a Redis-backed one for multi-instance deployments (see the sketch after this list)
  2. Add request signing for service-to-service calls
  3. Enable API request/response logging for audit trails
  4. Add OpenTelemetry distributed tracing
  5. Set up Secrets Manager rotation (AWS Secrets Manager, Vault)
  6. Enable mTLS between services
  7. Add input validation to all API endpoints (Pydantic models)
  8. Set up WAF (Web Application Firewall) rules
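
For item 1, a sketch of a Redis-backed replacement using an atomic INCR plus EXPIRE fixed window (key naming is an assumption):

# Illustrative Redis-backed limiter: the counter is shared across instances.
import redis.asyncio as redis

WINDOW_SECONDS = 60
MAX_REQUESTS = 100

async def check_rate_limit(client: redis.Redis, client_ip: str) -> bool:
    key = f"ratelimit:{client_ip}"
    count = await client.incr(key)  # atomic across all API instances
    if count == 1:
        # First hit in this window: start the 60-second expiry clock.
        await client.expire(key, WINDOW_SECONDS)
    return count <= MAX_REQUESTS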

Questions?

If you encounter issues:

  1. Check logs: kubectl logs -f deployment/wildfire-api-gateway
  2. Verify secrets: kubectl get secrets -n wildfire-ops
  3. Test health: curl https://api.yourdomain.com/readiness
  4. Review this document's checklists

For development issues, ensure NODE_ENV=development is set (allows defaults with warnings).