Date: 2025-11-13
Status: ✅ All 10 critical fixes completed
This document outlines the launch-blocking security and reliability issues that were identified and fixed in the Sentinel codebase.
Problem: .env file containing secrets was committed to git. Default passwords (wildfire123, admin123) exposed in repo history.
Fixed:
- Removed .env from git tracking
- Created .env.production.template with generation instructions
- Updated .gitignore to block all .env* files
- Created SECURITY_CLEANUP.md with instructions to purge git history
Action Required:
- Run git history cleanup (see SECURITY_CLEANUP.md)
- Rotate all compromised secrets
- Create .env.production from template with real secrets
Files:
- .env.production.template (new)
- .gitignore (updated)
- SECURITY_CLEANUP.md (new)
Problem:
- Config files had hardcoded default passwords
- ALLOWED_HOSTS = ["*"] permitted any host
- SECRET_KEY default was "your-secret-key-here"
Fixed:
- config.py: Added _get_required_env(), which fails fast in production if secrets are missing
- database.py: Removed hardcoded DB URL; fails if not set in production
- CORS ALLOWED_ORIGINS and ALLOWED_HOSTS are now env-driven and strict by default
- Production mode requires NODE_ENV=production + all secrets set
Files Changed:
- apps/apigw/app/config.py
- apps/apigw/app/database.py
- infra/docker/docker-compose.prod.yml
Production Behavior:
- Missing SECRET_KEY, DATABASE_URL, or other required vars → app exits with error
- ALLOWED_HOSTS with wildcard * → deployment script rejects it
- Development mode still allows defaults (with warnings)
Problem:
- No authentication on any endpoint (admin, missions, detections all public)
- No rate limiting (DoS/brute-force vulnerable)
- Health check always returned "healthy" with static timestamp
Fixed:
Authentication:
- Created apps/apigw/app/auth.py with JWT middleware
- Dev mode: allows unauthenticated access (with warning header)
- Production mode: requires Bearer <token> for all endpoints except /health, /readiness, /metrics, /docs
- Added get_current_user() and require_permission() dependencies for route-level auth
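To illustrate the decisions the middleware has to make before any JWT is verified, here is a small standard-library sketch. The helper names and the permissions claim are hypothetical and are not taken from auth.py; only the list of exempt paths comes from this document. Token signature verification (done with PyJWT in the real code) is deliberately left out.

```python
from typing import Optional

# Paths this document lists as exempt from authentication in production.
PUBLIC_PATHS = ("/health", "/readiness", "/metrics", "/docs")


def is_public_path(path: str) -> bool:
    """True if the path is an exempt endpoint or nested under one."""
    return any(path == p or path.startswith(p + "/") for p in PUBLIC_PATHS)


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, else None."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def has_permissions(user: dict, required: list) -> bool:
    """True if the user's 'permissions' claim (assumed name) covers every required one."""
    return set(required) <= set(user.get("permissions", []))
```

Matching on an exact path or a path prefix followed by a slash avoids accidentally exempting look-alike routes such as /healthz.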
Rate Limiting:
- In-memory rate limiter (100 req/60s per IP by default)
- Returns 429 Too Many Requests with Retry-After header
- Adds X-RateLimit-* headers to responses
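One way an in-memory limiter with these defaults can work is a per-key sliding window. The sketch below is an assumption about the approach, not the code in auth.py; the class name, the injectable clock, and the specific X-RateLimit-Limit and X-RateLimit-Remaining header names are illustrative.

```python
import math
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class InMemoryRateLimiter:
    """Sliding window: at most `limit` requests per `window` seconds per key."""

    def __init__(self, limit: int = 100, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so tests do not need to sleep
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def check(self, key: str) -> Tuple[bool, Dict[str, str]]:
        """Record a request for `key`; return (allowed, headers to attach)."""
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()

        if len(hits) >= self.limit:
            # Seconds until the oldest recorded request leaves the window.
            retry_after = max(1, math.ceil(self.window - (now - hits[0])))
            return False, {
                "Retry-After": str(retry_after),
                "X-RateLimit-Limit": str(self.limit),
                "X-RateLimit-Remaining": "0",
            }

        hits.append(now)
        return True, {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - len(hits)),
        }
```

A middleware would call check(client_ip) and return 429 with the given headers when the first element is False. Because the state lives in one process, each replica counts separately.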
Health Checks:
- /health → Liveness (is process alive?)
- /readiness → Readiness (can serve traffic? checks DB, Redis, MQTT)
- Returns 503 if dependencies unhealthy
- Real timestamp (not hardcoded)
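The readiness logic can be sketched as a function that runs a set of dependency checks and maps the result to a status code. This is an illustration under assumptions: the function name, the response body shape, and the check callables are hypothetical, and the real endpoint in main.py presumably uses async clients for the database, Redis, and MQTT.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run each dependency check; return (HTTP status, JSON-serializable body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as an unhealthy dependency.
            results[name] = False

    healthy = all(results.values())
    body = {
        "status": "ready" if healthy else "not_ready",
        "checks": results,
        # Computed on every call, unlike the old static timestamp.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return (200 if healthy else 503), body
```

Keeping /health free of dependency checks and putting them only in /readiness means a database outage takes a pod out of rotation without triggering a restart loop.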
Files Changed:
- apps/apigw/app/auth.py (new)
- apps/apigw/app/main.py (integrated middleware + health checks)
- apps/apigw/requirements.txt (added PyJWT, redis[async])
Usage:
# Protect a route
from app.auth import get_current_user, require_permission
from fastapi import Depends

@app.post("/missions")
async def create_mission(
    user: dict = Depends(require_permission(["mission:create"]))
):
    ...

Problem:
- Dockerfile ran npm install --only=production before build
- Next.js needs devDependencies to build
- Single-stage build bundled unnecessary dependencies in final image
Fixed:
- Multi-stage build:
  - deps: Install all deps
  - builder: Build app
  - runner: Copy only production artifacts
- Enabled output: 'standalone' in next.config.js
- Final image: ~50% smaller, no devDependencies
Files Changed:
- apps/console/Dockerfile
- apps/console/next.config.js
Problem:
- CI only tested Node/TypeScript
- FastAPI changes never tested
- Integration tests existed but weren't run
Fixed:
- Added python job to .github/workflows/ci.yml
- Spins up Postgres + Redis services
- Runs:
  - ruff (linting)
  - mypy (type checking)
  - pytest with coverage
- Uploads coverage to Codecov
Files Changed:
.github/workflows/ci.yml
Problem (Migrations):
- Base.metadata.create_all() ran on every API startup
- In Kubernetes with multiple replicas → race conditions, schema corruption
Fixed:
- Removed auto-migration from main.py startup
- Added comment: run alembic upgrade head separately before deploy
Problem (Deployment Script):
- Expected env vars but no validation
- S3 backup, DB dump, AWS ECR assumed credentials existed
- Rollback logic broke on first deployment (no previous revision)
- .env.production loaded twice
Fixed:
- validate_environment():
  - Loads .env.production or exits
  - Checks all required vars
  - Rejects insecure passwords (wildfire123, admin123)
  - Rejects wildcard in ALLOWED_HOSTS
- build_and_push_images():
  - Validates AWS_ACCOUNT_ID is set
  - Checks ECR login success
- rollback_deployment():
  - Checks deployment exists before rollback
  - Handles first deployment gracefully
  - Returns error code if rollback fails
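The checks performed by validate_environment() can be illustrated in Python, although the real function lives in the shell script scripts/deploy-production.sh. The sketch below is a hypothetical port for illustration only: the REQUIRED_VARS list is an assumption (the script may require more), and the substring match on passwords is one possible way to catch a default password embedded in a connection URL.

```python
from typing import List, Mapping

# Default passwords named in this document as compromised.
INSECURE_PASSWORDS = ("wildfire123", "admin123")
# Assumed list of required variables; the real script may check others.
REQUIRED_VARS = ("SECRET_KEY", "DATABASE_URL", "ALLOWED_HOSTS", "ALLOWED_ORIGINS")


def validate_environment(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the environment passes."""
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    for name, value in env.items():
        # Substring match also catches passwords embedded in URLs.
        if any(password in value for password in INSECURE_PASSWORDS):
            problems.append(f"{name} uses a known insecure password")

    hosts = [host.strip() for host in env.get("ALLOWED_HOSTS", "").split(",")]
    if "*" in hosts:
        problems.append("ALLOWED_HOSTS must not contain a wildcard")
    return problems
```

Collecting every problem before exiting, instead of stopping at the first, lets an operator fix the whole .env.production file in one pass.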
Files Changed:
- apps/apigw/app/main.py
- scripts/deploy-production.sh
Before deploying to production:
- Copy .env.production.template to .env.production
- Generate all secrets: openssl rand -hex 32
- Set ALLOWED_ORIGINS to real domains (no wildcards)
- Set ALLOWED_HOSTS to real hostnames (no *)
- Set NODE_ENV=production
- Follow SECURITY_CLEANUP.md to purge .env from history
- Rotate any secrets that were in git
- Run pnpm install (Node packages)
- Run pip install -r apps/apigw/requirements.txt (Python packages)
- Run migrations separately: cd apps/apigw && alembic upgrade head
- Do NOT rely on auto-migration at startup
- Configure K8s secrets for postgresql-credentials, redis-credentials, jwt-secrets, etc.
- Set up AWS Secrets Manager or Vault (recommended)
- Update load balancer health checks to use /readiness
- Ensure all tests pass: pnpm test and pytest
- Enable branch protection rules
- Enable secret scanning (GitHub Advanced Security)
- Set AWS_ACCOUNT_ID in environment
- Verify AWS credentials: aws sts get-caller-identity
- Run: ./scripts/deploy-production.sh
- Start services:
    make docker-dev
- API should start with warnings (dev mode allows defaults):
    WARNING: Using dev default for DATABASE_URL. Set this in production!
    WARNING: Using default JWT secret in development mode
- Health checks:
    curl http://localhost:8000/health     # Always returns 200
    curl http://localhost:8000/readiness  # Returns 503 if DB/Redis down
- Auth in dev mode:
    # Works without token (dev mode)
    curl http://localhost:8000/api/v1/telemetry
    # Response includes warning header:
    # X-Auth-Warning: No authentication in dev mode
- Rate limiting:
    # Send 101 requests rapidly
    for i in {1..101}; do curl http://localhost:8000/api/v1/telemetry; done
    # 101st request:
    # HTTP 429 Too Many Requests
    # Retry-After: 60
- Set environment:
    export NODE_ENV=production
    export DATABASE_URL=postgresql://...
    export SECRET_KEY=$(openssl rand -hex 32)
    export ALLOWED_ORIGINS=https://console.example.com
    export ALLOWED_HOSTS=api.example.com
- API should fail if secrets missing:
    unset SECRET_KEY
    uvicorn app.main:app
    # Output:
    # FATAL: Required environment variable SECRET_KEY is not set
    # exit code 1
- Auth required:
    # Without token → 401
    curl http://localhost:8000/api/v1/missions
    # With token → 200
    curl -H "Authorization: Bearer <token>" http://localhost:8000/api/v1/missions
- Health checks: /readiness returning 200
- Error rate: Should not spike after deploy
- Auth failures: Monitor 401/403 responses (should be low after migration)
- Rate limit hits: Monitor 429 responses
- Critical:
  - /readiness failing for >2 minutes
  - Error rate >1%
  - Database connection failures
- Warning:
  - Rate limit hit rate >10 req/min
  - Auth failure rate >5%
If deployment fails:
- Automatic: Script will attempt rollback if ROLLBACK_ON_FAILURE=true
- Manual:
    kubectl rollout undo deployment/wildfire-api-gateway -n wildfire-ops
    kubectl rollout undo deployment/wildfire-console -n wildfire-ops
- If first deployment fails: No previous revision exists
  - Delete failed resources manually
  - Fix issues
  - Re-deploy
While all critical issues are fixed, consider these additional improvements:
- Replace in-memory rate limiter with Redis-backed (for multi-instance deployments)
- Add request signing for service-to-service calls
- Enable API request/response logging for audit trails
- Add OpenTelemetry distributed tracing
- Set up Secrets Manager rotation (AWS Secrets Manager, Vault)
- Enable mTLS between services
- Add input validation to all API endpoints (Pydantic models)
- Set up WAF (Web Application Firewall) rules
If you encounter issues:
- Check logs: kubectl logs -f deployment/wildfire-api-gateway -n wildfire-ops
- Verify secrets: kubectl get secrets -n wildfire-ops
- Test health: curl https://api.yourdomain.com/readiness
- Review this document's checklists
For development issues, ensure NODE_ENV=development is set (allows defaults with warnings).