fix: auto-update P2P address on AWS instance restart #1195
EC2 nodes lose P2P connectivity after a stop/start cycle when AWS assigns a new public IP address. The external_address config retained the old IP, preventing peer connections.
- Auto-detect public IP on every container start
- Update external_address if IP changed
- Extract P2P port from existing config to support custom ports
- Respect TN_EXTERNAL_ADDRESS env var for Elastic IP users
- Add container wait logic in tn-node-configure script
- Added IP detection before config check
- Extract P2P listen port from [p2p] section
- Compare current vs new external address
- Update config.toml if IP changed
- Log all IP changes for visibility
- Added 60s container startup wait in tn-node-configure
- Prevents race condition on first configuration
- Ensures containers are running before reporting success
resolves: trufnetwork/truf-network#1262
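The compare-and-update step described above can be sketched as a small POSIX shell helper. This is only an illustration, not the code from this PR: the function name `update_external_address` is hypothetical, and it assumes `external_address` sits unindented at the start of a line in config.toml, quoted with either single or double quotes.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): rewrite external_address in a
# config.toml only when the detected address differs from the stored one.
# Usage: update_external_address <config_file> <ip> <port>
update_external_address() {
  config_file=$1
  new_external="$2:$3"

  # Assumption: the key starts its line. Bail out if it is missing.
  if ! grep -q '^external_address[[:space:]]*=' "$config_file"; then
    echo "no external_address key in $config_file" >&2
    return 1
  fi

  # Extract the current value without its surrounding quotes.
  current_external=$(sed -n 's/^external_address[[:space:]]*=[[:space:]]*["'\'']\{0,1\}\([^"'\'']*\).*$/\1/p' "$config_file" | head -n 1)

  if [ "$current_external" = "$new_external" ]; then
    echo "external_address unchanged: $current_external"
    return 0
  fi

  # Rewrite via a temp file, then copy back so the original file keeps its
  # inode and permissions (useful when the config is bind-mounted).
  tmp_file=$(mktemp)
  sed "s|^external_address[[:space:]]*=.*|external_address = \"$new_external\"|" "$config_file" > "$tmp_file"
  cat "$tmp_file" > "$config_file"
  rm -f "$tmp_file"
  echo "external_address updated: ${current_external:-<empty>} -> $new_external"
}
```

A call such as `update_external_address /root/.kwild/config.toml "$PUBLIC_IP" 6600` would log either the unchanged address or the old and new values, which matches the "log all IP changes" item in the list.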
Bug Report Checklist
@MicBun, please use git blame and provide a link to the commit that introduced this bug.
Walkthrough
Introduces a runtime wait loop in the AMI stack to confirm that the tn-node containers are up, and revises the tn-node startup script in the docker-compose template to auto-detect the public IP and set or update the p2p external_address, both during initial configuration and when a config already exists, while standardizing port handling.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant U as User/AMI Init
participant S as AMI Stack
participant DC as Docker Compose
participant TN as tn-node
U->>S: Trigger deployment
S->>DC: Start tn-node service
S->>S: Poll up to 60s for container status
alt Containers up
S->>U: Log success and continue
else Timeout
S->>U: Warn with log-check instructions
end
sequenceDiagram
autonumber
participant DC as Docker Compose
participant SS as tn-node Startup Script
participant FS as Filesystem (/root/.kwild)
participant NET as Network (IP Detect)
DC->>SS: Run tn-node entry script
SS->>NET: Resolve public IP (TN_EXTERNAL_ADDRESS / external)
SS->>SS: Determine P2P port (config or default 6600)
alt Existing config
SS->>FS: Read config.toml
SS->>FS: Update external_address if changed
else Fresh setup
SS->>SS: Set EXTERNAL_FLAG (only if IP present)
SS->>FS: Generate new config
end
SS->>DC: Start tn-node with updated settings
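The "Determine P2P port (config or default 6600)" step in the diagram above could look roughly like the following. This is a sketch under assumptions: the helper name `extract_p2p_port` is hypothetical, and the `listen` key name inside `[p2p]` is assumed rather than confirmed by this PR.

```shell
#!/bin/sh
# Hypothetical helper: print the port of the `listen` key inside the [p2p]
# table, falling back to 6600 when the file or the key is missing.
# Usage: extract_p2p_port <config_file>
extract_p2p_port() {
  config_file=$1
  port=""
  if [ -f "$config_file" ]; then
    port=$(awk '
      /^\[/ { in_p2p = ($0 ~ /^\[p2p\]/) }   # track which table we are in
      in_p2p && /^listen[ \t]*=/ {
        n = split($0, parts, ":")            # port follows the last colon
        gsub(/[^0-9]/, "", parts[n])         # drop quotes and whitespace
        print parts[n]
        exit
      }
    ' "$config_file")
  fi
  # Default taken from the diagram above.
  echo "${port:-6600}"
}
```

Scoping the match to the `[p2p]` table avoids picking up a `listen` key from another section, which a plain `grep listen` would not guarantee.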
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
@pr-time-tracker bug commit: not caused by a previous commit
Actionable comments posted: 3
🧹 Nitpick comments (3)
deployments/infra/stacks/docker-compose.template.yml (2)
60-60: Simplify external address extraction.
The sed command uses a complex regex that may not handle all valid external_address formats. Consider using a more robust approach.
Apply this diff to use a simpler extraction:
- CURRENT_EXTERNAL=$(grep "external_address" /root/.kwild/config.toml | sed 's/.*= *[\"'\'']*\([^\"'\'']*\)[\"'\'']*$/\1/')
+ CURRENT_EXTERNAL=$(grep "^external_address" /root/.kwild/config.toml | sed 's/^external_address[[:space:]]*=[[:space:]]*[\"'\'']*\([^\"'\'']*\)[\"'\'']*.*$/\1/')
This anchors the pattern to line start and uses character classes for whitespace, making it more explicit and maintainable.
46-50: Consider adding retry logic for IP detection.
The IP lookup against https://checkip.amazonaws.com/ has a 2-second timeout but no retry logic. Transient network issues during container startup could cause it to fail unnecessarily.
Consider adding retry logic:
  if [ -z "$$PUBLIC_IP" ]; then
+   MAX_RETRIES=3
+   RETRY=0
+   while [ $$RETRY -lt $$MAX_RETRIES ] && [ -z "$$PUBLIC_IP" ]; do
    if command -v wget >/dev/null 2>&1; then
      PUBLIC_IP=$$(wget -T 2 -qO- https://checkip.amazonaws.com/ 2>/dev/null || true)
    elif command -v curl >/dev/null 2>&1; then
      PUBLIC_IP=$$(curl -m 2 -s https://checkip.amazonaws.com/ 2>/dev/null || true)
    fi
+   [ -z "$$PUBLIC_IP" ] && RETRY=$$((RETRY + 1)) && sleep 1
+   done
  fi
deployments/infra/stacks/ami_pipeline_stack.go (1)
377-386: Consider making timeout configurable.
The 60-second timeout is hardcoded. Depending on system resources and network conditions, container startup time can vary significantly. Consider making this configurable or increasing the default.
Apply this diff to allow environment-based configuration:
  # Wait for containers to actually start
  echo "Waiting for containers to start..."
- MAX_WAIT=60
+ MAX_WAIT=${TN_STARTUP_TIMEOUT:-60}
  ELAPSED=0
+ echo "Timeout set to ${MAX_WAIT}s"
  while [ $ELAPSED -lt $MAX_WAIT ]; do
This allows operators to override the timeout via environment variable if needed, while maintaining the 60-second default.
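For illustration, a complete wait loop with the configurable timeout suggested above might look like this. It is a sketch, not the script in ami_pipeline_stack.go: the function name `wait_for_containers` is hypothetical, and the readiness check is passed in as a command so the loop itself stays independent of Docker.

```shell
#!/bin/sh
# Hypothetical helper: poll a readiness command once per second until it
# succeeds or the timeout elapses.
# Usage: wait_for_containers <check-command> [args...]
wait_for_containers() {
  # TN_STARTUP_TIMEOUT is the variable name proposed in the review comment.
  max_wait=${TN_STARTUP_TIMEOUT:-60}
  elapsed=0
  while [ "$elapsed" -lt "$max_wait" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "Containers are up after ${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "Warning: containers not up after ${max_wait}s; check the logs" >&2
  return 1
}
```

Assuming Docker Compose v2 is available, one possible check is `wait_for_containers sh -c 'docker compose ps --status running --quiet | grep -q .'`. Returning a non-zero status on timeout lets the caller decide whether to warn and continue or to abort.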
📒 Files selected for processing (2)
deployments/infra/stacks/ami_pipeline_stack.go (1 hunks)
deployments/infra/stacks/docker-compose.template.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: acceptance-test
🔇 Additional comments (2)
deployments/infra/stacks/docker-compose.template.yml (2)
43-52: LGTM: IP detection logic is well-structured.
The IP detection properly prioritizes the TN_EXTERNAL_ADDRESS environment variable and falls back to https://checkip.amazonaws.com/ with appropriate timeouts. The support for both wget and curl ensures compatibility across different environments.
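The detection order praised here, combined with the retry idea from the nitpick, can be sketched as follows. The helper name `detect_public_ip` is hypothetical and the retry count of 3 is an assumption borrowed from the nitpick; the fetch command is injected so the logic can be exercised without network access.

```shell
#!/bin/sh
# Hypothetical helper: resolve the public address, preferring an explicit
# override and retrying the lookup a few times otherwise.
# Usage: detect_public_ip <fetch-command> [args...]
detect_public_ip() {
  # An explicit override (for example an Elastic IP) always wins.
  if [ -n "${TN_EXTERNAL_ADDRESS:-}" ]; then
    echo "$TN_EXTERNAL_ADDRESS"
    return 0
  fi

  max_retries=3   # assumed value, mirroring the review suggestion
  attempt=1
  while :; do
    # Strip whitespace such as the trailing newline from the response.
    ip=$("$@" 2>/dev/null | tr -d '[:space:]')
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    if [ "$attempt" -ge "$max_retries" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
}
```

With curl installed, `detect_public_ip curl -m 2 -s https://checkip.amazonaws.com/` reuses the same endpoint and timeout as the template. A non-zero return leaves it to the caller to skip setting the external address, as the fresh-setup branch does when no IP is present.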
76-77: LGTM: Simplified external flag logic.
The simplified approach for setting the external address flag is clearer and more maintainable than the previous implementation. Using port 6600 as the default is appropriate for Kwil's P2P protocol.
Merged immediately, so changes can be reflected |