fix: auto-update P2P address on AWS instance restart #1195

Merged: MicBun merged 1 commit into main from autoUpdateP2P on Oct 8, 2025

@MicBun MicBun commented Oct 8, 2025

EC2 nodes lose P2P connectivity after stop/start when AWS assigns a new public IP address. The external_address config retained the old IP, preventing peer connections.

  • Auto-detect public IP on every container start
  • Update external_address if IP changed
  • Extract P2P port from existing config to support custom ports
  • Respect TN_EXTERNAL_ADDRESS env var for Elastic IP users
  • Add container wait logic in tn-node-configure script

  • Added IP detection before config check
  • Extract P2P listen port from [p2p] section
  • Compare current vs new external address
  • Update config.toml if IP changed
  • Log all IP changes for visibility

  • Added 60s container startup wait in tn-node-configure
  • Prevents race condition on first configuration
  • Ensures containers are running before reporting success

resolves: https://github.com/trufnetwork/truf-network/issues/1262
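
The update flow described above (read the P2P port from the existing config, compare addresses, rewrite only on change) can be sketched roughly as follows. This is an illustration, not the code from the PR: the function names, the `listen` key name inside `[p2p]`, and the sample addresses are all assumptions.

```shell
#!/bin/sh
# Sketch only. Assumed: a "listen" key inside [p2p], and a single
# external_address line in the file. Function names are hypothetical.

# Print the port of the [p2p] listen address, or 6600 if none is found.
p2p_port() {
  port=$(awk '
    /^\[/ { in_p2p = ($0 == "[p2p]") }
    in_p2p && /^listen[ \t]*=/ { print; exit }
  ' "$1" | sed -n 's/.*:\([0-9][0-9]*\)[^0-9]*$/\1/p')
  echo "${port:-6600}"
}

# Rewrite external_address in the config ($1) if the detected IP ($2) changed.
update_external_address() {
  cfg_path=$1
  new_addr="$2:$(p2p_port "$cfg_path")"
  current_addr=$(sed -n 's/^external_address[[:space:]]*=[[:space:]]*//p' "$cfg_path" \
    | tr -d "\"'" | head -n 1)

  if [ "$current_addr" = "$new_addr" ]; then
    echo "external_address unchanged: $current_addr"
    return 0
  fi

  echo "external_address changed: '$current_addr' -> '$new_addr'"
  tmp=$(mktemp)
  # Write through a temp file, then copy back so the original file is kept in place.
  sed "s|^external_address[[:space:]]*=.*|external_address = \"$new_addr\"|" \
    "$cfg_path" > "$tmp" && cat "$tmp" > "$cfg_path"
  rm -f "$tmp"
}

# Demo on a throwaway file that uses a custom port.
demo_cfg=$(mktemp)
printf '%s\n' '[p2p]' 'listen = "0.0.0.0:7700"' \
  'external_address = "203.0.113.10:7700"' > "$demo_cfg"
update_external_address "$demo_cfg" "198.51.100.7"
demo_port=$(p2p_port "$demo_cfg")
demo_result=$(grep '^external_address' "$demo_cfg")
echo "$demo_result"
rm -f "$demo_cfg"
```

Reading the port from the existing config, rather than assuming 6600, is what lets the rewritten address keep a custom port.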

Summary by CodeRabbit

  • New Features

    • Automatically detects the public IP and updates the node’s external address during startup.
    • Simplifies initial configuration by setting the external address when storage is empty, reducing manual setup.
  • Bug Fixes

    • Improves service reliability by waiting for containers to become healthy before proceeding, with a clear timeout and guidance if startup takes too long.
  • Chores

    • Streamlines startup flow to consistently use the detected external address across network types.

@MicBun MicBun requested a review from outerlook October 8, 2025 05:22
@MicBun MicBun self-assigned this Oct 8, 2025
holdex bot commented Oct 8, 2025

Time Submission Status

Member | Status | Time | Last Update
MicBun | ✅ Submitted | 4h 30min | Oct 8, 2025, 5:27 AM

holdex bot commented Oct 8, 2025

Bug Report Checklist

Status Commit Link Bug Author
❌ Not Submitted

@MicBun, please use git blame and specify the link to the commit that introduced this bug.

Send the following message in this PR: `@pr-time-tracker bug commit [link](url) && bug author @name`

coderabbitai bot commented Oct 8, 2025

Walkthrough

Introduces a runtime wait loop in the AMI stack to confirm that the tn-node containers are up, and revises the tn-node startup script in the docker-compose template to auto-detect the public IP and set or update the p2p external_address, both on initial configuration and when a config already exists, while standardizing port handling.

Changes

Cohort | File(s) | Summary
AMI runtime wait logic | deployments/infra/stacks/ami_pipeline_stack.go | Adds a 60s polling loop after starting tn-node to check docker compose status; logs success or warns with log-check instructions on timeout; conditionally proceeds to completion messaging.
Docker compose tn-node startup/config | deployments/infra/stacks/docker-compose.template.yml | Startup script now detects public IP (env TN_EXTERNAL_ADDRESS or external lookup), prints it, and updates existing /root/.kwild/config.toml external_address using detected IP and P2P port (default 6600). On fresh config, sets EXTERNAL_FLAG to --p2p.external-address <IP>:6600 only when IP is present; removes prior hard-coded port/logic. Retains network-type branching and start flow.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User/AMI Init
  participant S as AMI Stack
  participant DC as Docker Compose
  participant TN as tn-node

  U->>S: Trigger deployment
  S->>DC: Start tn-node service
  S->>S: Poll up to 60s for container status
  alt Containers up
    S->>U: Log success and continue
  else Timeout
    S->>U: Warn with log-check instructions
  end
sequenceDiagram
  autonumber
  participant DC as Docker Compose
  participant SS as tn-node Startup Script
  participant FS as Filesystem (/root/.kwild)
  participant NET as Network (IP Detect)

  DC->>SS: Run tn-node entry script
  SS->>NET: Resolve public IP (TN_EXTERNAL_ADDRESS / external)
  SS->>SS: Determine P2P port (config or default 6600)
  alt Existing config
    SS->>FS: Read config.toml
    SS->>FS: Update external_address if changed
  else Fresh setup
    SS->>SS: Set EXTERNAL_FLAG (only if IP present)
    SS->>FS: Generate new config
  end
  SS->>DC: Start tn-node with updated settings

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

type: chore

Suggested reviewers

  • outerlook

Poem

I sniff the winds for IP signs, hoo-ray!
Compose awakens, nodes come out to play.
Sixty ticks I wait, ears tall, no fret—
Ports at six-six-double-oh are set.
With logs and carrots, I hop deploys today. 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check | ✅ Passed | The title clearly and concisely describes the primary fix of automatically updating the P2P external address when an AWS instance restarts, matching the core change of detecting and applying new public IPs in the configuration.

MicBun commented Oct 8, 2025

@pr-time-tracker bug commit not caused by a previous commit

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (3)
deployments/infra/stacks/docker-compose.template.yml (2)

60-60: Simplify external address extraction.

The sed command uses a complex regex that may not handle all valid external_address formats. Consider using a more robust approach.

Apply this diff to use a simpler extraction:

-          CURRENT_EXTERNAL=$(grep "external_address" /root/.kwild/config.toml | sed 's/.*= *[\"'\'']*\([^\"'\'']*\)[\"'\'']*$/\1/')
+          CURRENT_EXTERNAL=$(grep "^external_address" /root/.kwild/config.toml | sed 's/^external_address[[:space:]]*=[[:space:]]*[\"'\'']*\([^\"'\'']*\)[\"'\'']*.*$/\1/')

This anchors the pattern to line start and uses character classes for whitespace, making it more explicit and maintainable.
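
To see what the anchoring buys, the suggested command can be run in a plain shell against a sample file. The file contents below are made up for illustration; only the grep and sed expressions come from the suggestion above.

```shell
#!/bin/sh
# Hypothetical sample config: note the comment line that also mentions the key.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[p2p]
# external_address is the address advertised to peers
listen = "0.0.0.0:6600"
external_address = "203.0.113.10:6600"
EOF

# Unanchored grep matches the comment line as well as the setting.
unanchored_matches=$(grep -c "external_address" "$sample")

# Anchored grep plus the suggested sed expression keeps only the setting's value.
anchored_value=$(grep "^external_address" "$sample" \
  | sed 's/^external_address[[:space:]]*=[[:space:]]*[\"'\'']*\([^\"'\'']*\)[\"'\'']*.*$/\1/')

echo "lines matched without anchor: $unanchored_matches"
echo "value extracted with anchor: $anchored_value"
rm -f "$sample"
```

With the unanchored pattern, the comment line would be passed to sed too, and the resulting variable would hold two lines instead of one address.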


46-50: Consider adding retry logic for IP detection.

The IP lookup against https://checkip.amazonaws.com/ has a 2-second timeout but no retry logic. Transient network issues during container startup could cause it to fail unnecessarily.

Consider adding retry logic:

         if [ -z "$$PUBLIC_IP" ]; then
+          MAX_RETRIES=3
+          RETRY=0
+          while [ $$RETRY -lt $$MAX_RETRIES ] && [ -z "$$PUBLIC_IP" ]; do
             if command -v wget >/dev/null 2>&1; then
               PUBLIC_IP=$$(wget -T 2 -qO- https://checkip.amazonaws.com/ 2>/dev/null || true)
             elif command -v curl >/dev/null 2>&1; then
               PUBLIC_IP=$$(curl -m 2 -s https://checkip.amazonaws.com/ 2>/dev/null || true)
             fi
+            [ -z "$$PUBLIC_IP" ] && RETRY=$$((RETRY + 1)) && sleep 1
+          done
         fi
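
The retry loop above can be tried outside the template. In the compose file the dollar signs are doubled so that Compose passes a literal `$` through to the shell; the sketch below runs directly in a shell and so uses single `$`. The `fetch_public_ip` function is a hypothetical stand-in for the wget/curl call that fails twice before succeeding, so the loop can be exercised without network access.

```shell
#!/bin/sh
# Hypothetical stand-in for the wget/curl lookup: empty output on the first
# two calls, an address on the third. The counter lives in a file because
# the function is called inside a command substitution (a subshell).
attempts_file=$(mktemp)
echo 0 > "$attempts_file"
fetch_public_ip() {
  n=$(( $(cat "$attempts_file") + 1 ))
  echo "$n" > "$attempts_file"
  [ "$n" -ge 3 ] && echo "203.0.113.10"
}

PUBLIC_IP=""
MAX_RETRIES=3
RETRY=0
while [ "$RETRY" -lt "$MAX_RETRIES" ] && [ -z "$PUBLIC_IP" ]; do
  PUBLIC_IP=$(fetch_public_ip || true)
  if [ -z "$PUBLIC_IP" ]; then
    RETRY=$((RETRY + 1))
    sleep 1
  fi
done

ATTEMPTS=$(cat "$attempts_file")
rm -f "$attempts_file"
echo "PUBLIC_IP=${PUBLIC_IP:-<none>} after $ATTEMPTS attempt(s)"
```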
deployments/infra/stacks/ami_pipeline_stack.go (1)

377-386: Consider making timeout configurable.

The 60-second timeout is hardcoded. Depending on system resources and network conditions, container startup time can vary significantly. Consider making this configurable or increasing the default.

Apply this diff to allow environment-based configuration:

               # Wait for containers to actually start
               echo "Waiting for containers to start..."
-              MAX_WAIT=60
+              MAX_WAIT=${TN_STARTUP_TIMEOUT:-60}
               ELAPSED=0
+              echo "Timeout set to ${MAX_WAIT}s"
               while [ $ELAPSED -lt $MAX_WAIT ]; do

This allows operators to override the timeout via environment variable if needed, while maintaining the 60-second default.
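
A complete version of such a loop might look like the sketch below. The `containers_running` probe is a hypothetical placeholder for the docker compose status check, stubbed here to succeed on its second call so the example runs without Docker; `POLL_INTERVAL` and the function names are likewise assumptions.

```shell
#!/bin/sh
# Timeout in seconds, overridable via TN_STARTUP_TIMEOUT as suggested above.
MAX_WAIT=${TN_STARTUP_TIMEOUT:-60}
POLL_INTERVAL=1   # assumed polling interval, in seconds

# Hypothetical probe. The real script would inspect docker compose status;
# this stub reports "running" from its second call onward.
probe_calls=0
containers_running() {
  probe_calls=$((probe_calls + 1))
  [ "$probe_calls" -ge 2 ]
}

# Poll until the probe succeeds or MAX_WAIT seconds have elapsed.
wait_for_containers() {
  elapsed=0
  while [ "$elapsed" -lt "$MAX_WAIT" ]; do
    if containers_running; then
      echo "Containers started after ${elapsed}s"
      return 0
    fi
    sleep "$POLL_INTERVAL"
    elapsed=$((elapsed + POLL_INTERVAL))
  done
  echo "Warning: containers not running after ${MAX_WAIT}s; check the service logs" >&2
  return 1
}

echo "Timeout set to ${MAX_WAIT}s"
wait_for_containers
wait_status=$?
```

Returning a non-zero status on timeout lets the caller decide whether to print a success message or a warning.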

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df5fc14 and 6282f61.

📒 Files selected for processing (2)
  • deployments/infra/stacks/ami_pipeline_stack.go (1 hunks)
  • deployments/infra/stacks/docker-compose.template.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: acceptance-test
🔇 Additional comments (2)
deployments/infra/stacks/docker-compose.template.yml (2)

43-52: LGTM: IP detection logic is well-structured.

The IP detection properly prioritizes the TN_EXTERNAL_ADDRESS environment variable and falls back to a lookup against https://checkip.amazonaws.com/ with appropriate timeouts. The support for both wget and curl ensures compatibility across different environments.
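
The priority order described here can be sketched as a small function. The function name is hypothetical, and treating TN_EXTERNAL_ADDRESS as a bare IP is an assumption; the wget and curl invocations mirror the ones quoted in the review.

```shell
#!/bin/sh
# Sketch: prefer the operator-supplied address, otherwise ask checkip.
detect_public_ip() {
  if [ -n "${TN_EXTERNAL_ADDRESS:-}" ]; then
    echo "$TN_EXTERNAL_ADDRESS"
    return 0
  fi
  if command -v wget >/dev/null 2>&1; then
    wget -T 2 -qO- https://checkip.amazonaws.com/ 2>/dev/null || true
  elif command -v curl >/dev/null 2>&1; then
    curl -m 2 -s https://checkip.amazonaws.com/ 2>/dev/null || true
  fi
}

# Demo: with the variable set (for example to an Elastic IP), no lookup happens.
TN_EXTERNAL_ADDRESS=198.51.100.7
PUBLIC_IP=$(detect_public_ip)
echo "Detected public IP: ${PUBLIC_IP:-<none>}"
```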


76-77: LGTM: Simplified external flag logic.

The simplified approach for setting the external address flag is clearer and more maintainable than the previous implementation. Using port 6600 as the default is appropriate for Kwil's P2P protocol.
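
As a rough illustration of the fresh-config branch, the flag is built only when an IP was detected. The function name is hypothetical; the flag name and the 6600 port are taken from the change summary.

```shell
#!/bin/sh
# Print the external-address flag for a detected IP ($1), or nothing if empty.
build_external_flag() {
  if [ -n "$1" ]; then
    echo "--p2p.external-address $1:6600"
  fi
}

EXTERNAL_FLAG=$(build_external_flag "203.0.113.10")
EMPTY_FLAG=$(build_external_flag "")
echo "with IP:    '$EXTERNAL_FLAG'"
echo "without IP: '$EMPTY_FLAG'"
```

Leaving the flag empty when detection fails means the node starts without an external address, which is preferable to advertising a malformed one.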

MicBun commented Oct 8, 2025

Merged immediately, so changes can be reflected

@MicBun MicBun merged commit 54aa8b9 into main Oct 8, 2025
10 checks passed
@MicBun MicBun deleted the autoUpdateP2P branch October 8, 2025 11:36