fix: simultaneous connect dead loop #244
base: master
Conversation
@halturin hi, if this PR change is not very elegant, I'd be very happy if you could submit a more elegant fix.
Thanks for the report, it's a great catch. Could you please check whether my fix works for your use case? See #245.
@halturin wait, it still throws those errors and never stops. I pushed a simple demo at https://github.com/qjpcpu/simultaneous-connect; you can reproduce this issue with it.
Actually, the logic is correct. You can see that by adding this line after a.SendAfter(a.PID(), "send_msg", time.Second*3):
a.Log().Info("Total peers: %d", len(a.Node().Network().Nodes()))
and disabling the "trace" level. It takes some time, since it is establishing ~15K TCP connections (3 per "node-to-node"), but in the end you will see the full peer count. Those "mismatch" messages are part of the logic that handles collisions and can be ignored. PS: Not sure why a full-mesh topology is needed here - it's generally not considered a good design choice.
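For reference, a minimal sketch of where that periodic check could live, assuming an Ergo v3 actor (ergo.services/ergo/act, ergo.services/ergo/gen); only the two calls quoted above are taken from the comment, everything else (package, type, method layout) is illustrative.

```go
package main

import (
	"time"

	"ergo.services/ergo/act"
	"ergo.services/ergo/gen"
)

// demo periodically logs how many peers this node currently sees.
type demo struct {
	act.Actor
}

func (d *demo) Init(args ...any) error {
	// schedule the first periodic check
	d.SendAfter(d.PID(), "send_msg", time.Second*3)
	return nil
}

func (d *demo) HandleMessage(from gen.PID, message any) error {
	if message == "send_msg" {
		// log the number of known peers in the cluster
		d.Log().Info("Total peers: %d", len(d.Node().Network().Nodes()))
		// re-arm the timer for the next check
		d.SendAfter(d.PID(), "send_msg", time.Second*3)
	}
	return nil
}
```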
@halturin I have a large cluster with over 2,000 nodes, and simultaneous interconnections between nodes are almost inevitable. However, the fix you proposed converges too slowly and doesn't seem to address my issue. The occurrence of an "id mismatch" indicates that there are silent failures in Send operations, which the upper-layer application cannot detect (and if everything falls back to SendImportant, performance becomes too poor). Additionally, during stress testing, about 6 out of 100 nodes consistently fail to achieve convergence, causing communication between these nodes to remain unrecoverable, unlike the eventual convergence and self-healing you described. The fix I suggested earlier achieves convergence much more quickly. Could you please review whether there is a better implementation approach?
The main problem is that we are closing the TCP link (in both approaches), which means we might lose a message anyway. I'll try to redesign the solution to merge the TCP links into a single connection, which is basically what the framework does, but there are two roles - dialer and acceptor. In the case of a collision, both nodes are dialers and acceptors simultaneously. Here is the diagram of my current fix, which is more like a workaround.
sequenceDiagram
box alice@localhost
participant AliceMgr as Network Manager<br/>(n.connections)
participant AliceOut as Outbound Dialer
participant AliceAcc as Acceptor Loop
end
box bob@localhost
participant BobAcc as Acceptor Loop
participant BobOut as Outbound Dialer
participant BobMgr as Network Manager<br/>(n.connections)
end
Note over AliceMgr,BobMgr: T0: Both nodes call GetNode() simultaneously
par alice dials bob
AliceMgr->>AliceOut: GetNode(bob) → connect()
AliceOut->>BobAcc: TCP Connect
BobAcc-->>AliceOut: TCP Accept
AliceOut->>BobAcc: MessageHello (Start)<br/>{Creation: 123}
BobAcc-->>AliceOut: Handshake response
AliceOut->>AliceMgr: ✅ registerConnection(bob)<br/>LoadOrStore(bob) → SUCCESS
Note over AliceMgr: bob registered in n.connections
and bob dials alice
BobMgr->>BobOut: GetNode(alice) → connect()
BobOut->>AliceAcc: TCP Connect
AliceAcc-->>BobOut: TCP Accept
BobOut->>AliceAcc: MessageHello (Start)<br/>{Creation: 456}
AliceAcc-->>BobOut: Handshake response
BobOut->>BobMgr: ✅ registerConnection(alice)<br/>LoadOrStore(alice) → SUCCESS
Note over BobMgr: alice registered in n.connections
end
Note over AliceMgr,BobMgr: T1: Both have registered each other from OUTBOUND connections
par alice's acceptor receives inbound from bob
Note over AliceAcc: Inbound TCP from bob
BobOut->>AliceAcc: MessageHello (Start)<br/>{Creation: 456}
rect rgb(240, 240, 255)
Note over AliceAcc: Handshake complete<br/>result.Peer = bob<br/>result.PeerCreation = 456
AliceAcc->>AliceMgr: Check: n.connections.Load(bob)
AliceMgr-->>AliceAcc: ❌ bob ALREADY EXISTS
Note over AliceAcc: PeerCreation = 456 ≠ 0<br/>→ Start handshake<br/>🔴 COLLISION DETECTED
Note over AliceAcc: Tie-breaker:<br/>bob > alice<br/>→ alice wins
AliceAcc->>AliceAcc: Close inbound TCP
Note over AliceAcc: Keep existing outbound
end
and bob's acceptor receives inbound from alice
Note over BobAcc: Inbound TCP from alice
AliceOut->>BobAcc: MessageHello (Start)<br/>{Creation: 123}
rect rgb(255, 240, 240)
Note over BobAcc: Handshake complete<br/>result.Peer = alice<br/>result.PeerCreation = 123
BobAcc->>BobMgr: Check: n.connections.Load(alice)
BobMgr-->>BobAcc: ❌ alice ALREADY EXISTS
Note over BobAcc: PeerCreation = 123 ≠ 0<br/>→ Start handshake<br/>🔴 COLLISION DETECTED
Note over BobAcc: Tie-breaker:<br/>alice < bob<br/>→ alice wins
BobMgr->>BobMgr: connections.Delete(alice)
Note over BobAcc: Terminate outbound<br/>Accept inbound
BobAcc->>BobMgr: registerConnection(alice)<br/>from inbound
end
end
Note over AliceMgr,BobMgr: T2: Collision resolved<br/>alice: outbound connection to bob<br/>bob: inbound connection from alice
AliceMgr->>AliceOut: Join pool members
AliceOut->>BobAcc: MessageJoin(ConnectionID)<br/>{PeerCreation: 0}
BobAcc->>BobMgr: Check: n.connections.Load(alice)
BobMgr-->>BobAcc: ✅ alice exists
Note over BobAcc: PeerCreation = 0<br/>→ Join (pool member)
BobAcc->>BobMgr: conn.Join() - add pool member
BobMgr-->>BobAcc: Success
BobAcc-->>AliceOut: ACK
AliceOut-->>AliceMgr: Pool ready
rect rgb(240, 255, 240)
Note over AliceMgr: RemoteNode{bob}<br/>PoolDSN: [127.0.0.1:11146]<br/>Role: DIALER
end
rect rgb(240, 255, 240)
Note over BobMgr: RemoteNode{alice}<br/>PoolDSN: []<br/>Role: ACCEPTOR
end
Note over AliceMgr,BobMgr: ✅ Single logical connection with connection pool<br/>alice is dialer, bob is acceptor
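The tie-breaker step above can be read as a deterministic comparison of node names, so both sides independently agree on which link survives. A hypothetical sketch of that decision (the function name and signature are illustrative, not the framework's actual API):

```go
package collision

// keepInbound reports whether the acceptor side of a collision should keep the
// inbound connection from peerNode and drop its own outbound one. Both nodes
// evaluate the same comparison and therefore reach opposite, compatible
// decisions: in the diagram, alice keeps her outbound link (she becomes the
// dialer) and bob keeps the inbound one (he becomes the acceptor).
func keepInbound(localNode, peerNode string) bool {
	// the lexicographically smaller node name wins the dialer role
	return peerNode < localNode
}
```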
Waiting for your new solution ^_^
Context / Problem
Under “simultaneous connect” (both nodes dialing each other at the same time) and/or during pool re-dial, the acceptor side can receive a Join attempt for an already-existing logical connection but with a different ConnectionID.
Today this surfaces as noisy logs like “unable to join ... connection id mismatch”, and on the dialing side a failed Join can be misinterpreted as a successful recovery of a pool link, causing repeated retries and log storms.
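To make the failure mode concrete, here is a hedged sketch of the acceptor-side decision described above; the types, names, and log text are illustrative and do not reflect the repository's actual code in node/network.go.

```go
package sketch

import (
	"errors"
	"log"
)

var errJoinMismatch = errors.New("connection id mismatch")

// remoteConn stands in for the already-registered logical connection;
// the real structure differs, this is only for illustration.
type remoteConn struct {
	Peer         string
	ConnectionID string
}

// handleJoin rejects a Join whose ConnectionID does not match the registered
// connection. Before this PR, the dialer never learns about the rejection and
// may treat the link as recovered, so it keeps retrying.
func handleJoin(existing *remoteConn, joinID string) error {
	if existing.ConnectionID != joinID {
		log.Printf("unable to join %s: connection id mismatch", existing.Peer)
		return errJoinMismatch
	}
	// IDs match: the inbound TCP link can be attached to the existing pool
	return nil
}
```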
What This PR Changes
Deterministic connection collision handling (simultaneous connect)
Explicit Join accept/reject signal for pool re-dial (protocol change within EHS Join flow)
Files Changed
node/network.go
net/handshake/join.go
testing/tests/002_distributed/t008_simultaneous_connect_test.go (new)
Compatibility Notes (Important)
This PR changes the wire behavior of the Ergo Handshake Join path by requiring an additional 1-byte status after Accept.
That means it is not backward compatible with older Ergo nodes that don’t send/read this byte during Join(). Mixed-version clusters may fail to re-dial/join pool links until all nodes are upgraded.
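A hedged sketch of what that extra byte could look like, assuming it is a single accept/reject flag written by the acceptor right after Accept and read by the dialer; the constant values and function names are assumptions, not the PR's actual wire format.

```go
package handshake

import (
	"errors"
	"io"
)

// Illustrative status values; the PR only states that one extra byte follows
// Accept, so the concrete encoding here is an assumption.
const (
	joinRejected byte = 0
	joinAccepted byte = 1
)

// writeJoinStatus is what the acceptor side would do after sending Accept.
func writeJoinStatus(w io.Writer, accepted bool) error {
	status := joinRejected
	if accepted {
		status = joinAccepted
	}
	_, err := w.Write([]byte{status})
	return err
}

// readJoinStatus is what the dialer side would do: an explicit rejection stops
// it from treating a failed Join as a recovered pool link and retrying forever.
func readJoinStatus(r io.Reader) error {
	var b [1]byte
	if _, err := io.ReadFull(r, b[:]); err != nil {
		return err
	}
	if b[0] != joinAccepted {
		return errors.New("join rejected by peer")
	}
	return nil
}
```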