rpcclient: fix several synchronization bugs #2500

Open · starius wants to merge 4 commits into btcsuite:master from starius:rpcclient-fixes

Conversation

@starius (Contributor) commented Mar 20, 2026

Change Description

Incorporated #2451:

Modify the rpcclient http POST call to ensure that a shutdown immediately interrupts in-flight requests, which otherwise would have to wait until timeout.

Fix 3 other problems in rpcclient:

  • HTTP POST shutdown can deadlock WaitForShutdown due to double-response race
  • Batch-mode Send() error path leaves queued per-request futures unresolved
  • NewBatch starts HTTP POST handlers twice, increasing concurrency surface

Steps to Test

go test ./rpcclient -count=1

Each test is a regression test. It fails if the patch is reverted.

Pull Request Checklist

Testing

  • Your PR passes all CI checks.
  • Tests covering the positive and negative (error) paths are included.
  • Bug fixes contain tests triggering the bug to prevent regressions.

Code Style and Documentation

📝 Please see our Contribution Guidelines for further guidance.

@saubyk (Collaborator) commented Mar 23, 2026

hello @wydengyre @seeforschauer @jcvernaleo would you consider reviewing this one?

@seeforschauer left a comment

LGTM — all three bugs are real and the fixes are correct. I've been working in rpcclient/infrastructure.go recently (#2506, #2505), so I have direct context on these paths.

Commit structure is clean (one bug, one test, one commit). Each test fails when the fix is reverted — solid regression coverage.

One suggestion on sendPostRequest (inline) for tighter shutdown determinism using the priority-select pattern already in addRequest. Two minor notes on failBatchRequests.

Comment on lines 931 to 942
// Atomically either queue the request or fail it due to shutdown.
//
// This avoids delivering two terminal responses to the same request, which
// can otherwise block shutdown cleanup on the second send.
select {
case <-c.shutdown:
	jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
default:
}

select {
case c.sendPostChan <- jReq:
	log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)

case <-c.shutdown:
	return
}

suggestion (non-blocking): Consider the priority-select pattern here.

When shutdown is already closed (its permanent state after Shutdown()), the single select randomly picks between responding and enqueueing (~50/50 per Go spec). If sendPostHandler has already exited cleanup, the enqueued request's future is never resolved.

A non-blocking shutdown guard first makes the common post-shutdown path deterministic — addRequest (line 213) already uses this exact pattern for the same reason. The remaining race (shutdown closing between the two selects) has a much narrower window.

Suggested change

// Prefer shutdown: if already closed, fail the request immediately.
// This avoids a random race between shutdown and enqueue when both
// channels are ready, consistent with the guard in addRequest.
select {
case <-c.shutdown:
	jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
	return
default:
}

// Normal path: enqueue or fail on shutdown. Exactly one outcome.
select {
case c.sendPostChan <- jReq:
	log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)
case <-c.shutdown:
	jReq.responseChan <- &Response{result: nil, err: ErrClientShutdown}
}

@Roasbeef (Member)

This is effectively how the code already was.

@starius isn't it better to prioritize the shutdown path?

@seeforschauer

Right, the structure is two selects like the original — the critical difference is the return after the first shutdown case. The original fell through into the second select even after sending ErrClientShutdown, which allowed the double-resolve.

With the return, it becomes the standard priority-select: if shutdown is already closed, respond deterministically and exit. The second select is only reached when shutdown wasn't closed at check time.

@starius (Author)

Fixed! Now it prioritizes the shutdown path:

// sendPostRequest sends the passed HTTP request to the RPC server using the
// HTTP client associated with the client.  It is backed by a buffered channel,
// so it will not block until the send channel is full.
func (c *Client) sendPostRequest(jReq *jsonRequest) {
	// Prefer shutdown when it is already closed so this path is
	// deterministic. This mirrors addRequest and avoids post-shutdown
	// enqueueing.
	select {
	case <-c.shutdown:
		jReq.responseChan <- &Response{
			result: nil,
			err:    ErrClientShutdown,
		}

		return

	default:
	}

	// Normal path: either enqueue, or fail if shutdown closes in the race
	// window after the guard above.
	select {
	case c.sendPostChan <- jReq:
		log.Tracef("Sent command [%s] with id %d", jReq.method, jReq.id)

	case <-c.shutdown:
		jReq.responseChan <- &Response{
			result: nil,
			err:    ErrClientShutdown,
		}
	}
}


// Resolve all pending futures on the first batch-level failure so
// callers waiting on Receive don't block indefinitely.
req.responseChan <- &Response{err: err}
@seeforschauer

nit: This send is safe because in batch mode sendRequest only calls addRequest (never sendPostRequest), so individual responseChan buffers (size 1) are guaranteed unwritten at this point. A brief comment documenting this invariant would help future readers — e.g.:

// Safe: batch-mode responseChan buffers are unwritten here,
// so this send won't block while locks are held.
req.responseChan <- &Response{err: err}

@starius (Author)

Added a comment:

		// Resolve all pending futures on the first batch-level failure
		// so callers waiting on Receive don't block indefinitely.
		// Safe: batch-mode responseChan buffers are unwritten here,
		// so this send won't block while locks are held. Batch-mode
		// requests only use addRequest (not sendPostRequest), so each
		// responseChan buffer is still empty.
		req.responseChan <- &Response{err: err}

}

c.requestMap = make(map[uint64]*list.Element)
c.requestList.Init()
@seeforschauer

question: In batch mode, addRequest pushes to batchList, never requestList, so this should already be empty. Intentional defensive reset, or leftover? If intentional, a quick comment would clarify.

@starius (Author)

It is a defensive reset. Added a comment:

	// Batch-mode requests are tracked in batchList, so requestList should
	// already be empty. Keep this defensive reset for invariants and future
	// call paths.
	c.requestList.Init()

@saubyk added this to the v0.25.1 milestone Mar 23, 2026
@Roasbeef (Member) left a comment

Change looks good, only comment is why we'd move away from the pattern that prioritizes a cancel path.


wydengyre and others added 4 commits March 24, 2026 22:48
Use a shutdown-aware context for HTTP POST handling so shutdown can
interrupt in-flight requests.

Centralize shutdown error remapping in handleSendPostMessage so all
error exits consistently return ErrClientShutdown when shutdown causes
a context cancellation. Move the retrying HTTP calling code to a free
function handleSendPostMessageWithRetry and cover it with tests.

When shutdown races with sendPostRequest, a request could be marked
as ErrClientShutdown and still be enqueued. The sendPostHandler cleanup
loop would then try to send a second terminal response and could block
forever on a full response channel.

Fix this by prioritizing the shutdown path. First check shutdown with a
non-blocking select and return immediately when it is already closed.
Then use a second select to choose between enqueue and shutdown for the
remaining race window.

A regression test verifies a shutdown request is failed immediately and
never enqueued.

Batch requests were only clearing batchList on Send() errors. The
per-request futures remained unresolved, so callers waiting on Receive
could block forever after a failed batch round trip.

Add failBatchRequests to fan out the Send() error to every queued batch
request and clear tracking state in one place. A regression test now
verifies queued futures complete with the same error returned by Send().

NewBatch called New() and then called start() again. In HTTP POST mode that
created a second sendPostHandler and another shutdown-cancel goroutine, which
broke the expected single-flight serialization of POST sends.

Keep NewBatch as a semantic toggle only: rely on New() to start handlers
once, then set batch=true. A regression test now checks that batch POST
requests stay serialized through one active transport call.
@starius (Author) commented Mar 25, 2026

I addressed the remaining comments in #2451 and added the updated version of it here as the commit "rpcclient: support canceling in-flight http requests". Functionally it is the same, but refactored. I simplified the code by making a function that returns ([]byte, error); error remapping and channel sending are done at the call site.

CC @Roasbeef @seeforschauer @wydengyre
