fix: add node health check to incr/decr before mutate operation#186
AustinWheel wants to merge 2 commits into master from
Conversation
incr/decr operations bypassed the ensureWriteQueueSize check that other write paths (set, add, touch) use, causing them to block on unreachable or stalled nodes for the full mutateOperationTimeout duration. A single unhealthy node could cause all incr/decr calls to block, exhausting the caller's thread pool. This adds the same ensureWriteQueueSize gate to incr/decr. If the target node is inactive or its write queue is full, the operation returns -1 immediately, consistent with the documented API contract. The existing reconciliation logic in EVCacheImpl.incr()/decr() already handles -1 as a zone failure and will sync the value from healthy zones on the next successful operation.
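A self-contained toy model of the gate described above: a `Node` with an active flag and a bounded write queue stands in for EVCache's `MemcachedNode`, and the names (`WriteGateSketch`, `MAX_WRITE_QUEUE`) are invented for this sketch, not the actual EVCache implementation.

```java
// Illustrative model of the fail-fast gate added to incr/decr.
// All names here are hypothetical stand-ins for EVCache internals.
public class WriteGateSketch {
    static final int MAX_WRITE_QUEUE = 4; // toy limit; EVCache's is configurable

    static class Node {
        final boolean active;
        final int writeQueueSize;
        Node(boolean active, int writeQueueSize) {
            this.active = active;
            this.writeQueueSize = writeQueueSize;
        }
    }

    // Mirrors the ensureWriteQueueSize contract: reject writes when the
    // node is inactive or its write queue is already full.
    static boolean ensureWriteQueueSize(Node node) {
        return node.active && node.writeQueueSize < MAX_WRITE_QUEUE;
    }

    // Mirrors the patched incr path: gate first, return -1 on failure
    // instead of blocking for the full mutate timeout.
    static long incr(Node node, long current, long by) {
        if (!ensureWriteQueueSize(node)) return -1; // fail fast per API contract
        return current + by; // stands in for the real mutate round-trip
    }
}
```

The -1 return is what EVCacheImpl's reconciliation logic already treats as a zone failure, so the gate composes with the existing recovery path rather than introducing a new error mode.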
```java
public long incr(String key, long by, long defaultVal, int timeToLive) throws EVCacheException {
    final MemcachedNode node = evcacheMemcachedClient.getEVCacheNode(key);
    if (!ensureWriteQueueSize(node, key, Call.INCR)) {
```
+1 to adding the check here. On the other hand, I am thinking we probably should do the evcNode.isAvailable(call) check first, as in validateNodeForRead, to fail fast instead of going through the retry/wait loop when the node is already inactive. I couldn't think of a reason why we did this only for the read APIs; the same should apply to the write APIs, IIUC. Any thoughts?
I think the goal is to not lose writes: the write queue is considerably larger than the read queue, and if a node drops out of discovery for a short time while its write queue isn't full, we can still enqueue operations that it can process after it returns.
That's a good point. However, I wonder whether that is better than just fast-failing those writes. EVCache writes go to multiple zones/replicas, so a failed write to one node doesn't mean data loss. The failed result will be reflected in the latch count, and the client's LatchPolicy will check whether it can tolerate one or more failures.
And for a high-throughput workload, there can be many request threads retrying/waiting (blocked on sleep) before the connection can be rebuilt (there is also a backoff interval between reconnect attempts, which I think makes the wait even longer).
Okay, yeah, I think it could be useful for incr/decr operations, but should such a change be added to all write ops?
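The trade-off being debated above can be sketched as a toy model: checking availability before the retry/sleep loop means an inactive node costs zero wait iterations instead of the full retry budget. The class and method names here are invented for illustration; EVCache's actual ensureWriteQueueSize differs in detail.

```java
// Toy model of the proposed fast-fail ordering (hypothetical names).
public class FastFailSketch {
    static final int MAX_RETRIES = 3; // EVCache retries a few times with sleeps

    // Returns how many sleep iterations a caller spends before the gate
    // resolves, given the node's state.
    static int gateIterations(boolean nodeAvailable, boolean queueFull) {
        if (!nodeAvailable) return 0;      // fast fail: no waiting at all
        int iterations = 0;
        while (queueFull && iterations < MAX_RETRIES) {
            iterations++;                  // stands in for Thread.sleep(...)
        }
        return iterations;
    }
}
```

Without the availability check, an inactive node with a full queue would burn all MAX_RETRIES sleeps per request thread; with it, those threads return immediately and the multi-zone latch handles the failure.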
… write path
- EVCacheImpl.decr: fix reconciliation to pick the minimum non-(-1) value across nodes instead of the maximum. For decr, the most up-to-date node has the lowest value (most decremented), so the old max-pick logic would overwrite correctly decremented nodes with stale higher values.
- EVCacheClient.ensureWriteQueueSize: add isAvailable() check before entering the retry/sleep loop. This fast-fails writes to inactive nodes instead of blocking request threads through 3 retry iterations. Affects all write operations that go through ensureWriteQueueSize (incr, decr, set, delete, etc.).
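The min-pick reconciliation described in the commit message above can be shown in a few lines: for decr, the most up-to-date replica holds the lowest value, and -1 marks a failed zone to be skipped. The class and method names are illustrative, not EVCache's actual internals.

```java
// Sketch of min-pick reconciliation for decr across zone replicas.
// -1 is the zone-failure sentinel, consistent with the PR's API contract.
public class DecrReconcile {
    static long reconcileDecr(long[] zoneValues) {
        long min = Long.MAX_VALUE;
        for (long v : zoneValues) {
            if (v == -1) continue;   // failed zone: ignore its value
            if (v < min) min = v;    // lowest value = most decremented = freshest
        }
        return (min == Long.MAX_VALUE) ? -1 : min; // all zones failed
    }
}
```

Under the old max-pick logic, a stale zone returning 5 would win over a correctly decremented zone at 3; min-pick makes the freshest replica authoritative.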