fix(syncer): include in-flight and pending-cache work in PendingCount #3162

Merged
alpe merged 3 commits into main from fix/catchup-startup-race-main
Mar 13, 2026

Conversation

@alpe alpe commented Mar 13, 2026

Overview

Fixes a race on startup

  • runCatchup lost the local <= highestSeen bound and now scans upward until ErrHeightFromFuture, then marks head reached.

  • PendingCount() only returns len(heightInCh), so it ignores in-flight work (and pending-cache backlog), letting failover
    stop too early.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed race condition during startup synchronization.

github-actions bot commented Mar 13, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Mar 13, 2026, 1:09 PM |

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

The changes implement in-flight event tracking in the Syncer struct alongside pending events in the cache, adding a new PendingEventsCount() method to the cache manager interface and a helper itemCount() method. The PendingCount() calculation is updated to account for in-flight events, pending cache events, and channel length. The CHANGELOG documents the race condition fix.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Cache Event Counting: block/internal/cache/generic_cache.go, block/internal/cache/manager.go | Adds private itemCount() helper method and public PendingEventsCount() interface method to track the number of pending events in the cache. |
| Syncer In-Flight Tracking: block/internal/syncing/syncer.go | Introduces atomic inFlight field to the Syncer struct and increments/decrements it during heightEvent processing; updates PendingCount() to include in-flight event count and cache's pending events count. |
| Changelog Update: CHANGELOG.md | Reformats entries with line-wrapping and indentation adjustments; adds "Fixed" subsection under Unreleased documenting the race condition fix on startup sync. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • fix(node): race on caught up #3133: Addresses the same "caught up" race condition by modifying how Syncer's pending-count is calculated, with this PR extending the counting mechanism to include in-flight events and cache pending events.

Suggested reviewers

  • julienrbrt

Poem

🐰 The syncer now tracks flights in air,
With cache counts beyond compare,
No race can catch this hoppy friend,
In-flight events—race condition's end! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding in-flight and pending-cache work to PendingCount to fix a race condition on startup. |
| Description check | ✅ Passed | The description provides context and explains the fixes including the race condition and the PendingCount issue, though it could be more detailed about the specific implementation. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

claude bot commented Mar 13, 2026

Claude finished @alpe's task in 3m 16s.


Code Review

  • Gather context and understand the PR
  • Read changed files
  • Analyze code quality, concurrency patterns, and correctness
  • Post review feedback

Summary

This is a small, focused fix that correctly addresses an under-counting bug in PendingCount(). The fix is sound, thread-safe, and idiomatic Go. Overall the changes are good — a few notes below.


Correctness

inFlight counter (syncer.go:75, syncer.go:383-385)

The atomic.Int64 is incremented before processHeightEvent and decremented after within the processLoop goroutine. Since processLoop is a single goroutine, the counter oscillates between 0 and 1 and never goes negative. The increment/decrement bracketing is correct and safe.

PendingCount() fix (syncer.go:409)

The original len(s.heightInCh) missed two categories of work:

  1. The event actively being processed in processHeightEvent (now: inFlight)
  2. Events parked in the pending-cache backlog awaiting sequential ordering (now: cache.PendingEventsCount())

The new formula len(s.heightInCh) + int(s.inFlight.Load()) + s.cache.PendingEventsCount() covers both, which is what the failover drain check in node/failover.go:318 needs. The only residual gap is the narrow handoff window described under Observation 1.
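To make the three terms concrete, here is a minimal sketch with a toy type. `toySyncer`, `park`, and `pendingEventsCount` are hypothetical stand-ins invented for illustration; they are not the ev-node `Syncer` or its cache manager, and the real cache is assumed here to be replaceable by a mutex-guarded map.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// toySyncer is a hypothetical stand-in that models only the three places
// where work can sit: the channel, the handler, and the parked backlog.
type toySyncer struct {
	heightInCh chan uint64  // queued events not yet picked up
	inFlight   atomic.Int64 // events currently inside the handler

	mu      sync.Mutex
	pending map[uint64]struct{} // out-of-order events parked for later
}

func newToySyncer(buf int) *toySyncer {
	return &toySyncer{
		heightInCh: make(chan uint64, buf),
		pending:    make(map[uint64]struct{}),
	}
}

// park stores an event that cannot be applied yet.
func (s *toySyncer) park(height uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[height] = struct{}{}
}

// pendingEventsCount reports the size of the parked backlog.
func (s *toySyncer) pendingEventsCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.pending)
}

// PendingCount sums queued, in-flight, and parked work.
func (s *toySyncer) PendingCount() int {
	return len(s.heightInCh) + int(s.inFlight.Load()) + s.pendingEventsCount()
}

func main() {
	s := newToySyncer(4)
	s.heightInCh <- 10
	s.heightInCh <- 11
	s.park(15)
	s.inFlight.Add(1) // pretend one event is being processed

	// Channel-only view versus the three-term view.
	fmt.Println(len(s.heightInCh), s.PendingCount()) // prints: 2 4
}
```

The channel-only view reports 2 while 4 items are actually outstanding, which is the kind of undercount the PR description attributes to the old `len(heightInCh)` implementation.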

PendingEventsCount() / itemCount() (manager.go:325, generic_cache.go:117)

lru.Cache.Len() is thread-safe (uses internal mutex), so PendingEventsCount() is safe to call concurrently. The interface extension is internal, so no public API breakage.


Observations

1. Non-atomic snapshot — theoretical narrow race

// syncer.go:409
return len(s.heightInCh) + int(s.inFlight.Load()) + s.cache.PendingEventsCount()

The three reads are not taken under a single lock. There is a tiny window where an event has been dequeued from heightInCh but inFlight has not yet been incremented (between case heightEvent, ok := <-s.heightInCh: and s.inFlight.Add(1)), which would transiently yield a count of 0 when the real count is 1. The failover polling loop (failover.go:318) checks PendingCount() == 0 on a ticker, so a single transient undercount could theoretically trigger a premature drain decision.

In practice, the risk is negligible (the window is nanoseconds and the polling interval is much longer), but worth documenting. If robustness is desired later, one option is to increment inFlight before the select (or move to a single mutex), but this is not blocking for this fix.
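One way to close the handoff window entirely, sketched below, is a different technique from the two options named above: count an event when it is submitted and release it only after handling, so the count never dips during the dequeue. The `pipeline` type and its methods are hypothetical and not part of ev-node; whether this fits the real send paths is an open assumption.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pipeline is a hypothetical type. outstanding is raised by the producer
// before the send and lowered by the consumer after handling.
type pipeline struct {
	ch          chan uint64
	outstanding atomic.Int64
}

// submit counts the event before it becomes visible in the channel.
func (p *pipeline) submit(height uint64) {
	p.outstanding.Add(1)
	p.ch <- height
}

// run handles events until the channel is closed.
func (p *pipeline) run(handle func(uint64)) {
	for h := range p.ch {
		handle(h)
		p.outstanding.Add(-1) // released only once handling has finished
	}
}

// PendingCount needs a single atomic read, so there is no multi-source snapshot.
func (p *pipeline) PendingCount() int {
	return int(p.outstanding.Load())
}

func main() {
	p := &pipeline{ch: make(chan uint64, 8)}
	done := make(chan struct{})
	go func() {
		p.run(func(uint64) {})
		close(done)
	}()

	for h := uint64(1); h <= 5; h++ {
		p.submit(h)
	}
	close(p.ch)
	<-done

	fmt.Println(p.PendingCount()) // prints: 0
}
```

The trade-off is that every path that drops or redirects an event instead of sending it would also have to lower the counter, otherwise the count never returns to zero.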

2. atomic.Int64 vs atomic.Bool for inFlight

Since processLoop is single-goroutine and the counter can only be 0 or 1, an atomic.Bool would communicate intent more clearly. That said, the current atomic.Int64 keeps the PendingCount arithmetic clean and is not incorrect — this is a style-only nit.

3. Missing test for PendingCount() three-term formula

There is no unit test covering the updated PendingCount(). Given this is a targeted bug fix, a test that verifies all three components are counted (e.g., confirming PendingCount() > 0 while an event is in-flight or in the pending-cache) would harden the fix against regression.
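A rough shape for such a check is sketched below, covering the in-flight term only. `miniSyncer` and `observeInFlight` are hypothetical names; a real regression test would presumably exercise the actual Syncer inside a `testing.T` function rather than this standalone model.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// miniSyncer is a hypothetical model with only the channel and in-flight terms.
type miniSyncer struct {
	heightInCh chan uint64
	inFlight   atomic.Int64
}

func (s *miniSyncer) PendingCount() int {
	return len(s.heightInCh) + int(s.inFlight.Load())
}

// processLoop mirrors the Add(1) / handle / Add(-1) bracketing from the PR.
func (s *miniSyncer) processLoop(handle func(uint64)) {
	for h := range s.heightInCh {
		s.inFlight.Add(1)
		handle(h)
		s.inFlight.Add(-1)
	}
}

// observeInFlight parks the handler on a gate and samples the counters
// while exactly one event is being processed.
func observeInFlight() (queued, pendingDuring, pendingAfter int) {
	s := &miniSyncer{heightInCh: make(chan uint64, 1)}
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	go func() {
		s.processLoop(func(uint64) {
			close(started) // the event has left the channel and is in-flight
			<-release
		})
		close(done)
	}()

	s.heightInCh <- 42
	<-started
	queued, pendingDuring = len(s.heightInCh), s.PendingCount()

	close(release)
	close(s.heightInCh)
	<-done
	return queued, pendingDuring, s.PendingCount()
}

func main() {
	fmt.Println(observeInFlight()) // prints: 0 1 0
}
```

The gate channels make the sample deterministic: the channel is already empty when the count is read, so a non-zero result can only come from the in-flight term.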

4. PR description mentions a runCatchup change that isn't present

The PR body states: "runCatchup lost the local <= highestSeen bound and now scans upward until ErrHeightFromFuture, then marks head reached." However, block/internal/syncing/da_follower.go is not among the changed files, and the current runCatchup still retains the local > highest guard:

// da_follower.go:351-355
if highest > 0 && local > highest {
    // Caught up.
    f.headReached.Store(true)
    return
}

This description appears to be leftover from an earlier iteration of the PR. Updating the PR description to match the actual changes would help reviewers.


Verdict

The core fix is correct and safe. The changes are minimal, well-scoped, and address the stated race. The main actionable items are: (1) add a regression test for PendingCount(), and (2) update the PR description. Approving is reasonable once those are addressed.

@alpe alpe marked this pull request as ready for review March 13, 2026 13:06
* main:
  chore: re-add replaces for development (#3161)

codecov bot commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 28.57143% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.19%. Comparing base (03fef96) to head (478bc84).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| block/internal/cache/generic_cache.go | 0.00% | 2 Missing ⚠️ |
| block/internal/cache/manager.go | 0.00% | 2 Missing ⚠️ |
| block/internal/syncing/syncer.go | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3162      +/-   ##
==========================================
- Coverage   60.20%   60.19%   -0.02%     
==========================================
  Files         115      115              
  Lines       11896    11902       +6     
==========================================
+ Hits         7162     7164       +2     
- Misses       3923     3927       +4     
  Partials      811      811              
| Flag | Coverage Δ | |
| --- | --- | --- |
| combined | 60.19% <28.57%> (-0.02%) | ⬇️ |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
block/internal/cache/manager.go (1)

61-61: Add a one-line doc for the new exported interface method.

Line 61 introduces a new exported API method; adding method-level docs keeps API docs complete and consistent.

Suggested tweak
 	// Pending events syncing coordination
 	GetNextPendingEvent(blockHeight uint64) *common.DAHeightEvent
 	SetPendingEvent(blockHeight uint64, event *common.DAHeightEvent)
+	// PendingEventsCount returns the number of pending events currently cached by height.
 	PendingEventsCount() int

As per coding guidelines "Document exported types and functions".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@block/internal/cache/manager.go` at line 61, The exported interface method
PendingEventsCount on the manager interface is missing a one-line doc; add a
concise method-level comment above PendingEventsCount() int describing what it
returns (e.g., "PendingEventsCount returns the number of events currently queued
for processing") so the exported API is documented and follows the project's
"Document exported types and functions" guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@block/internal/syncing/syncer.go`:
- Around line 383-385: The check in waitForCatchup() that treats a single
PendingCount()==0 as terminal can be fooled by transient handoffs between the
buffered channel, inFlight (s.inFlight), and cache; update waitForCatchup() to
require stability before returning by sampling PendingCount() multiple times
(e.g., N consecutive zero samples spaced by the current poll interval) or add an
additional guard such as verifying inFlight==0 and the buffered channel is empty
after a short debounce window; reference PendingCount(), waitForCatchup(),
s.inFlight/Add and processHeightEvent() to locate where to implement the
repeated-sampling or extra-check logic so the syncer does not stop prematurely.

In `@CHANGELOG.md`:
- Around line 71-73: Fix the typo "accross" to "across" in the changelog entry
that begins "**BREAKING** Make pending events cache and tx cache fully
ephemeral..." (search for the exact word "accross" in that paragraph) so the
sentence reads "Persist across restart using store metadata.".
- Around line 316-318: Remove the stray orphan list marker before the
[Unreleased] link in CHANGELOG.md by deleting the solitary "-" line so the
Markdown does not render an empty bullet; locate the block containing the "-"
directly above the "[Unreleased]:
https://github.com/evstack/ev-node/compare/v1.0.0-beta.1...HEAD" link and remove
that dash.

---

Nitpick comments:
In `@block/internal/cache/manager.go`:
- Line 61: The exported interface method PendingEventsCount on the manager
interface is missing a one-line doc; add a concise method-level comment above
PendingEventsCount() int describing what it returns (e.g., "PendingEventsCount
returns the number of events currently queued for processing") so the exported
API is documented and follows the project's "Document exported types and
functions" guideline.

📥 Commits

Reviewing files that changed from the base of the PR and between 03fef96 and 478bc84.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • block/internal/cache/generic_cache.go
  • block/internal/cache/manager.go
  • block/internal/syncing/syncer.go

Comment on lines +383 to +385
s.inFlight.Add(1)
s.processHeightEvent(ctx, &heightEvent)
s.inFlight.Add(-1)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate PendingCount definition and all call sites.
rg -nP --type=go '\bPendingCount\s*\(' -C 4

# Find direct zero checks on PendingCount.
rg -nP --type=go 'PendingCount\s*\(\)\s*(==|<=|<)\s*0|0\s*(==|>=|>)\s*PendingCount\s*\(\)' -C 6

Repository: evstack/ev-node

Length of output: 1937


🏁 Script executed:

# Get more context around the failover check at line 318
rg -nP --type=go 'pipelineDrained|PendingCount' block/internal/syncing/syncer.go node/failover.go -A 10 -B 10

Repository: evstack/ev-node

Length of output: 2608


🏁 Script executed:

# Check if there's a polling loop or repeated checks for this condition
rg -nP --type=go 'catchupTimeout|catchup.*loop' node/failover.go -C 8

Repository: evstack/ev-node

Length of output: 2668


🏁 Script executed:

# Search for any debouncing or stability check patterns in the codebase
rg -nP --type=go 'stable|debounce|window|consistent|verify.*zero' -C 3

Repository: evstack/ev-node

Length of output: 29878


🏁 Script executed:

# Check what function contains lines 318-324 and what calls it
rg -nP --type=go 'func.*catchup|return true, nil' node/failover.go -B 20 -A 2 | head -80

Repository: evstack/ev-node

Length of output: 838


🏁 Script executed:

# Search for any test coverage or repeated checks for pipelineDrained
rg -nP --type=go 'pipelineDrained|PendingCount.*==.*0' test/ -C 3

Repository: evstack/ev-node

Length of output: 41


🏁 Script executed:

# Get the full function signature and scope containing the catchup logic
rg -nP --type=go 'func.*failoverState.*catchup' node/failover.go -A 100 | head -120

Repository: evstack/ev-node

Length of output: 41


🏁 Script executed:

# Check if there are any retry mechanisms or stability checks after the return
rg -nP --type=go 'return true' node/failover.go -B 5 -A 5

Repository: evstack/ev-node

Length of output: 336


🏁 Script executed:

# Get the full function definition
rg -nP --type=go 'func \(f \*failoverState\).*\(' node/failover.go -A 150 | head -200

Repository: evstack/ev-node

Length of output: 7050


🏁 Script executed:

# Check the exact function name and context
sed -n '260,330p' node/failover.go

Repository: evstack/ev-node

Length of output: 2049


Verify that callers don't treat a single PendingCount()==0 sample as terminal.

Line 318-319 in waitForCatchup() immediately returns upon first successful check without verifying stable-zero conditions. Since PendingCount() aggregates three independent sources (buffered channel, atomic in-flight counter, cache), a transient zero is possible during event handoff—particularly when inFlight cycles between processing states. The polling loop (~100ms interval) provides natural debouncing but no explicit stability requirement. Once the check succeeds, the syncer stops (line 268) and block production begins; the decision cannot be retried. Confirm that either repeated stable-zero sampling or additional guards prevent premature pipeline drain declaration.
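The repeated-sampling idea can be sketched as follows. `waitStableZero` and `runScripted` are hypothetical helpers written for illustration; they are not the actual waitForCatchup() in node/failover.go, and the interval and sample count are arbitrary assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitStableZero polls count once per interval and returns nil after it has
// seen `need` consecutive zero samples. Any non-zero sample resets the streak.
func waitStableZero(ctx context.Context, count func() int, interval time.Duration, need int) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	streak := 0
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if count() != 0 {
				streak = 0
				continue
			}
			streak++
			if streak >= need {
				return nil
			}
		}
	}
}

// runScripted feeds a fixed sequence of samples (then zeros forever) into
// waitStableZero and reports how many samples were taken.
func runScripted(samples []int, need int) (calls int, err error) {
	count := func() int {
		v := 0
		if calls < len(samples) {
			v = samples[calls]
		}
		calls++
		return v
	}
	err = waitStableZero(context.Background(), count, time.Millisecond, need)
	return calls, err
}

func main() {
	// A transient zero, then real work, then a stable run of zeros.
	calls, err := runScripted([]int{0, 1, 0, 0, 0}, 3)
	fmt.Println(calls, err) // prints: 5 <nil>
}
```

With `need` set to 1 this degenerates to the single-sample check described above; larger values trade a few extra poll intervals of latency for protection against a transient zero.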

Comment on lines +71 to +73
- **BREAKING** Make pending events cache and tx cache fully ephemeral. Those will be re-fetched on restart. DA Inclusion
cache persists until cleared up after DA inclusion has been processed. Persist accross restart using store
metadata. ([#3047](https://github.com/evstack/ev-node/pull/3047))

⚠️ Potential issue | 🟡 Minor

Fix typo in historical changelog text.

Line 72 uses “accross”; it should be “across”.

Proposed fix
-  cache persists until cleared up after DA inclusion has been processed. Persist accross restart using store
+  cache persists until cleared up after DA inclusion has been processed. Persist across restart using store

Comment on lines +316 to +318
-

[Unreleased]: https://github.com/evstack/ev-node/compare/v1.0.0-beta.1...HEAD

⚠️ Potential issue | 🟡 Minor

Remove stray empty bullet before the [Unreleased] link.

Line 316 leaves an orphan -, which renders as an empty bullet in markdown.

Proposed fix
--
-
 [Unreleased]: https://github.com/evstack/ev-node/compare/v1.0.0-beta.1...HEAD

@alpe alpe requested a review from julienrbrt March 13, 2026 13:38
@alpe alpe added this pull request to the merge queue Mar 13, 2026
Merged via the queue into main with commit e4e35bf Mar 13, 2026
35 of 36 checks passed
@alpe alpe deleted the fix/catchup-startup-race-main branch March 13, 2026 14:16
alpe added a commit that referenced this pull request Mar 13, 2026
* main:
  fix(syncer): include in-flight and pending-cache work in PendingCount (#3162)
  chore: re-add replaces for development (#3161)