
fix: offload index sync store to blocking task #841

Open
dknopik wants to merge 5 commits into sigp:unstable from dknopik:offload-sync-store

Conversation

@dknopik
Member

@dknopik dknopik commented Feb 25, 2026

Issue Addressed

During longer historical sync (e.g. initial sync), the following error would often occur:

2026-02-25T15:59:34.326920Z ERROR Unable to read spec from beacon node              error=HttpClient(url: http://localhost:5052/, kind: timeout, detail: operation timed out) endpoint=http://localhost:5052/
2026-02-25T15:59:34.326984Z  WARN A connected beacon node errored during routine health check error=Offline endpoint=http://localhost:5052/

The underlying issue is that the tokio runtime was blocked, preventing us from handling the HTTP request/response and causing the timeout. The actual culprit is the index syncer: it is async, but may block in its last step, storing the indices into the database. During historical sync, this block might take long enough to cause noticeable issues like the one above.

Proposed Changes

Use spawn_blocking to move this operation onto its own thread. This conveniently also allows us to continue fetching indices while the store operation waits. Additionally, we use an Arc to track waiting store operations so that the mechanism fetching missing indices does not fetch the same indices over and over while a store operation is pending.

@dknopik dknopik added bug Something isn't working ready-for-review This PR is ready to be reviewed v2.0.0 The release shipping the next network upgrade labels Feb 25, 2026
let space = MAX_BATCH_SIZE - batch.len();
if space > 0 {
// Only do this if we have any space remaining and there are no store tasks that might wait
// to write missing indices. If the count is 1, only we hold the Arc (no other tasks).
Member

Nit (non-blocking): Arc::strong_count uses Relaxed ordering and is inherently subject to TOCTOU — the count can change between the read and the branch. In this case the consequences are benign (an extra or skipped DB sweep), so it's not a correctness issue.

Still, an AtomicUsize would express the intent more clearly and avoid future confusion:

let pending_stores = Arc::new(AtomicUsize::new(0));
// ...
if space > 0 && pending_stores.load(Ordering::Acquire) == 0 {

Up to you whether it's worth changing — the current code works fine in practice.

Member Author

I think the ordering makes little difference here, if I understand correctly.

I understand the argument for using an AtomicUsize for clarity, though. What do you prefer?

Member

I believe AtomicUsize would be better


// `set_validator_indices` may block as it start a database transaction and updates the
// in memory database. We do not want to do that on the async runtime, so we
// spawn a blocking task.
Member
Nit: The comment explains what but not why — a reader unfamiliar with the PR context wouldn't know why blocking matters here. Consider something like:

// `set_validator_indices` performs synchronous SQLite I/O and takes a write lock
// on in-memory state, which would block the Tokio worker thread and starve other
// async tasks (e.g., beacon node health checks). Offload to the blocking thread pool.

Also minor typo: "it start" → "it starts".

Member

@diegomrsantos diegomrsantos left a comment

Thanks for handling this, great catch. Mostly good, only two small nits.

@dknopik
Member Author

dknopik commented Feb 27, 2026

@diegomrsantos

Thanks for the review, I addressed it. I do not think the Ordering matters at all:

Memory orderings specify the way atomic operations synchronize memory. In its weakest Ordering::Relaxed, only the memory directly touched by the operation is synchronized. On the other hand, a store-load pair of Ordering::SeqCst operations synchronize other memory while additionally preserving a total order of such operations across all threads.

from https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html

We do not really touch any other shared memory beyond the AtomicUsize, so I just opted for Relaxed, the simplest option.

if space > 0 {
// Only do this if we have any space remaining and there are no store tasks that might wait
// to write missing indices. If the count is 1, only we hold the Arc (no other tasks).
if space > 0 && waiting_store_tasks.load(Ordering::Relaxed) == 1 {
Member

waiting_store_tasks starts at 0, increments before spawning each blocking write, and decrements when it finishes.
So should this condition check == 0 (not == 1) if we only want DB fallback when no store writes are currently in flight?

Member Author

Ahh, damn, good catch. I forgot to change this when editing earlier.

@diegomrsantos
Member

Thanks for the follow-up — this is a good direction, and offloading set_validator_indices makes sense.

If you’re open to a small readability pass, I think these changes would help:

  • rename identifiers to express intent, e.g. pending_store_writes, has_inflight_store_writes, missing_index_scan_cursor
  • add a small helper/predicate (fn has_inflight_store_writes(...) -> bool) and gate with that name
  • keep comments on why fallback is gated (avoid redundant DB candidates while writes are active)
  • extract the DB fallback scan into a tiny helper with a brief doc comment
  • add a small regression test that verifies DB fallback scan is skipped while a store task is in flight

The core idea is solid; this will make the intent much clearer and safer for future edits.

@dknopik
Member Author

dknopik commented Mar 11, 2026

Thanks! Addressed everything except the has_inflight_store_writes predicate — renamed the identifiers, added the gating comments, extracted the DB fallback scan helper, and added the regression test.

I skipped the predicate because the check is only used in one spot and reads clearly inline. Let me know if you'd still prefer it extracted.

@diegomrsantos
Member

I think this PR is a good tactical fix for the immediate issue, but I also think it is important to frame it as a workaround, not the real fix.

What this PR fixes is clear: the index syncer is async, but it was still doing blocking SQLite work on the runtime thread. Moving that store step into spawn_blocking is the right local mitigation for that.

However, I think this is the same underlying design problem that came up in #735, just showing up in a different way.

In #735, the problem shows up as a correctness issue:

  • shared in-memory state is updated before we have a clean commit/publication boundary
  • side effects can escape before commit
  • failure handling becomes hard to reason about

Here, the same problem shows up as an execution-model issue:

  • async code is still directly responsible for database-backed writes
  • so we need ad hoc spawn_blocking fixes at individual call sites

In both cases, the root issue is that there is no single store boundary that owns:

  • blocking database execution
  • durable commit
  • publication of committed in-memory state

So I view this PR as a reasonable stopgap, but not the architectural fix.

The real fix, as discussed in #735, is to move toward a model where:

  1. async tasks do not write to SQLite directly
  2. they send commands / batch plans to a centralized store owner
  3. that store owner runs off the async runtime thread
  4. it commits durable state first
  5. it publishes the new in-memory snapshot exactly once, after commit

That would solve this class of problem at the right layer instead of requiring local mitigations like this one.

So from my side: this looks like a sensible short-term fix, but I would not want us to treat it as the end-state design.

@dknopik
Member Author

dknopik commented Mar 12, 2026

Sure. :)

Everything we work on has potential to be improved. Nothing I ever propose is implicitly supposed to be final. We improve the code incrementally.

So from my side: this looks like a sensible short-term fix, but I would not want us to treat it as the end-state design.

I agree.

Is there anything else you want to have addressed for this workaround PR?

@diegomrsantos
Member

diegomrsantos commented Mar 12, 2026

Thanks for working on this. After looking at the current shape of the workaround, I think we should close this PR and solve the problem properly instead.

The easiest way to see the issue is to compare the control flow before and after this PR.

Before this PR, index_sync effectively did this:

fetch indices from BN -> write them to SQLite inline -> only then continue

That was bad because the SQLite write could block the async runtime thread.

But it also had one useful property: natural backpressure. The syncer could not get ahead of the database.

After this PR, the flow becomes:

fetch indices from BN -> spawn_blocking(write to SQLite) -> immediately continue fetching

That fixes the local runtime-blocking symptom, but it removes the old backpressure.

The problem is that the database is still single-writer in practice. NetworkDatabase still uses a single SQLite connection / single writer model, so spawn_blocking does not make the writes meaningfully concurrent. It only makes them queue up off-thread behind one SQLite connection.

So during historical sync, we can now do something like this:

fetch batch 1 -> enqueue writer 1
fetch batch 2 -> enqueue writer 2
fetch batch 3 -> enqueue writer 3

But only one writer can actually run against SQLite. The rest just sit and wait. If we produce batches faster than SQLite can drain them, the queue grows. At that point we have traded:

  • old problem: blocking the runtime thread
  • new problem: building a backlog of writer tasks behind a single DB connection

That is not just a performance detail. It changes the failure mode. Late tasks can now wait a long time for the connection and potentially fail under load.

There is a second issue as well. The new pending_store_writes gating suppresses the DB fallback scan while writes are in flight. That avoids duplicate work, but it also means validators that are already in the DB with index == None now depend on that heuristic before they are retried. If queue-driven work keeps arriving, those older missing-index validators may be delayed for a long time. That makes liveness/progress harder to reason about than before.

So my concern is not that the PR identified the wrong symptom. The symptom is real: blocking SQLite work should not run on the async runtime.

My concern is that this workaround is now getting complex enough that it is changing write ordering, backpressure, and retry/progress behavior locally inside index_sync, while the real problem is architectural.

For that reason, I think we should close this PR rather than keep refining the workaround.

I'm already working on the proper fix: a dedicated store owner / centralized write boundary that preserves serialization and backpressure explicitly, instead of trying to recover them with local heuristics inside index_sync.

That direction addresses the runtime-blocking issue here and also moves us toward the broader persistence/cache-coherence model we discussed in #735.

Given that, I do not think it is worth landing this workaround in its current form.

@dknopik
Member Author

dknopik commented Mar 13, 2026

We can also immediately await the offloaded task if that is what you prefer. That seems to be what you are doing in #885. We should regain backpressure that way. Let me know what you think.

The clear advantage of merging this is that we have it fixed for now. I feel larger refactors might be contentious, and I want to avoid being blocked on that as the Boole release approaches. This also aligns with our desire to have smaller PRs.

The effort to merge this "workaround" is minuscule, now that it exists. We are on track to spend a lot of time discussing whether to merge something that we both consider to be at least an improvement over the current situation.

@shane-moore what is your opinion?

@shane-moore
Member


When reading the beginning of the PR description about the tokio runtime being blocked, my first thought was spawn_blocking. Then, a few sentences later, I read that that's what you did, ha.

And I read through the threads; interesting points about losing backpressure. Yeah, just doing an await for the spawn_blocking work to complete seems a solid way to regain it. I think you could kill the AtomicUsize in that case as well. The tradeoff is that we lose the concurrency of fetching BN batches while waiting on DB writes, but I think that's fine. I'm happy as long as the tokio runtime gets unblocked, tbh.

Then, I don't see why not merge this in the spirit of continuous improvement, which y'all have already discussed, and then iterate/rearchitect in another PR, which is already happening. A phrase I like is, "progress over perfection".

@diegomrsantos
Member

I agree with Shane’s point that, for the immediate problem here, dropping the fetch/write overlap and just awaiting the blocking DB work is probably fine. I think that concern about "losing the concurrency of fetching BN batches while waiting on DB writes" is also a good example of the broader issue, though: that overlap may improve throughput, but so far I have not seen evidence that the benefit is large enough to justify the complexity it introduces.

The work on #885 gives us at least one concrete data point on that tradeoff. I reran historical sync on fresh --data-dirs, one run at a time, against the same Hoodi endpoints:

  • unstable: 46.94s
  • event-by-event branch: 53.43s

So the simpler design is somewhat slower on this setup, by about 6.5s / 14%. That is not irrelevant, but it is also not the kind of gap that, on its own, justifies the amount of complexity it introduced. We are not looking at an order-of-magnitude difference that would explain the current design.

More importantly, that complexity has not only contributed to the cache/publication issues discussed in #735, it also appears to have contributed to real persistence bugs. During this work I found that fresh unstable retains 8 orphan cluster rows in SQLite with no validators and no members. After digging into it, the failure mode is fairly concrete: unstable processes a whole fetched batch inside one outer transaction, ValidatorAdded inserts the cluster row before the validator row, and duplicate-validator failures (UNIQUE constraint failed: validators.validator_pubkey) are treated as malformed/skipped while the outer batch transaction is still committed later. So the partial cluster insert survives as a dead row. The event-by-event branch does not have that problem because it rejects duplicate ValidatorAdded earlier and no longer keeps a long-lived outer transaction alive across skipped events.

That is the broader concern I’ve been raising for a long time: we keep carrying complexity in the name of throughput/concurrency/caching without first establishing that it is actually needed. The result is usually weaker boundaries, less obvious ownership, and failure semantics that are much harder to reason about.

Given that, I don’t think #841 should be merged separately. #885 already addresses the main issue here in a broader and cleaner way: event processing and validator-index DB writes no longer run directly on async tasks, historical sync has a single bounded processor instead of overlapping DB work arbitrarily, and the database layer now owns the durable commit/publication boundary instead of spreading it across async callers. #885 is also already in a reviewable state from my perspective: it has been tested, benchmarked, and the major regressions found during that work were fixed. If there is still a strong preference to merge #841 as a narrow interim fix, that is fine, but I would see it as a stopgap rather than the direction we should keep building on.

Assuming #885 is the direction we want, the more useful next step would be to apply that same scrutiny to the full in-memory NetworkState snapshot: test whether it is providing meaningful benefit for the paths that actually matter, and if it is not, remove it and simplify the system further.

@shane-moore
Member

Really cool and thorough how you ran #885 against Hoodi and were able to make legit comparisons for historical sync. Great find that the current approach results in orphaned rows in the DB. It feels pretty risk-free to merge this after adding the spawn_blocking + .await as a stopgap, as you said. In parallel, I can start gaining familiarity with what's happening in #885 so I can understand why it could perhaps be the more long-term fix. Give me some time on that, as it looks like a rather big PR.
