Commit 5ca36c0

add proto
1 parent 8199fa4 commit 5ca36c0

2 files changed: +245 -10 lines changed

docs/adr/ADR-023-sequencer-recovery.md

Lines changed: 135 additions & 10 deletions
@@ -22,7 +22,7 @@ Considered but not chosen for this iteration:
2222

2323
> We will operate **1 active + 1 failover** sequencer at all times, regardless of control plane. Two implementation options are approved:
2424
25-
- **Design A — Rafted Conductor (CFT)**: A sidecar *conductor* runs next to each `ev-node`. Conductors form a **Raft** cluster to elect a single leader and **gate** sequencing so only the leader may produce blocks. For quorum while running 1‑active/2‑failover semantics, we will run **1 sequencer nodes + 2 failover** (no sequencer) as the third Raft voter.
25+
- **Design A — Rafted Conductor (CFT)**: A sidecar *conductor* runs next to each `ev-node`. Conductors form a **Raft** cluster to elect a single leader and **gate** sequencing so only the Raft leader may produce blocks via the Admin Control API. Applicability: use Raft only when there are **≥ 3 sequencers** (prefer odd N: 3, 5, …). Do not use Raft for two-node 1‑active/1‑failover clusters; use Design B in that case.
2626
*Note:* OP Stack uses a very similar pattern for its sequencer; see `op-conductor` in References.
2727

2828
- **Design B — 1‑Active / 1‑Failover (Lease/Lock)**: One hot standby promotes itself when the active fails by acquiring a **lease/lock** (e.g., Kubernetes Lease or external KV). Strong **fencing** ensures the old leader cannot keep producing after lease loss.
@@ -51,15 +51,140 @@ Status of this decision: **Proposed** for implementation and test hardening.
5151
- **Design A (Raft)**: replicated **Raft log** entries for `UnsafeHead`, `LeadershipTerm`, and optional `CommitMeta` (batch/DA pointers); periodic snapshots.
5252
- **Design B (Lease)**: a single **Lease** record (Kubernetes Lease or external KV entry) plus a monotonic **lease token** for fencing.
5353

54-
### New/changed APIs
55-
Introduce an **Admin RPC** (gRPC/HTTP) on `ev-node` (or a thin shim) used by either control plane:
56-
57-
- `StartSequencer(from_unsafe_head: bool)` — start sequencing, optionally pinning to the last persisted UnsafeHead.
58-
- `StopSequencer()` — hard stop; no more block production.
59-
- `SequencerHealthy()` → `{ healthy, l2_number, l2_hash, l1_origin, peer_count, da_height, last_err }`
60-
- `Status()` → `{ sequencer_active, build_height, leader_hint?, last_err }`
61-
62-
These are additive and should not break existing RPCs.
54+
### Admin Control API (Protobuf)
55+
56+
We introduce a separate, authenticated Admin Control API dedicated to sequencing control. This API is not exposed on the public RPC endpoint and binds to a distinct listener (port/interface, e.g., `:8443` on an internal network or loopback-only in single-host deployments). It is used exclusively by the conductor/lease-manager and by privileged operator automation for break-glass procedures.
57+
58+
Service overview:
59+
- StartSequencer: Arms/starts sequencing subject to fencing (valid lease/term) and optionally pins to last persisted UnsafeHead.
60+
- StopSequencer: Hard stop with optional “force” semantics.
61+
- PrepareHandoff / CompleteHandoff: Explicit, auditable, two-phase, blue/green leadership transfer.
62+
- Health / Status: Health probes and machine-readable node + leader state.
63+
64+
Endpoint separation:
65+
- Public JSON-RPC and P2P endpoints remain unchanged.
66+
- Admin Control API is out-of-band and must not be routed through public ingress. It sits behind mTLS and strict network policy.
67+
68+
Protobuf schema (proposed file: `proto/evnode/admin/v1/control.proto`):
69+
70+
```
71+
syntax = "proto3";
72+
73+
package evnode.admin.v1;
74+
75+
option go_package = "github.com/evstack/ev-node/types/pb/evnode/admin/v1;adminv1";
76+
77+
// ControlService governs sequencer lifecycle and health surfaces.
78+
// All operations must be authenticated via mTLS and authorized via RBAC.
79+
service ControlService {
80+
// StartSequencer starts sequencing if and only if the caller holds leadership/fencing.
81+
rpc StartSequencer(StartSequencerRequest) returns (StartSequencerResponse);
82+
83+
// StopSequencer stops sequencing. If force=true, cancels in-flight loops ASAP.
84+
rpc StopSequencer(StopSequencerRequest) returns (StopSequencerResponse);
85+
86+
// PrepareHandoff transitions current leader to a safe ready-to-yield state
87+
// and issues a handoff ticket bound to the current term/unsafe head.
88+
rpc PrepareHandoff(PrepareHandoffRequest) returns (PrepareHandoffResponse);
89+
90+
// CompleteHandoff is called by the target node to atomically assume leadership
91+
// using the handoff ticket. Enforces fencing and continuity from UnsafeHead.
92+
rpc CompleteHandoff(CompleteHandoffRequest) returns (CompleteHandoffResponse);
93+
94+
// Health returns node-local liveness and recent errors.
95+
rpc Health(HealthRequest) returns (HealthResponse);
96+
97+
// Status returns leader/term, active/standby, and build info.
98+
rpc Status(StatusRequest) returns (StatusResponse);
99+
}
100+
101+
message UnsafeHead {
102+
uint64 l2_number = 1;
103+
bytes l2_hash = 2; // 32 bytes
104+
string l1_origin = 3; // opaque or hash/height string
105+
int64 timestamp = 4; // unix seconds
106+
}
107+
108+
message LeadershipTerm {
109+
uint64 term = 1; // monotonic term/epoch for fencing
110+
string leader_id = 2; // conductor/node ID
111+
}
112+
113+
message StartSequencerRequest {
114+
bool from_unsafe_head = 1; // if false, uses safe head per policy
115+
bytes lease_token = 2; // opaque, issued by control plane (Raft/Lease)
116+
string reason = 3; // audit string
117+
string idempotency_key = 4; // optional, de-duplicate retries
118+
string requester = 5; // principal for audit
119+
}
120+
message StartSequencerResponse {
121+
bool activated = 1;
122+
LeadershipTerm term = 2;
123+
UnsafeHead unsafe = 3;
124+
}
125+
126+
message StopSequencerRequest {
127+
bytes lease_token = 1;
128+
bool force = 2;
129+
string reason = 3;
130+
string idempotency_key = 4;
131+
string requester = 5;
132+
}
133+
message StopSequencerResponse {
134+
bool stopped = 1;
135+
}
136+
137+
message PrepareHandoffRequest {
138+
bytes lease_token = 1;
139+
string target_id = 2; // logical target node ID
140+
string reason = 3;
141+
string idempotency_key = 4;
142+
string requester = 5;
143+
}
144+
message PrepareHandoffResponse {
145+
bytes handoff_ticket = 1; // opaque, bound to term+unsafe head
146+
LeadershipTerm term = 2;
147+
UnsafeHead unsafe = 3;
148+
}
149+
150+
message CompleteHandoffRequest {
151+
bytes handoff_ticket = 1;
152+
string requester = 2;
153+
string idempotency_key = 3;
154+
}
155+
message CompleteHandoffResponse {
156+
bool activated = 1;
157+
LeadershipTerm term = 2;
158+
UnsafeHead unsafe = 3;
159+
}
160+
161+
message HealthRequest {}
162+
message HealthResponse {
163+
bool healthy = 1;
164+
uint64 l2_number = 2;
165+
bytes l2_hash = 3;
166+
string l1_origin = 4;
167+
uint64 peer_count = 5;
168+
uint64 da_height = 6;
169+
string last_err = 7;
170+
}
171+
172+
message StatusRequest {}
173+
message StatusResponse {
174+
bool sequencer_active = 1;
175+
string build_version = 2;
176+
string leader_hint = 3; // optional, human-readable
177+
string last_err = 4;
178+
LeadershipTerm term = 5;
179+
}
180+
```
181+
182+
Error semantics:
183+
- PERMISSION_DENIED: AuthN/AuthZ failure, missing or invalid mTLS identity.
184+
- FAILED_PRECONDITION: Missing/expired lease or fencing violation; handoff ticket invalid.
185+
- ABORTED: Lost leadership mid-flight; TOCTOU fencing triggered self-stop.
186+
- ALREADY_EXISTS: Start requested but sequencer already active with same term.
187+
- UNAVAILABLE: Local dependencies not ready (DA client, exec engine).
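
To make the error semantics above concrete, the following sketch classifies a `StartSequencer` call into three of the listed codes. It rests on several assumptions: the `Code` constants only mirror the gRPC status names (a real server would use the gRPC library's codes), `NodeState` and `CheckStart` are hypothetical, the term is assumed to have been decoded from the opaque `lease_token`, and the order of the checks is one possible choice. PERMISSION_DENIED and ABORTED are left out because they depend on transport identity and on mid-flight leadership loss.

```go
package main

// Code mirrors the names of the status codes listed in the ADR.
type Code string

const (
	OK                 Code = "OK"
	FailedPrecondition Code = "FAILED_PRECONDITION"
	AlreadyExists      Code = "ALREADY_EXISTS"
	Unavailable        Code = "UNAVAILABLE"
)

// NodeState is a hypothetical snapshot of what the node knows locally.
type NodeState struct {
	DepsReady   bool   // DA client and exec engine are ready
	Active      bool   // the sequencer loop is already running
	ActiveTerm  uint64 // term under which it is running
	HighestTerm uint64 // highest term this node has observed
}

// CheckStart classifies a StartSequencer call made under callerTerm.
func CheckStart(s NodeState, callerTerm uint64) Code {
	switch {
	case !s.DepsReady:
		return Unavailable
	case callerTerm < s.HighestTerm:
		return FailedPrecondition // fencing: the caller's term is stale
	case s.Active && s.ActiveTerm == callerTerm:
		return AlreadyExists
	default:
		return OK
	}
}
```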
63188

64189
### Efficiency considerations
65190
- **Design A:** Raft heartbeats and snapshotting add small steady‑state overhead; no impact on throughput when healthy.
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
1+
syntax = "proto3";
2+
3+
package evnode.admin.v1;
4+
5+
option go_package = "github.com/evstack/ev-node/types/pb/evnode/admin/v1;adminv1";
6+
7+
// ControlService governs sequencer lifecycle and health surfaces.
8+
// All operations must be authenticated via mTLS and authorized via RBAC.
9+
service ControlService {
10+
// StartSequencer starts sequencing if and only if the caller holds leadership/fencing.
11+
rpc StartSequencer(StartSequencerRequest) returns (StartSequencerResponse);
12+
13+
// StopSequencer stops sequencing. If force=true, cancels in-flight loops ASAP.
14+
rpc StopSequencer(StopSequencerRequest) returns (StopSequencerResponse);
15+
16+
// PrepareHandoff transitions current leader to a safe ready-to-yield state
17+
// and issues a handoff ticket bound to the current term/unsafe head.
18+
rpc PrepareHandoff(PrepareHandoffRequest) returns (PrepareHandoffResponse);
19+
20+
// CompleteHandoff is called by the target node to atomically assume leadership
21+
// using the handoff ticket. Enforces fencing and continuity from UnsafeHead.
22+
rpc CompleteHandoff(CompleteHandoffRequest) returns (CompleteHandoffResponse);
23+
24+
// Health returns node-local liveness and recent errors.
25+
rpc Health(HealthRequest) returns (HealthResponse);
26+
27+
// Status returns leader/term, active/standby, and build info.
28+
rpc Status(StatusRequest) returns (StatusResponse);
29+
}
30+
31+
message UnsafeHead {
32+
uint64 l2_number = 1;
33+
bytes l2_hash = 2; // 32 bytes
34+
string l1_origin = 3; // opaque or hash/height string
35+
int64 timestamp = 4; // unix seconds
36+
}
37+
38+
message LeadershipTerm {
39+
uint64 term = 1; // monotonic term/epoch for fencing
40+
string leader_id = 2; // conductor/node ID
41+
}
42+
43+
message StartSequencerRequest {
44+
bool from_unsafe_head = 1; // if false, uses safe head per policy
45+
bytes lease_token = 2; // opaque, issued by control plane (Raft/Lease)
46+
string reason = 3; // audit string
47+
string idempotency_key = 4; // optional, de-duplicate retries
48+
string requester = 5; // principal for audit
49+
}
50+
message StartSequencerResponse {
51+
bool activated = 1;
52+
LeadershipTerm term = 2;
53+
UnsafeHead unsafe = 3;
54+
}
55+
56+
message StopSequencerRequest {
57+
bytes lease_token = 1;
58+
bool force = 2;
59+
string reason = 3;
60+
string idempotency_key = 4;
61+
string requester = 5;
62+
}
63+
message StopSequencerResponse {
64+
bool stopped = 1;
65+
}
66+
67+
message PrepareHandoffRequest {
68+
bytes lease_token = 1;
69+
string target_id = 2; // logical target node ID
70+
string reason = 3;
71+
string idempotency_key = 4;
72+
string requester = 5;
73+
}
74+
message PrepareHandoffResponse {
75+
bytes handoff_ticket = 1; // opaque, bound to term+unsafe head
76+
LeadershipTerm term = 2;
77+
UnsafeHead unsafe = 3;
78+
}
79+
80+
message CompleteHandoffRequest {
81+
bytes handoff_ticket = 1;
82+
string requester = 2;
83+
string idempotency_key = 3;
84+
}
85+
message CompleteHandoffResponse {
86+
bool activated = 1;
87+
LeadershipTerm term = 2;
88+
UnsafeHead unsafe = 3;
89+
}
90+
91+
message HealthRequest {}
92+
message HealthResponse {
93+
bool healthy = 1;
94+
uint64 l2_number = 2;
95+
bytes l2_hash = 3;
96+
string l1_origin = 4;
97+
uint64 peer_count = 5;
98+
uint64 da_height = 6;
99+
string last_err = 7;
100+
}
101+
102+
message StatusRequest {}
103+
message StatusResponse {
104+
bool sequencer_active = 1;
105+
string build_version = 2;
106+
string leader_hint = 3; // optional, human-readable
107+
string last_err = 4;
108+
LeadershipTerm term = 5;
109+
}
110+
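
The schema describes `handoff_ticket` as opaque and bound to the term and unsafe head, without fixing an encoding. One possible construction, shown here purely as a sketch, is an HMAC-SHA256 over those fields plus the target ID. The shared `key`, the field layout, and the function names `IssueTicket` and `VerifyTicket` are assumptions for illustration and do not come from the commit.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
)

// IssueTicket returns an HMAC-SHA256 tag over (term, l2_number, l2_hash, target_id).
// The current leader would call this inside PrepareHandoff.
func IssueTicket(key []byte, term, l2Number uint64, l2Hash []byte, targetID string) []byte {
	mac := hmac.New(sha256.New, key)
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], term)
	mac.Write(buf[:])
	binary.BigEndian.PutUint64(buf[:], l2Number)
	mac.Write(buf[:])
	// Length-prefix the hash so the boundary with target_id is unambiguous.
	binary.BigEndian.PutUint64(buf[:], uint64(len(l2Hash)))
	mac.Write(buf[:])
	mac.Write(l2Hash)
	mac.Write([]byte(targetID))
	return mac.Sum(nil)
}

// VerifyTicket recomputes the tag from the values the target node expects
// and compares in constant time. CompleteHandoff would reject on false.
func VerifyTicket(key, ticket []byte, term, l2Number uint64, l2Hash []byte, targetID string) bool {
	return hmac.Equal(ticket, IssueTicket(key, term, l2Number, l2Hash, targetID))
}
```

With this construction, a ticket replayed under a different term, a different unsafe head, or by a different target fails verification, which matches the "bound to term+unsafe head" comment in `PrepareHandoffResponse`.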
