TigerChat uses a three-layer protocol architecture:
- VSR Layer: Viewstamped Replication for consensus per room shard
- Transport Layer: Ed25519-signed messages with mTLS
- Client Layer: WebSocket/HTTP3 edge protocol with JWT auth
┌──────────┐
│  Normal  │◄─────┐
└────┬─────┘      │
     │            │
     │ timeout/   │ view_change_ok
     │ suspected  │
     │            │
     ▼            │
┌──────────┐      │
│View      ├──────┘
│Change    │
└────┬─────┘
     │
     │ recovery_needed
     │
     ▼
┌──────────┐
│Recovering│
└──────────┘
Normal: Replica accepts client ops; primary broadcasts prepare → prepare_ok → commit.
View Change: New primary election; deterministic leader = (view_number mod 3).
Recovering: Replay from latest snapshot + log tail until caught up.
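The state machine and leader rule above can be sketched in Zig as follows. This is a minimal illustration rather than TigerChat's actual code: `Status`, `Event`, `next`, and `primaryForView` are hypothetical names, and the replica count of 3 is inferred from the `mod 3` rule.

```zig
const std = @import("std");

/// Replica states from the diagram (names are illustrative).
const Status = enum { normal, view_change, recovering };

/// Events labelling the diagram's edges.
const Event = enum { timeout, view_change_ok, recovery_needed };

const replica_count: u32 = 3; // assumed from "view_number mod 3"

/// Deterministic leader: the replica whose index equals view mod 3.
fn primaryForView(view: u32) u32 {
    return view % replica_count;
}

/// Transition function; events with no edge in the diagram leave the state unchanged.
fn next(status: Status, event: Event) Status {
    return switch (status) {
        .normal => switch (event) {
            .timeout => Status.view_change,
            else => status,
        },
        .view_change => switch (event) {
            .view_change_ok => Status.normal,
            .recovery_needed => Status.recovering,
            else => status,
        },
        // The diagram draws no outgoing edge; a real replica would
        // presumably leave this state once it has caught up.
        .recovering => status,
    };
}

test "leader rotation and transitions" {
    try std.testing.expect(primaryForView(0) == 0);
    try std.testing.expect(primaryForView(4) == 1);
    try std.testing.expect(next(.normal, .timeout) == .view_change);
    try std.testing.expect(next(.view_change, .view_change_ok) == .normal);
    try std.testing.expect(next(.view_change, .recovery_needed) == .recovering);
}
```

Because the leader is a pure function of the view number, every replica can compute the new primary locally without an extra election round.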
Client     Edge        Primary      Replica-2    Replica-3       Bus
  │          │            │             │            │            │
  ├─send_msg─►            │             │            │            │
  │          ├─prepare────►             │            │            │
  │          │            ├─prepare─────►            │            │
  │          │            ├─prepare──────────────────►            │
  │          │            │             │            │            │
  │          │            ◄─prepare_ok──┤            │            │
  │          │            ◄─prepare_ok───────────────┤            │
  │          │            │ (quorum=2)  │            │            │
  │          │            ├─commit──────►            │            │
  │          │            ├─commit───────────────────►            │
  │          │            ├────────────────────────────fanout msg─►
  │          ◄────────────────────────────────────────────────────┤
  ◄──msg_ack─┤            │             │            │            │
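The quorum step in the sequence above could be tracked with a small per-op structure. The sketch below is an assumption-laden illustration: `PrepareAcks` is a hypothetical helper, and the text does not say whether the primary's own log write counts toward the quorum, so the code simply counts distinct acknowledgers.

```zig
const std = @import("std");

const replica_count = 3; // assumed cluster size
const quorum = 2; // from the diagram: quorum=2

/// Tracks which replicas have acknowledged one prepare (hypothetical helper).
const PrepareAcks = struct {
    op: u64,
    acked: [replica_count]bool = [_]bool{false} ** replica_count,

    /// Records an ack; a duplicate from the same replica is counted once.
    fn ack(self: *PrepareAcks, replica: u8) void {
        std.debug.assert(replica < replica_count);
        self.acked[replica] = true;
    }

    /// Number of distinct replicas that have acknowledged.
    fn count(self: *const PrepareAcks) u32 {
        var n: u32 = 0;
        for (self.acked) |a| {
            if (a) n += 1;
        }
        return n;
    }

    /// True once enough distinct acks have arrived to send commit.
    fn committable(self: *const PrepareAcks) bool {
        return self.count() >= quorum;
    }
};

test "commit only after quorum" {
    var acks = PrepareAcks{ .op = 1234 };
    acks.ack(1);
    acks.ack(1); // duplicate prepare_ok must not count twice
    try std.testing.expect(acks.count() == 1);
    try std.testing.expect(!acks.committable());
    acks.ack(2);
    try std.testing.expect(acks.committable());
}
```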
Primary-0 (suspected)
│
Replica-1 detects timeout → broadcast start_view_change(view=1)
│
Replica-2 ──┐
Replica-3 ──┼─► receive 2+ start_view_change → send do_view_change(view=1, log, op, commit)
│
New Primary-1 (deterministic):
│
├─ collect 2+ do_view_change
├─ merge logs (highest op number wins)
├─ broadcast start_view(view=1, log, op, commit_num)
│
Replicas install new view ──► resume Normal state
View Change Bound: ≤ 300 ms (50 ms timeout × 2 retries = 100 ms, plus 200 ms merge/broadcast).
Every inter-node message has:
┌─────────────────────────────────────┐
│ Header (128 bytes, aligned) │
├─────────────────────────────────────┤
│ Body (variable, up to 1 MB) │
├─────────────────────────────────────┤
│ Ed25519 Signature (64 bytes) │
└─────────────────────────────────────┘
Header fields:
- `magic: u32 = 0x54_49_47_52` ("TIGR")
- `version: u16 = 1`
- `command: u8` (see Message Types below)
- `checksum: u32` (CRC32C of header + body)
- `nonce: u64` (monotonic per sender; replay protection)
- `view: u32`
- `op: u64` (operation number)
- `commit_num: u64`
- `cluster_id: u128` (prevents cross-cluster messages)
- `sender_id: u8` (replica 0..2)
- `body_size: u32`
Signature: Ed25519 over header || body. Public keys exchanged during cluster bootstrap.
mTLS: TLS 1.3 with certificate pinning; all replicas have pre-distributed certs.
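One way to express the header in Zig is shown below. The field names and types come from the list above, but the field ordering (widest first), the `reserved` tail that fills the header out to 128 bytes, the reading of "1 MB" as 1 MiB, and the `plausible` helper are all assumptions made for this sketch. Signature and checksum verification are deliberately left out.

```zig
const std = @import("std");

const magic_tigr: u32 = 0x54_49_47_52; // "TIGR"
const replica_count = 3; // sender_id is documented as 0..2
const max_body_size: u32 = 1 << 20; // "up to 1 MB", taken here as 1 MiB (assumption)

/// Hypothetical layout: ordering and `reserved` are assumptions chosen so
/// that the struct needs no implicit padding and is exactly 128 bytes.
const Header = extern struct {
    cluster_id: u128,
    nonce: u64,
    op: u64,
    commit_num: u64,
    magic: u32 = magic_tigr,
    checksum: u32,
    view: u32,
    body_size: u32,
    version: u16 = 1,
    command: u8,
    sender_id: u8,
    reserved: [68]u8 = [_]u8{0} ** 68,
};

comptime {
    std.debug.assert(@sizeOf(Header) == 128);
}

/// Cheap structural checks to run before the more expensive
/// CRC32C and Ed25519 verification.
fn plausible(h: *const Header, expected_cluster: u128) bool {
    return h.magic == magic_tigr and
        h.version == 1 and
        h.cluster_id == expected_cluster and
        h.sender_id < replica_count and
        h.body_size <= max_body_size;
}

test "header is 128 bytes and rejects foreign clusters" {
    const h = Header{
        .cluster_id = 7,
        .nonce = 1,
        .op = 1,
        .commit_num = 0,
        .checksum = 0,
        .view = 0,
        .body_size = 16,
        .command = 0x01,
        .sender_id = 2,
    };
    try std.testing.expect(@sizeOf(Header) == 128);
    try std.testing.expect(plausible(&h, 7));
    try std.testing.expect(!plausible(&h, 8));
}
```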
| Command | Name | Direction | Body |
|---|---|---|---|
| `0x01` | `prepare` | Primary → Replicas | Message (user op) |
| `0x02` | `prepare_ok` | Replicas → Primary | {op, checksum} |
| `0x03` | `commit` | Primary → Replicas | {commit_num} |
| `0x04` | `start_view_change` | Any → All | {view} |
| `0x05` | `do_view_change` | Replicas → New Primary | {view, log[], op, commit} |
| `0x06` | `start_view` | New Primary → All | {view, log[], op, commit} |
| `0x07` | `request` | Client → Primary | Message (forwarded by edge) |
| `0x08` | `reply` | Primary → Client | {result, op} |
| Command | Name | Direction | Body |
|---|---|---|---|
| `0x20` | `send_message` | Client → Edge | SendMessageRequest |
| `0x21` | `message_ack` | Edge → Client | MessageAck |
| `0x22` | `message_event` | Edge → Client | MessageEvent (fan-out) |
| `0x23` | `subscribe_room` | Client → Edge | {room_id, since_op} |
| `0x24` | `snapshot_request` | Client → Edge | {room_id, until_op} |
| `0x25` | `snapshot_chunk` | Edge → Client | {messages[], has_more} |
Client                  Edge Gateway
  │                           │
  ├──WS handshake + JWT───────►
  │                           ├─validate JWT
  │                           ├─rate limit check
  ◄──────────────────accept───┤
  │                           │
  ├──subscribe_room(room_id)──►
  │                           ├─forward to primary
  ◄──────────────────ack──────┤
  │                           │
  ├──send_message─────────────►
  │                           ├─assign idempotency key
  │                           ├─prepare to VSR primary
  │                           ├─wait quorum commit
  ◄──message_ack(op=1234)─────┤
  │                           │
  ◄──message_event────────────┤ (fan-out from bus)
  ◄──message_event────────────┤
  │                           │
Idempotency: Edge assigns {client_id, client_seq} → op mapping. Duplicate send_message returns cached op.
Catch-up: When a client reconnects, it sends subscribe_room(since_op=last_seen). The edge then sends the buffered events or triggers a snapshot.
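The idempotency rule can be illustrated with a map keyed by `{client_id, client_seq}`. This is a sketch under assumptions: `RequestKey`, `IdempotencyCache`, and `resolve` are hypothetical names, the 64-bit field widths are guesses, and in a real edge the op would come back from the VSR commit rather than being passed in by the caller.

```zig
const std = @import("std");

/// Key identifying one client request (field widths are assumptions).
const RequestKey = struct { client_id: u64, client_seq: u64 };

/// Hypothetical edge-side cache mapping request keys to committed ops.
const IdempotencyCache = struct {
    map: std.AutoHashMap(RequestKey, u64),

    fn init(allocator: std.mem.Allocator) IdempotencyCache {
        return .{ .map = std.AutoHashMap(RequestKey, u64).init(allocator) };
    }

    fn deinit(self: *IdempotencyCache) void {
        self.map.deinit();
    }

    /// Returns the cached op for a duplicate request; otherwise records
    /// `new_op` for this key and returns it.
    fn resolve(self: *IdempotencyCache, key: RequestKey, new_op: u64) !u64 {
        const entry = try self.map.getOrPut(key);
        if (!entry.found_existing) entry.value_ptr.* = new_op;
        return entry.value_ptr.*;
    }
};

test "duplicate send_message returns the cached op" {
    var cache = IdempotencyCache.init(std.testing.allocator);
    defer cache.deinit();

    const key = RequestKey{ .client_id = 42, .client_seq = 7 };
    const first = try cache.resolve(key, 1234);
    const retry = try cache.resolve(key, 1235); // same key, client retried
    try std.testing.expect(first == 1234);
    try std.testing.expect(retry == 1234);
}
```

A production cache would also need an eviction policy so the map stays bounded; none is specified here.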
const std = @import("std");
const assert = std.debug.assert;

const Message = extern struct {
    room_id: u128, // shard key
    msg_id: u128, // unique (UUID v7)
    author_id: u64,
    parent_id: u128, // 0 for top-level; else thread parent
    timestamp: u64, // microseconds since epoch
    body_len: u32,
    body: [2048]u8, // max inline; larger → external blob
    checksum: u32, // CRC32C(msg_id..body)
    reserved: [112]u8, // explicit padding: the fields above end at byte 2128
};
comptime {
    assert(@sizeOf(Message) == 2240); // cache-line friendly
    assert(@alignOf(Message) == 16);
}

const RoomState = struct {
    room_id: u128,
    op: u64, // last applied op
    commit_num: u64,
    message_count: u64,
    head_hash: [32]u8, // SHA256 chain
    // in-memory index (not persisted; rebuilt from log)
    message_index: std.AutoHashMap(u128, u64), // msg_id → op
};

┌────────────────────┐
│ op: u64 │
├────────────────────┤
│ checksum: u32 │ ─┐
├────────────────────┤ │ CRC32C of this region
│ Message (2240 B) │ ─┘
└────────────────────┘
WAL append is atomic: write entry → fsync → update op pointer (atomic rename of metadata file).
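The per-entry checksum can be illustrated with a dependency-free CRC32C. The sketch below uses a slow bitwise implementation for clarity; a real WAL would more likely use a table-driven or hardware-accelerated routine. `Entry` is a simplified, hypothetical stand-in in which a byte slice replaces the fixed-size `Message`.

```zig
const std = @import("std");

/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
fn crc32c(bytes: []const u8) u32 {
    var crc: u32 = 0xFFFF_FFFF;
    for (bytes) |byte| {
        crc ^= byte;
        var bit: u8 = 0;
        while (bit < 8) : (bit += 1) {
            // mask is all ones when the low bit is set, otherwise zero
            const mask = @as(u32, 0) -% (crc & 1);
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    return ~crc;
}

/// Simplified WAL entry (hypothetical): the payload stands in for Message.
const Entry = struct {
    op: u64,
    checksum: u32,
    payload: []const u8,

    fn init(op: u64, payload: []const u8) Entry {
        return .{ .op = op, .checksum = crc32c(payload), .payload = payload };
    }

    /// Recomputes the checksum and compares it with the stored one.
    fn valid(self: Entry) bool {
        return self.checksum == crc32c(self.payload);
    }
};

test "crc32c check value and corruption detection" {
    // Standard CRC-32C check value for the ASCII string "123456789".
    try std.testing.expect(crc32c("123456789") == 0xE306_9283);

    var e = Entry.init(1, "hello");
    try std.testing.expect(e.valid());
    e.payload = "hellp"; // simulate a corrupted payload
    try std.testing.expect(!e.valid());
}
```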
- Read last checkpoint (snapshot at `op=N`).
- Scan WAL from `N+1` to EOF.
- Replay each entry through state machine.
- Verify `head_hash` chain integrity (each message includes `prev_hash`).
- If corruption detected → panic (operator must restore from replica).
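The recovery steps above can be sketched as a replay loop that checks op continuity and the hash chain. Several details are assumptions: the chaining rule `head = SHA256(prev_head || payload)` (the text only says "SHA256 chain"), the `WalEntry` shape, and returning an error where the document specifies a panic, so that the caller decides how to stop.

```zig
const std = @import("std");
const Sha256 = std.crypto.hash.sha2.Sha256;

/// Assumed chaining rule: head = SHA256(prev_head || payload).
fn chain(prev: [32]u8, payload: []const u8) [32]u8 {
    var h = Sha256.init(.{});
    h.update(&prev);
    h.update(payload);
    var out: [32]u8 = undefined;
    h.final(&out);
    return out;
}

/// Hypothetical in-memory view of one WAL entry.
const WalEntry = struct { op: u64, payload: []const u8, head_hash: [32]u8 };

const RecoveryError = error{ GapInLog, HashChainBroken };

/// Replays entries after checkpoint op N and returns the last applied op.
fn replay(
    checkpoint_op: u64,
    checkpoint_head: [32]u8,
    wal: []const WalEntry,
) RecoveryError!u64 {
    var op = checkpoint_op;
    var head = checkpoint_head;
    for (wal) |entry| {
        if (entry.op <= checkpoint_op) continue; // already covered by the snapshot
        if (entry.op != op + 1) return error.GapInLog;
        head = chain(head, entry.payload);
        if (!std.mem.eql(u8, &head, &entry.head_hash)) return error.HashChainBroken;
        op = entry.op; // a real replica would apply the entry to its state machine here
    }
    return op;
}

test "replay verifies continuity and the hash chain" {
    const zero = [_]u8{0} ** 32;
    const h1 = chain(zero, "a");
    const h2 = chain(h1, "b");

    const wal = [_]WalEntry{
        .{ .op = 11, .payload = "a", .head_hash = h1 },
        .{ .op = 12, .payload = "b", .head_hash = h2 },
    };
    const last = try replay(10, zero, &wal);
    try std.testing.expect(last == 12);

    const bad = [_]WalEntry{.{ .op = 11, .payload = "x", .head_hash = h1 }};
    try std.testing.expectError(error.HashChainBroken, replay(10, zero, &bad));
}
```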
┌──────────────────────────┐
│ magic: u64 = 0x544947_534E4150 ("TIGSNAP")
├──────────────────────────┤
│ version: u16 = 1 │
├──────────────────────────┤
│ room_id: u128 │
├──────────────────────────┤
│ op: u64 │
├──────────────────────────┤
│ message_count: u64 │
├──────────────────────────┤
│ Message[0] │
│ Message[1] │
│ ... │
│ Message[N-1] │
├──────────────────────────┤
│ Ed25519 signature │
└──────────────────────────┘
Snapshot triggered every 10,000 ops or 1 GB WAL size.
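The trigger rule amounts to a two-condition predicate. In this sketch `shouldSnapshot` and its parameter names are hypothetical, and "1 GB" is read as 1 GiB, which the text does not specify.

```zig
const std = @import("std");

const snapshot_op_interval: u64 = 10_000;
const snapshot_wal_bytes: u64 = 1 << 30; // "1 GB" taken as 1 GiB (assumption)

/// True when either threshold since the last snapshot has been reached.
fn shouldSnapshot(ops_since_snapshot: u64, wal_bytes: u64) bool {
    return ops_since_snapshot >= snapshot_op_interval or
        wal_bytes >= snapshot_wal_bytes;
}

test "either threshold triggers a snapshot" {
    try std.testing.expect(!shouldSnapshot(9_999, 0));
    try std.testing.expect(shouldSnapshot(10_000, 0));
    try std.testing.expect(shouldSnapshot(1, 1 << 30));
}
```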
- Message Integrity: Every WAL entry has CRC32C; every snapshot has Ed25519 signature.
- Replay Protection: Nonce field in transport header; replicas reject `nonce ≤ last_seen[sender]`.
- View Monotonicity: `view` must increase; replicas ignore `start_view` with `view ≤ current_view`.
- Cluster Isolation: `cluster_id` prevents cross-cluster message injection.
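The replay-protection rule can be sketched as a per-sender high-water mark. `ReplayGuard` is a hypothetical helper, and initialising `last_seen` to zero assumes that senders start their nonces at 1; the text does not state the starting value.

```zig
const std = @import("std");

const replica_count = 3; // sender_id is documented as 0..2

/// Per-sender replay filter (hypothetical helper).
const ReplayGuard = struct {
    last_seen: [replica_count]u64 = [_]u64{0} ** replica_count,

    /// Returns true and advances the watermark if the nonce is fresh;
    /// rejects unknown senders and any nonce <= last_seen[sender].
    fn accept(self: *ReplayGuard, sender_id: u8, nonce: u64) bool {
        if (sender_id >= replica_count) return false;
        if (nonce <= self.last_seen[sender_id]) return false;
        self.last_seen[sender_id] = nonce;
        return true;
    }
};

test "stale and replayed nonces are rejected" {
    var guard = ReplayGuard{};
    try std.testing.expect(guard.accept(1, 5));
    try std.testing.expect(!guard.accept(1, 5)); // exact replay
    try std.testing.expect(!guard.accept(1, 4)); // older message
    try std.testing.expect(guard.accept(1, 6));
    try std.testing.expect(guard.accept(2, 1)); // watermarks are per sender
    try std.testing.expect(!guard.accept(3, 1)); // not a valid replica id
}
```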
- Single-writer WAL: No lock contention; one `fsync` per commit (group commit in future).
- Bounded message size: 2 KB inline → predictable latency, no heap allocation per message.
- Zero-copy fan-out: Edge replicas mmap committed log; send via `sendfile(2)`.
- Tail latency: All queues bounded; worst-case = 2× timeout (view change).
| Scenario | Detection | Recovery | Bound |
|---|---|---|---|
| Primary crash | Prepare timeout (50 ms) | View change | 300 ms |
| Replica crash | Heartbeat miss | Ignore; quorum=2 sufficient | N/A |
| Network partition | Prepare timeout | View change if quorum lost | 300 ms |
| WAL corruption | CRC32C or hash chain failure | Panic → operator restore from peer | Manual |
| Byzantine message | Signature verification failure | Drop + alert | Immediate |
Key invariant: If 2+ replicas survive, no data loss.
- Viewstamped Replication Revisited (Liskov & Cowling, 2012)
- TigerBeetle VOC protocol (visual consensus)
- NASA Power of Ten (bounded everything)