Operations Guide

Practical reference for running rustbgpd in production. For config syntax, see CONFIGURATION.md. For security posture, see SECURITY.md.

Starting the daemon

rustbgpd /etc/rustbgpd/config.toml

Or via systemd (see examples/systemd/rustbgpd.service):

sudo systemctl start rustbgpd

The daemon validates the config file at startup. Validation errors display rustc-style diagnostics showing the offending TOML line with column markers:

error: invalid hold_time 2: must be 0 or >= 3
  --> /etc/rustbgpd/config.toml:12:13
   |
12 | hold_time = 2
   |             ^ must be 0 or >= 3

The daemon exits with code 1 — it never starts with an invalid config.

On success, structured JSON logs go to stdout. The daemon is ready when you see the starting rustbgpd log line with version, ASN, and router ID.

Per-peer log filtering

Set log_level on any neighbor or peer group to override the global log level:

[[neighbors]]
address = "10.0.0.1"
remote_asn = 65001
log_level = "debug"

Or filter via RUST_LOG using the per-peer tracing span:

RUST_LOG=info,peer{peer_addr=10.0.0.1}=debug rustbgpd /etc/rustbgpd/config.toml

Config validation

Validate a config file without starting the daemon:

rustbgpd --check /etc/rustbgpd/config.toml

Prints rustc-style diagnostics on error, or config OK on success.

Config diff (dry-run reload)

Preview what a SIGHUP reload would change before sending it:

# Compare proposed config against current config
rustbgpd --diff /tmp/new-config.toml /etc/rustbgpd/config.toml

# JSON output for scripting
rustbgpd --diff /tmp/new-config.toml /etc/rustbgpd/config.toml --json

Output is grouped into three sections:

Reload-applied changes — neighbor add/remove/modify that SIGHUP will reconcile immediately.
Restart-required changes — [global], [rpki], [bmp], [mrt] changes that require a full daemon restart.
Informational — peer group and policy changes that are detected but not reconciled by the current SIGHUP path. Shown for visibility only.

Exit codes: 0 = no actionable changes, 1 = actionable changes found, 2 = error (bad config, missing file).

Configuration reload (SIGHUP)

sudo systemctl reload rustbgpd
# or: kill -HUP $(pidof rustbgpd)

What happens:

The daemon re-reads the TOML config file from disk.
diff_neighbors() computes per-peer add/remove/change deltas.
New peers are added, removed peers get NOTIFICATION and teardown, changed peers are removed and re-added.
Global section changes ([global], [rpki], [bmp], [mrt]) are logged as warnings and require a full restart to take effect.
Peer group and policy changes are not currently reconciled — they are detected and logged but not applied. This is a known limitation tracked in the roadmap.

Reload failures are logged per-peer. If reconciliation fails, the daemon keeps the previous in-memory config and continues running.

Use rustbgpd --diff to preview changes before reloading.

What state persists

State	Where	When
Neighbor add/delete via gRPC	Config file (atomic write)	Immediately on mutation
GR restart marker	`<runtime_state_dir>/gr-restart.toml`	On coordinated shutdown
MRT dump files	`[mrt] output_dir`	On periodic timer or `TriggerMrtDump`
gRPC UDS socket	`<runtime_state_dir>/grpc.sock`	Daemon lifetime

Not persisted: routing state (Adj-RIB-In, Loc-RIB, Adj-RIB-Out), policy evaluation state, RPKI VRP tables, BMP client state. All routing state is rebuilt from peers after restart.

Upgrading

Build the new version: cargo build --release
Stop the daemon: systemctl stop rustbgpd (or rustbgpctl shutdown)
Replace the binary at /usr/local/bin/rustbgpd
Start: systemctl start rustbgpd

When Graceful Restart is enabled (the default), the coordinated shutdown in step 2 writes a GR restart marker. On step 4, the daemon advertises R=1 to static peers, asking them to retain our routes while we reconnect. The restart window is the largest gr_restart_time among all GR-enabled peers.

For zero-downtime upgrades in a route-server pair, drain traffic to the standby, upgrade, then swap.

Failure modes

gRPC server dies unexpectedly

The daemon treats an unexpected gRPC server exit as fatal and initiates a coordinated shutdown (NOTIFICATION to all peers, GR marker write). This is deliberate: losing the control plane means losing the ability to shut down cleanly later. See ADR-0022.

RPKI cache unreachable

Each RTR client reconnects independently after a fixed retry_interval (default 600s). If no fresh EndOfData arrives before expire_interval (default 7200s), cached VRPs for that server are discarded. Routes are re-validated against the remaining VRP table.

When all caches are down, the VRP table is empty and all routes have validation state NotFound. If your policy denies NotFound routes, this will cause route drops. The recommended policy is to deny Invalid and prefer Valid, leaving NotFound as a neutral fallback.

BMP collector unreachable

Each BMP client reconnects independently with backoff (default reconnect_interval = 30s). During disconnection, BMP events for that collector are dropped. No routing state is affected — BMP is purely observational. On reconnect, the client sends a fresh Initiation message; the collector rebuilds state from subsequent Peer Up and Route Monitoring messages.

MRT dump failure

If the output directory is not writable, the MRT manager logs an error and skips that dump cycle. Periodic dumps continue on the next interval. The daemon does not crash on MRT failures.

Peer max-prefix exceeded

When a peer sends more prefixes than max_prefixes, the daemon sends a NOTIFICATION (Cease / Maximum Number of Prefixes Reached) and tears down the session. The peer is not automatically re-enabled — use rustbgpctl neighbor <addr> enable or the gRPC EnableNeighbor RPC to restart it.

Key metrics to watch

Metrics are exposed on the Prometheus endpoint if prometheus_addr is configured. If omitted, metrics are still collected internally and available via gRPC GetMetrics and GetHealth RPCs.

Health

Metric	What it tells you
`bgp_peers_established`	Number of peers in Established state
`bgp_peers_configured`	Total configured peers
`bgp_uptime_seconds`	Daemon uptime

Routing

Metric	What it tells you
`bgp_rib_prefixes{table="loc_rib"}`	Loc-RIB size (best paths)
`bgp_rib_prefixes{table="adj_rib_in"}`	Total received prefixes
`bgp_rib_prefixes{table="adj_rib_out"}`	Total advertised prefixes
`bgp_updates_received_total`	Inbound UPDATE count
`bgp_updates_sent_total`	Outbound UPDATE count

Graceful Restart

Metric	What it tells you
`bgp_gr_active_peers`	Peers currently in GR stale-route state
`bgp_gr_stale_routes`	Routes currently marked stale
`bgp_gr_timer_expired_total`	GR timers that expired (routes swept)

RPKI

Metric	What it tells you
`bgp_rpki_vrp_count{af="ipv4"}`	IPv4 VRP entries loaded
`bgp_rpki_vrp_count{af="ipv6"}`	IPv6 VRP entries loaded

A sudden drop in VRP count likely means a cache connection was lost or the cache itself has stale data.

Key log messages

rustbgpd uses structured JSON logging. Key messages to watch for:

Message	Level	Meaning
`starting rustbgpd`	INFO	Daemon started successfully
`peer session established`	INFO	BGP session reached Established
`peer session down`	INFO	BGP session left Established
`received shutdown signal`	INFO	SIGTERM/SIGINT received
`shutdown initiated via gRPC`	INFO	`Shutdown` RPC called
`gRPC server exited unexpectedly`	ERROR	Fatal — coordinated shutdown follows
`config reloaded`	INFO	SIGHUP reload succeeded
`config reload failed`	ERROR	SIGHUP reload failed — previous config kept
`GR restart marker`	INFO	Restart marker written or read
`max-prefix limit exceeded`	WARN	Peer exceeded prefix limit
`gRPC TCP listener bound to a non-loopback address`	WARN	Security posture warning

Debugging a session that won't establish

Check peer state:
```
rustbgpctl neighbor
```
Look at the FSM state. Active means we're trying to connect but TCP isn't establishing. OpenSent/OpenConfirm means OPEN exchange is failing.
Check logs for the peer:
```
journalctl -u rustbgpd | grep "10.0.0.2"
```
Look for NOTIFICATION codes, capability mismatches, or hold timer expiry.
Common causes:
- TCP not reaching: Firewall, wrong address, peer not listening on 179
- ASN mismatch: Remote peer has a different remote-as configured for us
- Router ID collision: Two speakers with the same router ID
- Hold timer zero vs non-zero: One side sends hold_time=0, the other expects keepalives
- Capability mismatch: Check address family negotiation in OPEN logs
- MD5 mismatch: TCP RST with no BGP-level error; check both sides' passwords
- TTL security: GTSM requires TTL=255; multi-hop peers will fail
Verify from the remote side: Check FRR/BIRD/peer logs for their view of the session attempt.

Common operational tasks

Add a peer at runtime

rustbgpctl neighbor 10.0.0.5 add --asn 65005 --description "new-peer"

The peer is persisted to the config file automatically.

Remove a peer

rustbgpctl neighbor 10.0.0.5 delete

Sends NOTIFICATION, tears down the session, removes from config.

Soft reset (re-evaluate import policy)

rustbgpctl neighbor 10.0.0.2 softreset

Re-applies import policy to all routes from this peer without tearing down the session.

Enable / disable a peer

rustbgpctl neighbor 10.0.0.2 enable
rustbgpctl neighbor 10.0.0.2 disable --reason "maintenance"

Trigger an MRT dump

rustbgpctl mrt-dump

Live dashboard

rustbgpctl top          # default 2s poll
rustbgpctl top -i 5     # 5s poll interval

Shows sessions, prefix counts, message rates, RPKI VRP counts, and streaming route events in a terminal UI. Press h for keybindings.

Check health

rustbgpctl health

View received routes from a peer

rustbgpctl rib received 10.0.0.2

View best routes (Loc-RIB)

rustbgpctl rib

Graceful shutdown

rustbgpctl shutdown

Sends NOTIFICATION to all peers, writes GR marker, exits cleanly.

Explain best-path selection

rustbgpctl rib --prefix 10.0.0.0/24 --explain

Shows all candidates for a prefix with the decisive comparison reason for each non-winner (e.g., higher_local_pref, shorter_as_path).

Looking glass (birdwatcher-compatible REST API)

Optional HTTP server for external looking glass frontends (Alice-LG, etc.). Configure in TOML:

[global.telemetry.looking_glass]
addr = "0.0.0.0:8080"

Endpoints: /status, /protocols/bgp, /routes/protocol/{id}, /routes/peer/{peer}. Omit the section entirely to disable.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Operations Guide

Starting the daemon

Per-peer log filtering

Config validation

Config diff (dry-run reload)

Configuration reload (SIGHUP)

What state persists

Upgrading

Failure modes

gRPC server dies unexpectedly

RPKI cache unreachable

BMP collector unreachable

MRT dump failure

Peer max-prefix exceeded

Key metrics to watch

Health

Routing

Graceful Restart

RPKI

Key log messages

Debugging a session that won't establish

Common operational tasks

Add a peer at runtime

Remove a peer

Soft reset (re-evaluate import policy)

Enable / disable a peer

Trigger an MRT dump

Live dashboard

Check health

View received routes from a peer

View best routes (Loc-RIB)

Graceful shutdown

Explain best-path selection

Looking glass (birdwatcher-compatible REST API)

Uh oh!

FilesExpand file tree

OPERATIONS.md

Latest commit

History

OPERATIONS.md

File metadata and controls

Operations Guide

Starting the daemon

Per-peer log filtering

Config validation

Config diff (dry-run reload)

Configuration reload (SIGHUP)

What state persists

Upgrading

Failure modes

gRPC server dies unexpectedly

RPKI cache unreachable

BMP collector unreachable

MRT dump failure

Peer max-prefix exceeded

Key metrics to watch

Health

Routing

Graceful Restart

RPKI

Key log messages

Debugging a session that won't establish

Common operational tasks

Add a peer at runtime

Remove a peer

Soft reset (re-evaluate import policy)

Enable / disable a peer

Trigger an MRT dump

Live dashboard

Check health

View received routes from a peer

View best routes (Loc-RIB)

Graceful shutdown

Explain best-path selection

Looking glass (birdwatcher-compatible REST API)