- Closes (design): CS-110 (restore never drilled), CS-111
(backups co-located with the DB), CS-112 (ClickHouse lake has no backup). Operator executes; everything here is scripted/config.
Context
- Postgres (served tier): pgBackRest, stanza `stellarindex`, WAL-archived, retention 2 fulls, but repo1 lives in the SAME host MinIO/ZFS pool as the database it protects. One pool loss destroys the primary AND every backup.
- ClickHouse (raw lake, the ADR-0034 source of truth): zero
backup, not Ansible-provisioned. BUT the lake is *derivable*: it is a structural decode of the Galexie ledger archive, which exists in the local MinIO galexie-archive bucket AND publicly in aws-public-blockchain (ADR-0027 cold tier). The question is not "can we recover" but "how long" — the original full backfill was a multi-week job.
- No restore of anything has ever been executed: `pgbackrest info` is the only verification (CS-110), and the dr-activation runbook overclaimed drill status (fixed, CS-113).
Decision
1. Postgres: offsite repo2, restore drilled monthly
- Add pgBackRest `repo2` on offsite S3-compatible object storage (Hetzner Storage Box or Backblaze B2; the operator picks the account. Config vars are in the ansible role, gated on presence so the role is a no-op until credentials exist). Async archiving to both repos; retention: repo1 keeps 2 fulls (fast local restore), repo2 keeps 4 fulls (survival copy).
- `scripts/ops/restore-drill.sh` (ships with this ADR) performs a NON-DESTRUCTIVE scratch restore on r1: pgbackrest restore into a throwaway data dir, start a disposable postgres on port 5499, verify row-count + hash-chain sanity queries against the live DB, report, destroy. Wired to a monthly timer once the operator has run it by hand twice. A backup that has never restored is a hope, not a backup.
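
The scratch-restore sequence described above can be sketched as an ordered command plan. This is a minimal illustration, not the contents of `scripts/ops/restore-drill.sh`: the function name `restore_drill_plan`, the scratch path, and the choice of `--archive-mode=off` (to keep the scratch instance from archiving WAL into the real repos) are assumptions, and it further assumes the restored data dir carries its own `postgresql.conf`. The verification queries are left out because their exact form is not specified here.

```python
from typing import List

STANZA = "stellarindex"  # stanza name from this ADR
SCRATCH_PORT = 5499      # disposable postgres port from this ADR


def restore_drill_plan(scratch_dir: str, repo: int = 1) -> List[List[str]]:
    """Return the ordered argv lists for one non-destructive scratch restore.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    # Guard the final `rm -rf` against an obviously dangerous target.
    if not scratch_dir.startswith("/") or scratch_dir.rstrip("/") == "":
        raise ValueError("scratch_dir must be an absolute, non-root path")
    return [
        # 1. Restore into the throwaway data dir (never the live pg1-path).
        #    Assumption: archiving is disabled on the restored cluster.
        ["pgbackrest", f"--stanza={STANZA}", f"--repo={repo}",
         f"--pg1-path={scratch_dir}", "--archive-mode=off", "restore"],
        # 2. Start a disposable postgres on the scratch port and wait for it.
        ["pg_ctl", "-D", scratch_dir, "-o", f"-p {SCRATCH_PORT}", "-w", "start"],
        # (row-count and hash-chain sanity queries would run at this point)
        # 3. Stop the disposable instance.
        ["pg_ctl", "-D", scratch_dir, "-m", "fast", "-w", "stop"],
        # 4. Destroy the scratch data dir.
        ["rm", "-rf", "--", scratch_dir],
    ]


if __name__ == "__main__":
    for argv in restore_drill_plan("/var/tmp/pg-restore-drill", repo=2):
        print(" ".join(argv))
```

Passing `repo=2` would exercise the offsite copy rather than the local one, which is the restore path that matters after a pool loss.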
2. ClickHouse: protect the metadata, PROVE the re-derive, back up the tail
Full CH backup (~multi-TiB, growing) is rejected as the primary strategy: the lake's ground truth (raw LCM) already exists in two independent archives, and paying for object storage to hold a third copy of derived data is poor spend. Instead, three cheaper guarantees:
- Schema + state backup (tiny, daily): `SHOW CREATE` DDL for every table + the ch-live-catchup/backfill cursor state, pushed to repo2 alongside pgBackRest. Losing DDL/config is what turns a re-derive from "run the script" into archaeology.
- Re-derive path is drilled, not assumed: the restore drill's CH
half re-derives a RANDOM 100k-ledger window into a scratch database via the existing ch-backfill machinery and reconciles counts against the live lake. This proves the recovery machinery + measures throughput, giving an honest RTO figure (extrapolated full-rebuild time) reported into the drill log.
- Tail insurance: the newest N days of `contract_events` + `ledgers` (the window between Galexie-archive certification and live) are included in the daily offsite push, because this is the only window where the lake could hold data the archives don't yet.
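
The window selection and RTO extrapolation in the re-derive drill can be sketched as below. The function names and the example figures are hypothetical, and the linear scaling is an assumption: ledger density varies across history, so a single random window gives an estimate rather than a guarantee.

```python
import random
from typing import Tuple

WINDOW = 100_000  # ledgers re-derived per drill (from this ADR)


def pick_window(first_ledger: int, last_ledger: int,
                rng: random.Random) -> Tuple[int, int]:
    """Pick a random inclusive [start, end] range of exactly WINDOW ledgers."""
    span = last_ledger - first_ledger + 1
    if span < WINDOW:
        raise ValueError("lake holds fewer ledgers than one drill window")
    # Latest valid start keeps the whole window inside the lake.
    start = rng.randint(first_ledger, last_ledger - WINDOW + 1)
    return start, start + WINDOW - 1


def extrapolate_rto_hours(window_seconds: float, total_ledgers: int) -> float:
    """Scale one window's wall-clock time linearly to the whole lake."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    ledgers_per_second = WINDOW / window_seconds
    return total_ledgers / ledgers_per_second / 3600


if __name__ == "__main__":
    # Illustrative numbers only; real values come from the drill itself.
    start, end = pick_window(1, 60_000_000, random.Random())
    print(f"re-derive ledgers {start}..{end} into the scratch database")
    print(f"extrapolated RTO: {extrapolate_rto_hours(600.0, 60_000_000):.1f} h")
```

Recording the measured window duration alongside the extrapolated figure lets later drills show whether throughput is drifting as the lake grows.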
If the measured full-rebuild RTO exceeds what we can tolerate post-launch (verification + explorer lake surfaces dark for that long), REVISIT with clickhouse-backup incremental to offsite — the partition scheme (1M-ledger partitions, old partitions ~immutable) makes incrementals cheap. That decision needs the drill's throughput number first; do not pre-buy storage on a guess.
3. Drill logging is append-only evidence
Every drill run appends to docs/operations/drills/ (date, repo used, restore duration, verification results, RTO extrapolation). The sev-playbook's annual-DR section stays aspirational until R2/R3 exist; the monthly scratch drill is what we CAN honestly do on one host, so it is what we commit to.
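
One way to keep the drill log append-only is to write each run as a single line to a file opened in append mode. This is a sketch under assumptions: the JSON Lines format, the `drills.jsonl` file name, and the field names are hypothetical, chosen only to mirror the items listed above (date, repo used, restore duration, verification results, RTO extrapolation).

```python
import json
from pathlib import Path

# Hypothetical field names mirroring the items this ADR says each run records.
REQUIRED_FIELDS = {"date", "repo", "restore_seconds", "verification", "rto_hours"}


def append_drill_record(log_dir: Path, record: dict) -> Path:
    """Append one drill result as a JSON line; existing lines are never rewritten."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"drill record missing fields: {sorted(missing)}")
    log_dir.mkdir(parents=True, exist_ok=True)
    log_file = log_dir / "drills.jsonl"  # hypothetical file name
    # Mode "a" only ever adds to the end of the file.
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return log_file


if __name__ == "__main__":
    # Illustrative values only, not real drill results.
    append_drill_record(Path("docs/operations/drills"), {
        "date": "1970-01-01",
        "repo": "repo2",
        "restore_seconds": 0,
        "verification": {"row_counts": "example", "hash_chain": "example"},
        "rto_hours": 0.0,
    })
```

Because the file lives in the repository, the commit history provides a second layer of evidence that earlier entries were not edited.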
Consequences
- A single ZFS-pool loss no longer destroys the backups with the
database (repo2), and "restore works" becomes a measured monthly fact with an evidence trail.
- The CH lake's protection cost is ~GBs/day (DDL + tail) instead of
TiBs, with the re-derive path exercised instead of trusted.
- Operator actions (accounts, credentials, first two hand-runs)
are queued in the operator register; everything else is committed code/config.