# Chapter 15: Reliability, Operations, and Disaster Recovery

**Strategic Takeaway.** Four fault-tolerance layers (pool health, app retry, ingredient retry, task retry) mean upstream provider outages are absorbed, not propagated to users. Distributed job execution assumes failures are normal.

This chapter consolidates the reliability posture: what breaks, what recovers automatically, what requires operator intervention, and what changes when the deployment model shifts from single-tenant SaaS to a multi-operator network.

## 15.1 Failure modes and layered mitigations

The four-layer fault tolerance model (introduced in Chapter 6, §6.7) handles the most common transient failure — a dropped database connection — without manual intervention:
- Pool health — `pool_pre_ping`, `pool_recycle`, `pool_reset_on_return`.
- Application retry — `retry_on_transient_db_error()` with exponential backoff.
- Ingredient retry — individual ingredients re-raise transient errors so upper layers see them.
- Task retry — Celery `self.retry()` with a fresh DB session.

Beyond database failures, the shadow graph (Chapter 6, §6.2) provides crash recovery for the orchestration engine: if a worker dies mid-recipe, another worker can read the persisted step states and resume from the last committed point. `acks_late` ensures the broker redelivers the task.

Idempotency is the practical escape hatch for exactly-once illusions. Steps that produce side effects (provider API calls, CDN uploads, webhook deliveries) must tolerate duplicate execution. The one-action-one-commit discipline prevents double-advancement within the shadow graph, but external side effects may fire twice.

## 15.2 Artifact storage: ephemeral CDN vs content-addressed

The current artifact path is S3 + CloudFront (Chapter 8, §8.3): fast, operator-controlled, with TTL-based purge. The architecture accommodates content-addressed alternatives:
| System | Address | Durability model |
|---|---|---|
| S3 / CloudFront | HTTPS URL | Operator-controlled retention tiers; CDN edge caching. |
| IPFS | CID | Pinning by operator or user; gateway latency. |
| Arweave | Transaction id | Pay-once permanent (in theory); legal right-to-delete friction. |
| Walrus (Sui) | Blob id | Erasure-coded DA on Sui validator set; emerging. |

The `ipfs_cid` column exists in `AssetRetention`; the ingestion service accepts a `storage_preference` of `CDN`, `IPFS`, or `ALL`; and the cleanup worker logs that IPFS unpinning is stubbed. The integration is architecturally prepared but not wired.

Open questions: who pays for pinning (platform, seller, or buyer); whether open-registry outputs default to ephemeral CDN only or offer an optional permanent tier; and how to reconcile S3 ETags with CIDs for cross-reference in ERC-8004 metadata.

## 15.3 Webhook and callback standards

The implemented webhook system (Chapter 8, §8.2) aligns partially with emerging industry standards:

| Standard | Status in Scrypted |
|---|---|
| HMAC-SHA256 signing | Implemented (outbound and inbound). |
| Timestamp + skew check | Timestamp header sent; receiver-side skew rejection is a client concern. |
| Idempotency keys | Event IDs are generated; consumer dedup is recommended. |
| Standard Webhooks spec | Partial alignment; not formally declared as conformant. |
| CloudEvents envelope | Not implemented; candidate for internal queue normalization. |
| Secret rotation | Architecture supports overlapping secrets during rotation; verify schema. |

An official replay-window policy for customers, CloudEvents adoption for internal messages, and mTLS or signed URLs for high-value tenants remain open items.

## 15.4 Model supply-chain integrity

Confidential inference (TEE, Chapter 13) protects data at runtime but does not prove that the weights loaded into the enclave match a known-good build. Supply-chain threats include poisoned checkpoints, typosquat packages, malicious LoRA adapters, and compromised CI pipelines.
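One concrete mitigation is to pin an expected digest per model version and refuse to load weights that do not match. The sketch below is illustrative only, not code from the Scrypted codebase; the `EXPECTED_SHA256` registry and `verify_weights` helper are hypothetical names, and real digests would come from a signed manifest or release notes rather than source code.

```python
import hashlib
from pathlib import Path

# Hypothetical per-version digest registry an adapter might maintain.
# The value shown is a placeholder, not a real checkpoint digest.
EXPECTED_SHA256 = {
    ("example-model", "1.2.0"): "9f86d081884c7d659a2feaa0c55ad015"
                                "a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so multi-gigabyte checkpoints fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path: Path, model: str, version: str) -> None:
    """Fail closed: unknown versions and mismatched digests both raise."""
    expected = EXPECTED_SHA256.get((model, version))
    if expected is None:
        raise RuntimeError(f"no pinned digest for {model}=={version}")
    actual = sha256_of(path)
    if actual != expected:
        raise RuntimeError(
            f"digest mismatch for {model}=={version}: "
            f"expected {expected}, got {actual}"
        )
```

Failing closed on an unknown version is the important design choice: a missing pin is treated the same as a corrupted download.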
Relevant standards:

- NIST SSDF (SP 800-218): secure build, release, and vulnerability-management expectations.
- SLSA (Supply-chain Levels for Software Artifacts): increasing levels of build integrity and provenance.
- Sigstore: keyless signing and a transparency log for container and release signing.

Ingredients live under `ingredients/` as discrete Python packages — a natural surface for SBOM generation and signed release tags. Provider adapters that pull remote weights should document expected checksums per version; today this is a code-review discipline, not an automated check.

Open questions: whether on-chain digests of ingredient container images complement or replace off-chain SBOMs; who pays for third-party audits of popular open-weight ingredients; and what the revocation semantics are when a signed build is later found vulnerable.

## 15.5 Multi-region and decentralized deployment

The current deployment is single-region. Scaling to multi-region or federated operation introduces several concerns:

- Redis as SPOF — The ConcurrencyManager, job event pub/sub, and Celery broker all depend on Redis. Independent Redis instances per cell would multiply effective provider slots and partition job events. Shared Redis (or a coordination layer above it) is required for global consistency.
- PostgreSQL failover — RTO and RPO depend on the PITR / replica-lag configuration. The shadow graph's crash-recovery property helps — a worker can resume from committed state — but cross-region primary failover introduces a window of ambiguity.
- Job home routing — In a federated mesh, each job needs a defined "home" cell for its shadow graph. If cells share a logical database (multi-region Postgres), this is transparent; if each cell has its own primary, job routing must be deterministic or consensus-based.

These are hypotheses for the network's decentralized deployment story (Chapter 16). The current system's fault tolerance is designed for a single-operator, single-region deployment with redundant workers.
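The "deterministic" option for job-home routing can be illustrated with rendezvous (highest-random-weight) hashing: every cell computes the same score for each (job, cell) pair and agrees on the winner with no shared state. This is a sketch under assumed names — `home_cell` and the cell identifiers are hypothetical, not part of the current system.

```python
import hashlib

def home_cell(job_id: str, cells: list[str]) -> str:
    """Deterministically pick a home cell for a job via rendezvous hashing.

    Any participant with the same cell list computes the same answer,
    so no coordination round is needed. Removing a cell only remaps
    the jobs that were homed on it.
    """
    if not cells:
        raise ValueError("no cells available")

    def score(cell: str) -> int:
        digest = hashlib.sha256(f"{job_id}:{cell}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Tie-break on the cell name so the choice is total and stable.
    return max(cells, key=lambda c: (score(c), c))
```

Because the winner is the maximum-scoring cell for that job, removing any *other* cell leaves the assignment unchanged, which bounds re-routing during cell churn — the property that makes this attractive over modular hashing for a federated mesh.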
Source: transcribed from the compiled Scrypted Network Design whitepaper PDF for web reading. Layout, figures, and pagination may differ from the PDF.