
Orchestration Engine

(Part 2 of 2 — same chapter in the PDF; split for the web site.)

  1. Read the current count; if at limit, back off.
  2. Use MULTI/EXEC to atomically increment the count and set a TTL (3600 s).
  3. On step completion (or timeout), release_slot decrements the count.

This is a global coordination mechanism if all workers share the same Redis instance. In a federated deployment, each cell’s Redis would maintain independent slot counts — a known gap discussed in Chapter 15.

Additionally, EXTERNAL_API_RATE_LIMIT_RPS (10 requests/second) and exponential backoff retries constrain outbound client behavior at the constant layer. The fault taxonomy classifies upstream 429 / RATE_LIMITED responses as retryable.

6.7 Fault tolerance layers

Transient failures — database connection drops, network blips, provider throttling — are handled by a four-layer defense:
  1. Connection pool (database.py): pool_pre_ping tests connections before use; pool_recycle = 300 replaces connections older than five minutes; pool_reset_on_return = “rollback” ensures clean state.
  2. Application-level retry (db_retry.py): retry_on_transient_db_error wraps short DB operations with exponential backoff (4 attempts, 1 s → 2 s → 4 s, cap 30 s). A session rollback between attempts provides a fresh connection from the pool.
  3. Ingredient-level retry: individual ingredients (e.g., store_media) use retry_on_transient_db_error for their own commit blocks, re-raising transient errors so the task layer can see them.
  4. Task-level retry (recipe_executor.py): execute_recipe_step_v2 catches transient DB errors and calls Celery’s self.retry() with exponential backoff (15 s → 30 s → 60 s, cap 300 s), obtaining a new DB session on each attempt.

Result: a single dropped connection during a commit is caught by layer 2. A failure that invalidates an entire step attempt is caught by layer 4, which re-runs the step with a fresh session. Layer 1 keeps the pool healthy so fewer connections go stale.

6.8 Job cancellation

JobStatus.CANCELLED exists in the schema, and the scry workflow cancel CLI command sets the status and marks the workflow as cancelled. However, workers do not currently honor cancellation during execution: once a task runs, it runs to completion (or failure) regardless of the job’s status in the database. A worker picking up a queued task does not re-check the status before executing.

The mitigation design (documented internally) proposes:

  • An API endpoint (POST /jobs/{id}/cancel) with owner-scoped auth.
  • Cooperative cancellation: workers re-fetch job status before each step and exit early if CANCELLED.
  • Optional Celery revoke() on the task ID for queued-but-not-started tasks.

This is an open engineering item, not a research question; it is listed here for completeness.
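The proposed cooperative cancellation can be sketched in a few lines. This is a minimal illustration, not the codebase API: JobCancelled, run_recipe, and fetch_status are hypothetical names, and the in-memory status dict stands in for a fresh database read before each step.

```python
# Sketch of cooperative cancellation between recipe steps (hypothetical
# names; the real executor would re-query the jobs table per step).

from enum import Enum


class JobStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


class JobCancelled(Exception):
    """Raised when a worker observes CANCELLED between steps."""


def run_recipe(job_id, steps, fetch_status):
    """Run steps in order, re-fetching job status before each one.

    fetch_status stands in for a fresh DB read; in production this
    would be a SELECT against the jobs table, not an in-memory dict.
    """
    completed = []
    for step in steps:
        if fetch_status(job_id) == JobStatus.CANCELLED:
            # Exit early: remaining steps are skipped, completed work is kept.
            raise JobCancelled(
                f"job {job_id} cancelled after {len(completed)} step(s)"
            )
        completed.append(step())
    return completed


# Example: the first step cancels the job, so the second never runs.
statuses = {"job-1": JobStatus.RUNNING}


def cancel_after_first():
    statuses["job-1"] = JobStatus.CANCELLED
    return "step-1 done"


try:
    run_recipe("job-1", [cancel_after_first, lambda: "step-2 done"],
               lambda jid: statuses[jid])
except JobCancelled as exc:
    print(exc)  # job job-1 cancelled after 1 step(s)
```

Re-checking status between steps (rather than mid-step) keeps the check cheap and avoids interrupting a commit; queued-but-not-started tasks would still need the separate revoke() path described above.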

6.9 Server-Sent Events: the job stream

GET /jobs/stream provides a Server-Sent Events (SSE) endpoint for real-time job status updates. The mechanism:

  1. Client connects with a Bearer token; the server validates and extracts the user ID.
  2. A daemon thread subscribes to the Redis pub/sub channel job_events:user:{user_id}.
  3. Messages (JSON payloads with job_id, status, progress_pct, progress_message, timestamps) are forwarded as SSE data: lines.
  4. Heartbeat comments (: heartbeat) are sent every 25 seconds to keep the connection alive through proxies.
  5. Response headers disable proxy buffering (X-Accel-Buffering: no).

The publish() function in job_events.py is fire-and-forget: it is safe to call even when no clients are subscribed, and it does not block the calling worker.

This is a coarse-grained stream — job-level status transitions, not per-token inference output. Streaming inference tokens to clients would require a different channel architecture (per-job channels, backpressure, PII considerations); this is an open design question for future protocol-level streaming (MCP, A2A task streaming).
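The SSE framing described above can be sketched without a live Redis connection. In this illustration a queue.Queue stands in for the Redis pub/sub subscription, and sse_frames is an assumed helper name, not the actual endpoint code; what it shows is the wire format: `data:` lines for JSON payloads and comment lines (`: heartbeat`) when no message arrives within the heartbeat interval.

```python
# Minimal sketch of SSE framing for GET /jobs/stream (assumed names;
# the real endpoint subscribes to Redis channel job_events:user:{user_id}
# on a daemon thread).

import json
import queue

HEARTBEAT_INTERVAL_S = 25  # matches the documented heartbeat cadence


def sse_frames(messages, timeout_s=HEARTBEAT_INTERVAL_S):
    """Yield SSE wire frames from a queue of job-event payloads.

    A queue.Empty timeout stands in for "no pub/sub message within the
    heartbeat interval": the generator emits an SSE comment frame so
    proxies keep the idle connection open.
    """
    while True:
        try:
            payload = messages.get(timeout=timeout_s)
        except queue.Empty:
            yield ": heartbeat\n\n"  # comment frame, ignored by clients
            continue
        if payload is None:          # sentinel: stream closed
            return
        yield f"data: {json.dumps(payload)}\n\n"


# Usage: publish one event, then close the stream.
q = queue.Queue()
q.put({"job_id": "j1", "status": "RUNNING", "progress_pct": 40})
q.put(None)
print("".join(sse_frames(q)))
# data: {"job_id": "j1", "status": "RUNNING", "progress_pct": 40}
```

Because the generator only ever reads from its subscription, a fire-and-forget publish() on the worker side needs no knowledge of connected clients; frames for absent subscribers are simply never produced.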

Source: transcribed from the compiled Scrypted Network Design whitepaper PDF for web reading. Layout, figures, and pagination may differ from the PDF.
