Failure Modes and Effects Analysis (FMEA) & Risk Assessment
**RPN (Risk Priority Number) = Severity × Probability × Detection**
1. FMEA Table
Scoring Criteria
| Score | Severity | Probability | Detection |
|---|---|---|---|
| 1-2 | Negligible / Cosmetic | Extremely rare (<0.01%) | Always detected immediately |
| 3-4 | Minor / Recoverable | Rare (0.01-0.1%) | Usually detected quickly |
| 5-6 | Moderate / Degraded service | Occasional (0.1-1%) | Sometimes detected |
| 7-8 | Major / Service outage | Frequent (1-10%) | Rarely detected automatically |
| 9-10 | Critical / Data loss / Financial | Very frequent (>10%) | Undetectable until impact |
RPN (Risk Priority Number) = Severity × Probability × Detection
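As a worked example of the formula, the sketch below recomputes the RPN for two rows of the FMEA tables in this section. The helper name `rpn` is illustrative and not part of the system.

```python
def rpn(severity: int, probability: int, detection: int) -> int:
    """Risk Priority Number: the product of three scores, each in 1-10."""
    for score in (severity, probability, detection):
        if not 1 <= score <= 10:
            raise ValueError(f"score out of range 1-10: {score}")
    return severity * probability * detection


# Scores copied from rows DG-002 and OC-005.
print(rpn(8, 5, 6))  # DG-002 -> 240
print(rpn(6, 3, 2))  # OC-005 -> 36
```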
1.1 DependencyGraph Failure Modes
| ID | Component | Failure Mode | Effect | Sev | Prob | Det | RPN | Mitigation |
|---|---|---|---|---|---|---|---|---|
| DG-001 | DependencyGraph | Cycle introduced in DAG | Deadlock: tasks wait forever for each other; system hangs | 9 | 3 | 4 | 108 | Cycle detection on every addEdge() using Tarjan's algorithm; reject cyclic insertions |
| DG-002 | DependencyGraph | Stale dependency data | Tasks execute against outdated state; incorrect outputs committed | 8 | 5 | 6 | 240 | Version vectors on nodes; staleness TTL with forced refresh; consistency checks before execution |
| DG-003 | DependencyGraph | Memory leak (unbounded growth) | OOM crash; service restart; all in-flight speculation lost | 7 | 4 | 5 | 140 | Bounded cache with LRU eviction; periodic GC of confirmed/rolled-back nodes; memory usage alerts |
| DG-004 | DependencyGraph | Concurrent modification race | Corrupted graph state; undefined traversal behavior | 8 | 4 | 7 | 224 | Reader-writer locks; copy-on-write for traversals; optimistic locking with retry |
| DG-005 | DependencyGraph | Node deletion with active dependents | Orphaned tasks with unresolvable dependencies | 7 | 3 | 4 | 84 | Reference counting; soft-delete with tombstones; orphan detection on traversal |
| DG-006 | DependencyGraph | Topological sort failure | Incorrect execution ordering; proof ordering violation | 9 | 2 | 3 | 54 | Fallback to verified sort implementation; cross-validation with multiple algorithms |
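
The DG-001 mitigation (reject cyclic insertions at `addEdge()`) can be illustrated with a small sketch. The table names Tarjan's algorithm; for a single edge insertion this example swaps in a plain reachability check instead, since adding parent→child closes a cycle exactly when parent is already reachable from child. The class and method names are illustrative assumptions, not the actual implementation.

```python
from collections import defaultdict


class DependencyGraph:
    """Toy DAG that rejects any edge insertion that would close a cycle."""

    def __init__(self):
        self._children = defaultdict(set)  # node -> set of child nodes

    def _reachable(self, start, target):
        # Iterative DFS along child edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self._children.get(node, ()))
        return False

    def add_edge(self, parent, child):
        """Insert parent -> child; return False and insert nothing on a cycle."""
        # The new edge closes a cycle exactly when parent is already
        # reachable from child (this also covers self-loops).
        if self._reachable(child, parent):
            return False
        self._children[parent].add(child)
        return True
```

With edges A→B and B→C already present, `add_edge("C", "A")` is rejected while `add_edge("A", "C")` is accepted.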
1.2 CommitmentLedger Failure Modes
| ID | Component | Failure Mode | Effect | Sev | Prob | Det | RPN | Mitigation |
|---|---|---|---|---|---|---|---|---|
| CL-001 | CommitmentLedger | Lost commitment record | Task appears unexecuted; duplicate execution; economic loss | 9 | 3 | 6 | 162 | WAL (write-ahead log); synchronous persistence; periodic consistency audit |
| CL-002 | CommitmentLedger | Wrong stake calculation | Under-bonding: insufficient slashing; Over-bonding: agents locked out | 7 | 4 | 5 | 140 | Stake calculation unit tests; on-chain validation; automated reconciliation |
| CL-003 | CommitmentLedger | Persistence failure (disk full/corruption) | All commitments lost on restart; massive rollback required | 10 | 2 | 4 | 80 | Multi-region replication; disk space monitoring; integrity checksums |
| CL-004 | CommitmentLedger | State transition violation | Commitment in invalid state; proof submitted for unconfirmed task | 9 | 3 | 4 | 108 | State machine with explicit transitions; transition validation; audit logging |
| CL-005 | CommitmentLedger | Commitment hash collision | Two different results with same hash; integrity compromise | 10 | 1 | 8 | 80 | SHA-256 minimum; collision-resistant hashing; include task ID in hash input |
| CL-006 | CommitmentLedger | Double commitment (same task) | Duplicate proofs; wasted compute; potential economic exploit | 6 | 4 | 3 | 72 | Unique constraint on task_id; idempotency keys; deduplication at insertion |
| CL-007 | CommitmentLedger | Cross-shard inconsistency | Partial commit visible; reads return inconsistent state | 8 | 3 | 6 | 144 | Single-shard design initially; distributed transactions if sharding needed |
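
The CL-005 mitigation (SHA-256 with the task ID in the hash input) might look like the sketch below. The function name and the length-prefixed encoding are assumptions made for illustration; the document does not specify how the inputs are serialized.

```python
import hashlib


def commitment_hash(task_id: str, result: bytes) -> str:
    """SHA-256 over a length-prefixed task ID followed by the result bytes."""
    task_bytes = task_id.encode("utf-8")
    h = hashlib.sha256()
    # The 4-byte length prefix keeps (task_id, result) pairs unambiguous:
    # ("ab", b"c") and ("a", b"bc") feed different byte streams to the hash.
    h.update(len(task_bytes).to_bytes(4, "big"))
    h.update(task_bytes)
    h.update(result)
    return h.hexdigest()
```

Binding the task ID into the digest means two tasks that produce identical result bytes still get distinct commitment hashes.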
1.3 ProofDeferralManager Failure Modes
| ID | Component | Failure Mode | Effect | Sev | Prob | Det | RPN | Mitigation |
|---|---|---|---|---|---|---|---|---|
| PDM-001 | ProofDeferralManager | Proof lost in queue | Task never finalized; permanent pending state; stake locked forever | 9 | 3 | 5 | 135 | Persistent queue with acknowledgments; retry with exponential backoff; dead letter queue |
| PDM-002 | ProofDeferralManager | Wrong proof ordering (submitted before ancestors) | Invalid state transition on-chain; proof rejected; cascading failures | 10 | 3 | 3 | 90 | Strict ordering invariant check before submission; dependency graph query; blocking on ancestors |
| PDM-003 | ProofDeferralManager | Deadlock in proof pipeline | All proof generation stops; backpressure to execution; system freeze | 8 | 4 | 5 | 160 | Deadlock detection; timeout on all locks; circuit breaker pattern |
| PDM-004 | ProofDeferralManager | Proof generation timeout | Claim expires before proof ready; wasted execution; potential slash | 7 | 5 | 3 | 105 | Proof generation time estimation; claim buffer validation; early termination |
| PDM-005 | ProofDeferralManager | Invalid proof generated | On-chain rejection; slash triggered; reputation damage | 8 | 3 | 4 | 96 | Pre-submission verification; proof validation before queue; automated testing |
| PDM-006 | ProofDeferralManager | Queue overflow (backpressure) | New proofs rejected; execution blocked; throughput collapse | 6 | 5 | 3 | 90 | Bounded queue with FIFO eviction of old proofs; admission control; load shedding |
| PDM-007 | ProofDeferralManager | Worker thread exhaustion | Proof generation latency spike; claim expiry cascade | 6 | 4 | 4 | 96 | Thread pool monitoring; auto-scaling; work stealing |
1.4 RollbackController Failure Modes
| ID | Component | Failure Mode | Effect | Sev | Prob | Det | RPN | Mitigation |
|---|---|---|---|---|---|---|---|---|
| RC-001 | RollbackController | Incomplete rollback cascade | Orphaned speculative tasks; inconsistent state; potential double-spend | 10 | 3 | 5 | 150 | Transactional rollback; atomic batch operations; post-rollback consistency check |
| RC-002 | RollbackController | Double rollback (same task) | State corruption; negative balances; system instability | 8 | 3 | 4 | 96 | Idempotent rollback operations; state tracking; rollback ledger |
| RC-003 | RollbackController | Missed task in rollback scope | Inconsistent system state; proof for invalid lineage may succeed | 9 | 4 | 6 | 216 | Complete graph traversal; affected set computation validation; reconciliation checks |
| RC-004 | RollbackController | Rollback during ongoing execution | Partial results; corrupt output; undefined behavior | 8 | 4 | 5 | 160 | Execution cancellation protocol; graceful task abort; state isolation |
| RC-005 | RollbackController | Concurrent rollback conflicts | Race condition; conflicting state mutations | 8 | 3 | 5 | 120 | Global rollback lock; ordered rollback queue; conflict resolution |
| RC-006 | RollbackController | Rollback notification failure | Downstream components unaware; continue processing invalid data | 7 | 3 | 4 | 84 | Synchronous notification; acknowledgment required; retry on failure |
| RC-007 | RollbackController | Resource cleanup failure | Memory/connection leaks; gradual degradation | 5 | 5 | 4 | 100 | Resource tracking; cleanup verification; periodic resource audit |
1.5 SpeculativeScheduler Failure Modes
| ID | Component | Failure Mode | Effect | Sev | Prob | Det | RPN | Mitigation |
|---|---|---|---|---|---|---|---|---|
| SS-001 | SpeculativeScheduler | Wrong speculation decision (speculate on low-confidence) | High rollback probability; wasted compute; economic loss | 6 | 5 | 5 | 150 | ML-based confidence scoring; historical success rates; conservative thresholds |
| SS-002 | SpeculativeScheduler | Resource exhaustion (memory) | OOM; service crash; all speculation lost | 8 | 4 | 4 | 128 | Memory limits per speculation branch; admission control; resource reservation |
| SS-003 | SpeculativeScheduler | Deadlock in scheduling queue | New tasks not scheduled; throughput drops to zero | 8 | 3 | 4 | 96 | Lock-free queue; timeout on queue operations; deadlock detection |
| SS-004 | SpeculativeScheduler | Depth limit bypass | Unbounded speculation chains; exponential rollback risk | 9 | 2 | 3 | 54 | Hard enforcement at multiple layers; depth validation before execution |
| SS-005 | SpeculativeScheduler | Starvation of low-priority tasks | Unfair scheduling; SLA violations | 5 | 4 | 4 | 80 | Fair scheduling with priority aging; guaranteed minimum throughput |
| SS-006 | SpeculativeScheduler | Speculation on expiring claims | Claim expires before proof submission; guaranteed rollback | 6 | 5 | 3 | 90 | Claim buffer validation (claimBufferMs); reject near-expiry tasks |
| SS-007 | SpeculativeScheduler | Confidence score manipulation | Attacker influences scheduling decisions; gaming the system | 7 | 3 | 6 | 126 | Input validation; rate limiting; anomaly detection |
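
The claim buffer validation named in SS-006 can be sketched as a single admission check. Only `claimBufferMs` comes from the table; the function name, the other parameters, and the exact comparison are assumptions.

```python
def accepts_speculation(now_ms: int, claim_expiry_ms: int,
                        estimated_proof_ms: int, claim_buffer_ms: int) -> bool:
    """Accept a task only if proof time plus the buffer fits before claim expiry."""
    return now_ms + estimated_proof_ms + claim_buffer_ms <= claim_expiry_ms


# A claim expiring in 10 s leaves room for a 3 s proof plus a 2 s buffer;
# a claim expiring in 4 s does not.
print(accepts_speculation(0, 10_000, 3_000, 2_000))  # True
print(accepts_speculation(0, 4_000, 3_000, 2_000))   # False
```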
1.6 On-Chain Failure Modes
| ID | Component | Failure Mode | Effect | Sev | Prob | Det | RPN | Mitigation |
|---|---|---|---|---|---|---|---|---|
| OC-001 | OnChainSync | Transaction failure (rejected) | Commitment not recorded; must retry or rollback | 6 | 5 | 2 | 60 | Retry with backoff; pre-flight simulation; error classification |
| OC-002 | OnChainSync | Chain reorganization (reorg) | Confirmed transaction becomes unconfirmed; false positive finality | 9 | 3 | 5 | 135 | Wait for sufficient confirmations (32+ slots); reorg detection; re-validation |
| OC-003 | OnChainSync | Race condition (double-submission) | Duplicate transactions; nonce errors; wasted fees | 5 | 4 | 3 | 60 | Nonce management; transaction deduplication; idempotency |
| OC-004 | OnChainSync | RPC node failure/latency | Cannot submit transactions; system stalls | 7 | 4 | 3 | 84 | Multi-RPC failover; health checks; connection pooling |
| OC-005 | OnChainSync | Insufficient funds for fees | Transaction rejected; proof submission fails | 6 | 3 | 2 | 36 | Balance monitoring; auto-refill from treasury; fee estimation |
| OC-006 | OnChainSync | Smart contract bug | Incorrect on-chain state; funds locked; catastrophic failure | 10 | 2 | 7 | 140 | Formal verification; audits; upgrade path; bug bounty |
| OC-007 | OnChainSync | Finality observer lag | Stale confirmation status; premature proof submission | 7 | 4 | 5 | 140 | Real-time WebSocket subscriptions; freshness validation; slot-based checks |
| OC-008 | OnChainSync | Bond escrow failure | Agent cannot stake; blocked from speculation | 6 | 3 | 3 | 54 | Pre-validation of bond accounts; retry mechanisms |
| OC-009 | OnChainSync | Slash distribution failure | Affected agents not compensated; trust violation | 7 | 2 | 4 | 56 | Atomic slash+distribution; compensation queue; manual resolution fallback |
Summary Statistics
| Component | Total Failure Modes | Max RPN | Avg RPN | Critical (RPN ≥ 150) |
|---|---|---|---|---|
| DependencyGraph | 6 | 240 | 142 | 2 |
| CommitmentLedger | 7 | 162 | 112 | 1 |
| ProofDeferralManager | 7 | 160 | 110 | 1 |
| RollbackController | 7 | 216 | 132 | 3 |
| SpeculativeScheduler | 7 | 150 | 103 | 1 |
| OnChainSync | 9 | 140 | 85 | 0 |
| TOTAL | 43 | 240 | 112 | 8 |
2. Risk Matrix
Visual Risk Matrix (Severity vs Probability)
| Severity ↓ / Probability → | 1-2 (Rare) | 3-4 (Low) | 5-6 (Medium) | 7-8 (High) | 9-10 (V.High) |
|---|---|---|---|---|---|
| 9-10 (Critical) | CL-005, CL-003, OC-006 | DG-001, RC-001, PDM-002, OC-002, RC-003 | — | — | — |
| 7-8 (High) | — | DG-004, SS-002, CL-007, RC-005, PDM-003, RC-004, OC-007, CL-002, PDM-005 | DG-002, PDM-004 | — | — |
| 5-6 (Medium) | — | CL-006, SS-005 | SS-001, SS-006, PDM-006, RC-007 | — | — |
| 3-4 (Low) | — | — | — | — | — |
| 1-2 (Negligible) | — | — | — | — | — |
Risk levels: CRITICAL (Sev x Prob > 40) — immediate action. HIGH (20-40) — action within 1 sprint. MEDIUM (10-20) — monitor and plan. LOW (< 10) — accept or defer.
Risk Category Distribution
| CRITICAL ZONE (Red) | HIGH ZONE (Orange) | MEDIUM ZONE (Yellow) |
|---|---|---|
| DG-002 (Stale data) | DG-001 (Cycles) | SS-001 (Wrong decision) |
| RC-003 (Missed task) | DG-004 (Race) | SS-006 (Expiring claims) |
| PDM-003 (Deadlock) | CL-001 (Lost commit) | PDM-004 (Timeout) |
| RC-004 (Mid-exec) | CL-004 (State viol.) | CL-002 (Wrong stake) |
| RC-001 (Incomplete) | PDM-001 (Lost proof) | SS-005 (Starvation) |
| | OC-002 (Reorg) | RC-007 (Cleanup) |
| | SS-002 (OOM) | OC-001 (Tx failure) |
3. Critical Risks (Top 5)
3.1 RISK-001: Stale Dependency Data (DG-002)
RPN: 240 | Severity: 8 | Probability: 5 | Detection: 6
Description
The DependencyGraph contains outdated information about task dependencies or completion status. This causes downstream tasks to execute against an assumed state that doesn't match the actual committed state.
Impact
Likelihood Assessment
Mitigation Strategy
| Layer | Action | Status |
|---|---|---|
| Prevention | Version vectors on all graph nodes; compare versions before execution | Planned |
| Prevention | Staleness TTL (configurable, default 5s) with forced refresh | Planned |
| Detection | Pre-execution consistency check: re-query critical dependencies | Planned |
| Detection | Hash of dependency state included in commitment; validate on-chain | Future |
| Recovery | Automatic rollback if staleness detected post-execution | Planned |
| Recovery | Reconciliation job comparing runtime and on-chain state | Future |
Residual Risk
After mitigations: RPN ~64 (Sev: 8, Prob: 2, Det: 4)
Owner
Team: Runtime Core
DRI: TBD (assign senior engineer)
Review Cadence: Weekly during implementation; Monthly post-launch
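The two prevention layers for this risk (version comparison before execution, and a staleness TTL with a 5 s default) can be combined into one pre-execution check. The sketch below is a simplification under stated assumptions: a single integer version stands in for the version vectors, and `NodeView` and `needs_refresh` are hypothetical names.

```python
from dataclasses import dataclass

STALENESS_TTL_S = 5.0  # default TTL from the mitigation table


@dataclass
class NodeView:
    task_id: str
    version: int       # version observed when the node was cached
    fetched_at: float  # seconds, on the same clock as `now`


def needs_refresh(view: NodeView, current_version: int, now: float,
                  ttl_s: float = STALENESS_TTL_S) -> bool:
    """True if the cached view is older than the TTL or its version lags."""
    too_old = (now - view.fetched_at) > ttl_s
    version_lag = view.version != current_version
    return too_old or version_lag
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic under test.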
3.2 RISK-002: Missed Task in Rollback Scope (RC-003)
RPN: 216 | Severity: 9 | Probability: 4 | Detection: 6
Description
When a rollback is triggered, the RollbackController fails to identify all affected downstream tasks. Some speculative tasks continue execution or remain in a pending state with invalid ancestry.
Impact
Likelihood Assessment
Mitigation Strategy
| Layer | Action | Status |
|---|---|---|
| Prevention | Snapshot graph state at rollback initiation; traverse immutable copy | Planned |
| Prevention | Mark-and-sweep approach: mark all, then sweep | Planned |
| Detection | Post-rollback consistency audit: verify no orphans exist | Planned |
| Detection | Invariant: every non-root task has confirmed or rolled-back ancestors | Planned |
| Recovery | Orphan detection job with automatic cleanup | Planned |
| Recovery | Manual reconciliation tools for operators | Future |
Residual Risk
After mitigations: RPN ~36 (Sev: 9, Prob: 2, Det: 2)
Owner
Team: Rollback/Recovery
DRI: TBD
Review Cadence: Weekly
3.3 RISK-003: Concurrent Modification Race in DependencyGraph (DG-004)
RPN: 224 | Severity: 8 | Probability: 4 | Detection: 7
Description
Multiple threads/processes simultaneously modify the dependency graph, leading to data races that corrupt the graph structure. This can result in lost edges, phantom edges, or inconsistent traversal results.
Impact
Likelihood Assessment
Mitigation Strategy
| Layer | Action | Status |
|---|---|---|
| Prevention | Reader-writer locks with write preference | Planned |
| Prevention | Copy-on-write semantics for read-heavy traversals | Planned |
| Prevention | Single-writer principle: mutations through serialized channel | Future |
| Detection | Invariant assertions on every read: validate structural integrity | Planned |
| Detection | Fuzz testing with concurrent operations | Planned |
| Recovery | Rebuild graph from source of truth (CommitmentLedger) on corruption | Planned |
Residual Risk
After mitigations: RPN ~48 (Sev: 8, Prob: 2, Det: 3)
Owner
Team: Runtime Core
DRI: TBD
Review Cadence: During code review; weekly testing
3.4 RISK-004: Lost Commitment Record (CL-001)
RPN: 162 | Severity: 9 | Probability: 3 | Detection: 6
Description
A commitment record is created in memory but fails to persist to durable storage. On restart or crash, the commitment is lost, making it appear the task was never executed.
Impact
Likelihood Assessment
Mitigation Strategy
| Layer | Action | Status |
|---|---|---|
| Prevention | Write-ahead log (WAL) with synchronous commit | Planned |
| Prevention | Persistence confirmation before proceeding | Planned |
| Prevention | Redundant storage (multi-region replication) | Future |
| Detection | Periodic reconciliation: runtime state vs storage | Planned |
| Detection | Heartbeat/health checks on storage system | Planned |
| Recovery | Recover from on-chain state as source of truth | Planned |
| Recovery | Re-execution with idempotency guarantees | Planned |
Residual Risk
After mitigations: RPN ~27 (Sev: 9, Prob: 1, Det: 3)
Owner
Team: Storage/Persistence
DRI: TBD
Review Cadence: Monthly reliability review
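A minimal sketch of the "WAL with synchronous commit" and "persistence confirmation before proceeding" layers follows. It assumes a newline-delimited JSON log on a local file; the record format and the function names are illustrative only.

```python
import json
import os


def append_commitment(wal_path: str, record: dict) -> None:
    """Append one record to the log and return only after it is flushed."""
    line = json.dumps(record, sort_keys=True) + "\n"
    with open(wal_path, "a", encoding="utf-8") as wal:
        wal.write(line)
        wal.flush()             # push Python's buffer to the OS
        os.fsync(wal.fileno())  # ask the OS to write through to the device


def replay(wal_path: str) -> list:
    """Rebuild the in-memory commitment list from the log after a restart."""
    if not os.path.exists(wal_path):
        return []
    with open(wal_path, encoding="utf-8") as wal:
        return [json.loads(line) for line in wal if line.strip()]
```

The caller treats a commitment as created only after `append_commitment` returns, so a crash before that point leaves no record that other components have already acted on.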
3.5 RISK-005: Deadlock in Proof Pipeline (PDM-003)
RPN: 160 | Severity: 8 | Probability: 4 | Detection: 5
Description
The proof generation pipeline enters a deadlock state where workers are waiting on each other or on resources that will never become available. All proof generation stops, causing backpressure that freezes the entire system.
Impact
Likelihood Assessment
Mitigation Strategy
| Layer | Action | Status |
|---|---|---|
| Prevention | Lock ordering discipline: always acquire in defined order | Planned |
| Prevention | Timeout on all blocking operations (no infinite waits) | Planned |
| Prevention | Circuit breaker: detect overload, reject new work | Planned |
| Detection | Deadlock detection: monitor worker thread states | Planned |
| Detection | Throughput monitoring: alert on zero proofs/minute | Planned |
| Recovery | Automatic worker restart on deadlock detection | Planned |
| Recovery | Manual kill switch to drain and restart pipeline | Planned |
Residual Risk
After mitigations: RPN ~40 (Sev: 8, Prob: 2, Det: 2.5)
Owner
Team: Proof Generation
DRI: TBD
Review Cadence: Weekly performance review
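The first two prevention layers (a fixed lock order and a timeout on every blocking acquire) can be combined in one helper. This is a sketch under assumptions: locks are identified by name, the global order is the sorted order of those names, and `acquire_in_order` and `LockTimeout` are hypothetical.

```python
import threading
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when a lock cannot be acquired within the timeout."""


@contextmanager
def acquire_in_order(locks, names, timeout_s=1.0):
    """Acquire the named locks in sorted-name order, each with a timeout."""
    held = []
    try:
        # Every caller uses the same global order, so two workers can never
        # each hold one lock while waiting for the other's.
        for name in sorted(set(names)):
            if not locks[name].acquire(timeout=timeout_s):
                raise LockTimeout(name)
            held.append(locks[name])
        yield
    finally:
        # Release whatever was acquired, newest first, even on timeout.
        for lock in reversed(held):
            lock.release()
```

A worker that hits `LockTimeout` holds no locks afterwards, so it can back off and retry instead of waiting indefinitely.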
4. Security Considerations
4.1 Economic Attacks
4.1.1 Griefing via Speculation
Attack Vector: Malicious agent intentionally triggers rollbacks to waste honest agents' compute resources.
Mechanism:
- Attacker claims parent task, creates speculative commitment
- Honest agents speculate on children, investing compute
- Attacker abandons parent or submits invalid proof
- All downstream work is wasted; honest agents bear compute cost
Impact: High (DoS on honest agents; economic drain)
Mitigations:
Residual Risk: Medium — Economic incentives reduce but don't eliminate griefing
4.1.2 Stake Manipulation
Attack Vector: Agent manipulates stake calculations to under-bond, reducing slashing penalty.
Mechanism:
- Exploit bug in stake calculation logic
- Report incorrect speculation depth
- Get slashed less than warranted
Impact: Medium (economic leakage; unfair advantage)
Mitigations:
Residual Risk: Low — On-chain validation is authoritative
4.1.3 Front-Running Speculation
Attack Vector: Observer sees high-value speculative commitment and front-runs to claim task first.
Mechanism:
- Monitor mempool for speculative commitments
- Race to claim same task before original agent
- Profit from information asymmetry
Impact: Medium (unfair competition; MEV extraction)
Mitigations:
Residual Risk: Medium — MEV is inherent to public blockchains
4.2 DoS Vectors
4.2.1 Proof Queue Flooding
Attack Vector: Submit massive number of proof requests to overwhelm pipeline.
Mechanism:
- Create many speculative tasks with dependencies
- Each generates proof requests
- Pipeline overloaded; legitimate proofs delayed
Impact: High (service degradation for all users)
Mitigations:
Residual Risk: Low — Multiple layers of defense
4.2.2 Graph Explosion Attack
Attack Vector: Create complex dependency graphs designed to slow traversal.
Mechanism:
- Create dense graph with many edges
- Trigger rollback that requires full traversal
- System slows to a crawl during traversal
Impact: Medium (temporary performance degradation)
Mitigations:
Residual Risk: Low — Limits bound worst-case
4.2.3 Storage Exhaustion
Attack Vector: Create many speculative commitments that never resolve.
Mechanism:
- Create speculative tasks at maximum allowed rate
- Never submit proofs
- Storage fills with pending commitments
Impact: Medium (disk space exhaustion)
Mitigations:
Residual Risk: Low — TTL and limits prevent unbounded growth
4.3 Information Leakage
4.3.1 Speculative State Exposure
Attack Vector: Probe speculative state to learn about pending computations.
Mechanism:
- Query dependency graph or commitment ledger
- Infer what tasks are being speculated on
- Gain competitive advantage
Impact: Low-Medium (information asymmetry)
Mitigations:
Residual Risk: Low — Access control is straightforward
4.3.2 Timing Side Channels
Attack Vector: Observe timing of operations to infer system state.
Mechanism:
- Measure response times for various queries
- Infer cache state, speculation depth, queue lengths
- Time attacks or gain unfair advantage
Impact: Low (difficult to exploit meaningfully)
Mitigations:
Residual Risk: Low — Timing attacks rarely critical here
4.4 Race Conditions
4.4.1 Rollback During Proof Submission
Attack Vector: Race between rollback signal and proof submission.
Mechanism:
- Task A fails, triggering rollback of dependent task B
- Simultaneously, B's proof is being submitted on-chain
- Proof succeeds before rollback completes
- Inconsistent state: B confirmed but A failed
Impact: High (invalid state committed on-chain)
Mitigations:
Residual Risk: Low — On-chain validation is the final authority
4.4.2 Concurrent Commitment Updates
Attack Vector: Two processes update same commitment concurrently.
Mechanism:
- Process A reads commitment state = SPECULATIVE
- Process B reads commitment state = SPECULATIVE
- A updates to PROVING
- B updates to ROLLED_BACK
- Final state depends on write order
Impact: Medium (inconsistent commitment state)
Mitigations:
Residual Risk: Low (optimistic concurrency control, OCC, is a well-understood pattern)
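The lost-update scenario in 4.4.2 can be closed with a version check at write time. The sketch below is one possible shape, assuming an in-memory store with a per-row version counter; `CommitmentStore` and `StaleWrite` are hypothetical names.

```python
import threading


class StaleWrite(Exception):
    """The commitment changed since the caller read it."""


class CommitmentStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._rows = {}  # task_id -> (state, version)

    def create(self, task_id):
        with self._lock:
            self._rows[task_id] = ("SPECULATIVE", 0)

    def read(self, task_id):
        with self._lock:
            return self._rows[task_id]

    def update(self, task_id, new_state, expected_version):
        """Apply the update only if nobody has written since the read."""
        with self._lock:
            _, version = self._rows[task_id]
            if version != expected_version:
                raise StaleWrite(
                    f"{task_id}: expected v{expected_version}, found v{version}")
            self._rows[task_id] = (new_state, version + 1)
```

If processes A and B both read version 0, A's update to PROVING succeeds and bumps the version to 1; B's update to ROLLED_BACK then raises `StaleWrite`, forcing B to re-read and decide again rather than silently overwriting A.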
5. Correctness Proof Sketches
5.1 Invariant: "Proof Never Submitted Before Ancestors Confirmed"
Formal Statement: For any task T with proof submitted at time t_s, all ancestors A ∈ ancestors(T) have confirmation time t_c(A) < t_s.
Proof Sketch:
- Base Case: Task T with no ancestors (root task)
- ancestors(T) = ∅
- Invariant trivially holds (vacuously true)
- Inductive Case: Task T with ancestors A₁, A₂, ..., Aₙ
Precondition (enforced by ProofDeferralManager):
- ProofDeferralManager maintains a blocking wait on each ancestor
- awaitAncestorConfirmation(T) blocks until ∀Aᵢ: status(Aᵢ) = CONFIRMED
Proof:
- Let T be any task with pending proof P
- Before submitProof(P) is called:
- checkSubmissionAllowed(T) queries CommitmentLedger
- For each Aᵢ ∈ ancestors(T):
- If status(Aᵢ) ≠ CONFIRMED: checkSubmissionAllowed returns FALSE
- Submission blocked until all ancestors confirmed
- Once all ancestors confirmed at times t_c(Aᵢ):
- checkSubmissionAllowed(T) returns TRUE at time t_check
- t_check > max(t_c(Aᵢ)) for all i
- submitProof(P) called at time t_s ≥ t_check
- Therefore: t_s > t_c(Aᵢ) for all ancestors Aᵢ ∎
- Edge Cases:
- Reorg: If ancestor Aᵢ is re-orged after t_s but before T's confirmation
- Detection: FinalityTracker monitors for reorgs
- Response: T's proof is also invalidated and must be resubmitted
- Concurrent ancestor confirmation: Multiple ancestors confirm simultaneously
- Handled: awaitAncestorConfirmation uses barrier synchronization
- All must confirm before barrier releases
Implementation Requirement: The checkSubmissionAllowed() function MUST be atomic with submitProof() to prevent TOCTOU race.
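One way to meet the atomicity requirement is to hold a single lock across both the ancestor check and the submission, and to route status changes through the same lock. The sketch below assumes an in-process design; `ProofSubmitter` and its methods are illustrative stand-ins for `checkSubmissionAllowed()` and `submitProof()`.

```python
import threading


class ProofSubmitter:
    """Holds one lock across the ancestor check and the submission."""

    def __init__(self, ancestors, status):
        self._ancestors = ancestors  # task -> list of ancestor tasks
        self._status = status        # task -> "CONFIRMED", "PENDING", ...
        self._lock = threading.Lock()
        self.submitted = []

    def set_status(self, task, new_status):
        # Status changes take the same lock, so none can land between
        # the check and the submit below.
        with self._lock:
            self._status[task] = new_status

    def submit_if_allowed(self, task):
        with self._lock:
            allowed = all(self._status.get(a) == "CONFIRMED"
                          for a in self._ancestors.get(task, []))
            if allowed:
                self.submitted.append(task)  # stand-in for submitProof(P)
            return allowed
```

A root task has no ancestors, so the `all(...)` over an empty list is true and the submission goes through, matching the base case of the proof.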
5.2 Invariant: "Rollback Cascade is Complete" (No Orphaned Tasks)
Formal Statement: After rollback(T) completes, for all tasks D ∈ descendants(T): status(D) ∈ {ROLLED_BACK, NEVER_STARTED}.
Proof Sketch:
- Algorithm:
rollbackCascade(T) operates as follows:
```
function rollbackCascade(T):
    affected = computeAffectedSet(T)
    for D in reverse_topological_order(affected):
        markRolledBack(D)
    validateNoOrphans(affected)
```
- Proof of Completeness:
Claim: computeAffectedSet(T) returns all descendants of T.
Proof:
- computeAffectedSet performs BFS from T following dependency edges
- DependencyGraph maintains bidirectional edges: parent→children, child→parents
- BFS visits every reachable node from T via child edges
- By definition, descendants(T) = reachable nodes via child edges
- Therefore, affected ⊇ descendants(T)
Claim: Every D ∈ affected is marked ROLLED_BACK.
Proof:
- Loop iterates over all elements in affected set
- Each iteration calls markRolledBack(D)
- markRolledBack is idempotent (marking twice is safe)
- After loop: ∀D ∈ affected: status(D) = ROLLED_BACK
Claim: No orphans exist after rollback.
Proof (by contradiction):
- Assume orphan O exists: O ∈ descendants(T) but status(O) ≠ ROLLED_BACK
- Since O ∈ descendants(T), O is reachable from T via child edges
- Therefore O ∈ affected (by completeness of BFS)
- But we proved ∀D ∈ affected: status(D) = ROLLED_BACK
- Contradiction. Therefore no orphan exists. ∎
- Validation Step:
validateNoOrphans(affected):
```
function validateNoOrphans(affected):
    for D in affected:
        assert status(D) == ROLLED_BACK
        for child in children(D):
            assert child in affected OR status(child) == NEVER_STARTED
```
- Edge Cases:
- Concurrent task creation: New descendant created during rollback
- Prevention: Acquire write lock on affected subgraph before rollback
- New tasks blocked until rollback completes
- Already confirmed descendant: D confirmed on-chain before rollback
- This should be impossible if ancestor hasn't confirmed (Invariant 5.1)
- If occurs due to bug: Critical alert; manual reconciliation required
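The cascade and its validation step can be exercised on a small graph. The following is a runnable sketch of the pseudocode in this section, assuming the graph is a plain dict of child lists, statuses are strings, and the affected set includes T itself; all function names are illustrative.

```python
from collections import deque


def compute_affected_set(children, root):
    """BFS along child edges: root plus every descendant, in visit order."""
    seen, order, queue = {root}, [root], deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def reverse_topological(children, affected):
    """Post-order DFS on a DAG: each task comes after all its descendants."""
    affected_set, done, order = set(affected), set(), []

    def visit(node):
        if node in done:
            return
        done.add(node)
        for child in children.get(node, []):
            if child in affected_set:
                visit(child)
        order.append(node)

    for node in affected:
        visit(node)
    return order


def validate_no_orphans(children, status, affected):
    affected_set = set(affected)
    for task in affected:
        assert status[task] == "ROLLED_BACK"
        for child in children.get(task, []):
            assert child in affected_set or status.get(child) == "NEVER_STARTED"


def rollback_cascade(children, status, root):
    affected = compute_affected_set(children, root)
    for task in reverse_topological(children, affected):
        status[task] = "ROLLED_BACK"  # idempotent: re-marking is harmless
    validate_no_orphans(children, status, affected)
    return affected
```

On a diamond T→A→C, T→B→C, the leaf C is rolled back before A and B, and T last; tasks outside the subgraph keep their status.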
5.3 Invariant: "Depth Limit Enforced" (Bounded Speculation)
Formal Statement: For all tasks T in speculative state, depth(T) ≤ max_depth where depth(T) = length of longest path from any confirmed ancestor to T.
Proof Sketch:
- Definition:
- depth(T) = 0 if T has no speculative ancestors (all ancestors confirmed)
- depth(T) = max(depth(parent) for parent in speculative_parents(T)) + 1
- Enforcement Points:
Point A: Task scheduling in SpeculativeScheduler
```
function scheduleSpeculative(T):
    currentDepth = computeSpeculativeDepth(T)
    if currentDepth >= config.max_depth:
        return REJECT_DEPTH_EXCEEDED
    // ... proceed with scheduling
```
Point B: Commitment creation in CommitmentLedger
```
function createCommitment(T, depth):
    if depth > config.max_depth:
        throw DepthExceededException
    // ... create commitment
```
Point C: On-chain validation (Solana program)
```rust
pub fn create_speculative_commitment(ctx: Context, depth: u8) -> Result<()> {
    require!(depth <= ctx.accounts.config.max_depth, ErrorCode::DepthExceeded);
    // ... proceed
}
```
- Proof of Enforcement:
Claim: No task T can enter speculative execution with depth > max_depth.
Proof:
- Path to speculative execution: scheduleSpeculative → executeTask → createCommitment
- At scheduleSpeculative: depth checked, rejected if exceeds
- At createCommitment: depth checked again (defense in depth)
- On-chain: final validation before commitment recorded
- A task with depth > max_depth would have to pass all three checks, and each check rejects it
- If any check fails, the task is not speculatively executed
- Therefore invariant holds ∎
- Depth Calculation Correctness:
Claim: computeSpeculativeDepth(T) correctly computes depth(T).
Proof:
```
function computeSpeculativeDepth(T):
    if all ancestors confirmed:
        return 0
    maxParentDepth = max(computeSpeculativeDepth(P) for P in parents(T) if P.speculative)
    return maxParentDepth + 1
```
- Base case: No speculative ancestors → depth = 0 ✓
- Inductive case: depth = max parent depth + 1
- This matches the definition exactly ∎
- Edge Cases:
- Ancestor confirms after depth calculation: Depth decreases; still within limit
- Config change lowering max_depth: Existing tasks grandfathered; new tasks use new limit
- Concurrent depth calculation: Each calculation independent; no race condition
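The recursive depth definition can be checked against a small chain. The sketch below assumes parents are stored as a dict of lists and speculative tasks as a set, and it treats "no speculative parents" as the base case, which matches "all ancestors confirmed" as long as a confirmed task never has a speculative ancestor (Invariant 5.1). The function name and the memo parameter are illustrative.

```python
def speculative_depth(task, parents, speculative, memo=None):
    """0 with no speculative parents, else 1 + max over speculative parents."""
    if memo is None:
        memo = {}
    if task in memo:
        return memo[task]
    spec_parents = [p for p in parents.get(task, []) if p in speculative]
    if not spec_parents:
        depth = 0
    else:
        depth = 1 + max(speculative_depth(p, parents, speculative, memo)
                        for p in spec_parents)
    memo[task] = depth  # cache so shared ancestors are computed once
    return depth


# R is confirmed; A, B, C are speculative and form the chain R -> A -> B -> C,
# with C also depending directly on R.
parents = {"A": ["R"], "B": ["A"], "C": ["B", "R"]}
speculative = {"A", "B", "C"}
print([speculative_depth(t, parents, speculative) for t in "ABC"])  # [0, 1, 2]
```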
6. Assumptions & Dependencies
6.1 System Assumptions
| ID | Assumption | Consequence if Violated | Validation |
|---|---|---|---|
| A-001 | Solana network is available and responsive | System cannot progress; tasks stall | RPC health monitoring; failover |
| A-002 | Clock skew between nodes < 10 seconds | Claim expiry calculations may be wrong | NTP sync required; bound checked |
| A-003 | Storage system is durable (survives crashes) | Data loss; inconsistent state | Use managed DBs with replication |
| A-004 | Network partitions are temporary (< 5 minutes) | Extended partition causes mass rollback | Partition detection; graceful degrade |
| A-005 | Agents are rational economic actors | Attacks may not follow economic models | Defense in depth; monitoring |
| A-006 | ZK proof generation is deterministic | Same inputs may produce different proofs | Validate proof system properties |
| A-007 | Hash functions are collision-resistant | Commitment integrity compromised | Use standard cryptographic hashes |
| A-008 | Solana slot time is approximately 400ms | Timing calculations may drift | Use slot numbers, not wall time |
| A-009 | Smart contract is bug-free after audit | On-chain state corruption possible | Audit; formal verification; upgrade path |
| A-010 | Runtime has sufficient memory for graph | OOM crash; service outage | Memory limits; monitoring; scaling |
6.2 External Dependencies
| Dependency | Type | Failure Mode | Impact | Mitigation |
|---|---|---|---|---|
| Solana RPC | Infrastructure | Unavailable / slow | Cannot submit transactions | Multi-RPC; health checks; failover |
| Solana Validator | Infrastructure | Network congestion | Delayed confirmations | Patience; retry; alternative routes |
| PostgreSQL | Data Store | Crash / corruption | State loss | Replication; backups; WAL |
| Redis | Cache | Crash / eviction | Performance degradation | Treat as cache, not source of truth |
| ZK Prover | Compute | Crash / timeout | Proofs not generated | Retry; alternative provers; timeout |
| Monitoring (Grafana) | Observability | Unavailable | Blind operations | Backup alerting; log analysis |
| Time Service (NTP) | Infrastructure | Drift | Incorrect expiry calculations | Multiple NTP sources; drift detection |
6.3 Dependency Failure Modes
Solana RPC Failure
| Failure | Impact | Response |
|---|---|---|
| RPC timeout | Tx not submitted | Retry with backoff |
| RPC error 429 | Rate limited | Back off; use alt RPC |
| RPC disconnect | Connection lost | Reconnect; resume from state |
| RPC wrong data | Bad chain state | Cross-validate with other RPCs |
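
The first two responses in the table (retry with backoff, switch to an alternate RPC) can be combined in one loop. This is a sketch under assumptions: `send` is a caller-supplied function that submits to one endpoint and raises on failure, endpoints are tried round-robin, and the delay doubles after each failure. No real RPC client API is used.

```python
import time


def submit_with_failover(endpoints, send, max_attempts=5,
                         base_delay_s=0.5, sleep=time.sleep):
    """Try endpoints round-robin, doubling the delay after each failure."""
    last_error = None
    for attempt in range(max_attempts):
        endpoint = endpoints[attempt % len(endpoints)]
        try:
            return send(endpoint)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

Injecting `sleep` makes the backoff schedule testable without waiting; production code would leave the default.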
Database Failure
| Failure | Impact | Response |
|---|---|---|
| Primary down | Writes blocked | Failover to replica |
| Replication lag | Stale reads | Route reads to primary |
| Disk full | All ops blocked | Alert; emergency cleanup |
| Corruption | Data loss | Restore from backup |
7. Monitoring & Detection
7.1 Critical Metrics
| Metric | Description | Alert Threshold | Response Time |
|---|---|---|---|
| speculation.rollback_rate | Rollbacks per minute | > 10/min | 5 minutes |
| speculation.depth.p99 | 99th percentile speculation depth | > max_depth - 1 | 15 minutes |
| speculation.orphan_count | Tasks with invalid ancestry | > 0 | Immediate |
| commitment.pending_duration_p99 | Time in pending state | > 120s | 10 minutes |
| proof.queue_depth | Pending proof requests | > 80% capacity | 5 minutes |
| proof.generation_errors | Failed proof generations | > 5/min | 5 minutes |
| dependency_graph.cycle_detected | Cycle detection events | > 0 | Immediate |
| dependency_graph.memory_bytes | Graph memory usage | > 80% limit | 15 minutes |
| rollback.cascade_size_p99 | Tasks affected per rollback | > 50 | 10 minutes |
| rollback.duration_p99 | Rollback completion time | > 30s | 5 minutes |
| onchain.tx_failure_rate | Transaction failure rate | > 5% | 5 minutes |
| onchain.confirmation_lag | Time to confirmation | > 60s | 10 minutes |
7.2 Detection Strategies by Failure Mode
DG-002: Stale Dependency Data
| Detection Method | Metric/Signal | Alert Condition |
|---|---|---|
| Freshness check | dependency.staleness_age_ms | p99 > 5000ms |
| Version mismatch | dependency.version_mismatch_count | > 0 in 1 minute |
| Reconciliation diff | dependency.reconciliation_drift | > 0 items |
Response: Trigger full graph refresh; investigate source of staleness.
RC-003: Missed Task in Rollback
| Detection Method | Metric/Signal | Alert Condition |
|---|---|---|
| Orphan detector | rollback.orphan_tasks_detected | > 0 |
| Consistency audit | rollback.post_audit_failures | > 0 |
| Anomaly detection | Tasks with speculative ancestors that are rolled_back | exists |
Response: Immediate page; run manual reconciliation; root cause analysis.
PDM-003: Deadlock in Proof Pipeline
| Detection Method | Metric/Signal | Alert Condition |
|---|---|---|
| Throughput drop | proof.completed_per_minute | < 1 for 5 minutes |
| Queue growth | proof.queue_depth increasing | monotonic for 10 minutes |
| Worker health | proof.active_workers | < expected |
| Lock analysis | Thread dump analysis | Deadlock pattern detected |
Response: Restart proof workers; if persists, full service restart.
CL-001: Lost Commitment Record
| Detection Method | Metric/Signal | Alert Condition |
|---|---|---|
| Write confirmation | commitment.write_failures | > 0 |
| Reconciliation | commitment.missing_vs_onchain | > 0 |
| WAL monitoring | storage.wal_behind_bytes | > 1MB |
Response: Check storage health; recover from WAL or on-chain state.
OC-002: Chain Reorganization
| Detection Method | Metric/Signal | Alert Condition |
|---|---|---|
| Reorg detector | onchain.reorg_depth | > 0 |
| Confirmation reversal | commitment.confirmation_reverted | > 0 |
| Slot tracking | onchain.slot_regression | detected |
Response: Re-validate all recently confirmed commitments; rollback if needed.
7.3 Alerting Tiers
| Tier | Severity | Response Time | Escalation | Examples |
|---|---|---|---|---|
| P1 | Critical | < 15 minutes | Immediate page | Orphan detected, deadlock, data loss |
| P2 | High | < 1 hour | Page if not ack'd | High rollback rate, memory pressure |
| P3 | Medium | < 4 hours | Slack notification | Elevated latency, queue growth |
| P4 | Low | Next business day | Ticket | Deprecation warnings, minor anomalies |
7.4 Dashboard Panels
Speculation Health Dashboard
The dashboard displays six key metric cards and a time-series chart:
| Metric | Example Value | Status |
|---|---|---|
| Active Speculations | 1,247 | +12% vs avg |
| Rollback Rate | 0.3/min | Normal |
| Avg Depth | 2.4 | Below limit |
| Proof Queue | 234 (23%) | Healthy |
| Confirmation Lag | 8.2s avg | Normal |
| Memory Usage | 3.2 GB | Warning (80%) |
A time-series panel tracks Speculation Rate (tasks/min) over 24 hours for trend analysis.
7.5 Runbook Integration
Each high-risk failure mode maps to an operational runbook procedure:
| Failure Mode | Runbook Procedure |
|---|---|
| DG-001 (Cycle) | Cycle Detected |
| DG-002 (Stale) | Stale Dependency |
| CL-001 (Lost) | Lost Commitment |
| PDM-003 (Deadlock) | Proof Deadlock |
| RC-001 (Incomplete) | Incomplete Rollback |
| OC-002 (Reorg) | Chain Reorg |
Appendix A: Risk Register Summary
| Rank | ID | Risk | RPN | Status | Owner |
|---|---|---|---|---|---|
| 1 | DG-002 | Stale dependency data | 240 | Mitigating | Runtime Core |
| 2 | DG-004 | Concurrent modification race | 224 | Mitigating | Runtime Core |
| 3 | RC-003 | Missed task in rollback | 216 | Mitigating | Rollback Team |
| 4 | CL-001 | Lost commitment record | 162 | Mitigating | Storage Team |
| 5 | PDM-003 | Proof pipeline deadlock | 160 | Mitigating | Proof Team |
| 6 | RC-004 | Rollback during execution | 160 | Planned | Rollback Team |
| 7 | SS-001 | Wrong speculation decision | 150 | Planned | Scheduler Team |
| 8 | RC-001 | Incomplete rollback cascade | 150 | Planned | Rollback Team |
| 9 | CL-007 | Cross-shard inconsistency | 144 | Deferred | Storage Team |
| 10 | DG-003 | Memory leak | 140 | Planned | Runtime Core |
Appendix B: Review History
| Date | Reviewer | Changes |
|---|---|---|
| 2025-01-28 | Initial | Document created |