Orchestration Loop
The orchestration loop is the engine that drives Stoneforge. It’s a continuously running background process — the Dispatch Daemon — that connects Directors, Workers, and Stewards by polling for work, assigning tasks, routing messages, triggering workflows, and recovering from failures.
The big picture
Here’s the end-to-end flow from a human goal to merged code on main:
                       ┌─────────────────────────────────────┐
                       │                HUMAN                │
                       │   (goals, requests, supervision)    │
                       └─────────────────┬───────────────────┘
                                         │
                               sends message / goal
                                         │
                                         ▼
                       ┌─────────────────────────────────────┐
                       │              DIRECTOR               │
                       │  - Creates plans                    │
                       │  - Creates tasks with priorities    │
                       │  - Sets dependencies                │
                       └─────────────────┬───────────────────┘
                                         │
                                   creates tasks
                                         │
                                         ▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│                       DISPATCH DAEMON (continuous polling)                       │
│                                                                                  │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐      │
│  │    Worker     │  │    Message    │  │    Steward    │  │   Workflow    │      │
│  │ Availability  │  │    Routing    │  │    Trigger    │  │     Task      │      │
│  │    Polling    │  │               │  │    Polling    │  │    Polling    │      │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘      │
└──────────┼──────────────────┼──────────────────┼──────────────────┼──────────────┘
           │                  │                  │                  │
           ▼                  ▼                  ▼                  ▼
   ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐
   │   EPHEMERAL   │  │  Route msgs   │  │    Create     │  │  Dispatch to  │
   │    WORKERS    │  │   by agent    │  │ workflow from │  │   available   │
   │               │  │  role; spawn  │  │   triggered   │  │    steward    │
   │ - Task exec   │  │  triage for   │  │   playbook    │  │               │
   │ - Commit/push │  │  idle agents  │  │               │  │               │
   │ - Close/hoff  │  │               │  │               │  │               │
   └───────┬───────┘  └───────────────┘  └───────┬───────┘  └───────┬───────┘
           │                                     │                  │
           │                                     └────────┬─────────┘
           │ creates merge request                        │
           │ on completion                                ▼
           │                               ┌─────────────────────────────┐
           └──────────────────────────────▶│          STEWARDS           │
                                           │  - Merge review & testing   │
                                           │  - Docs maintenance         │
                                           └──────────────┬──────────────┘
                                                          │
                                              merge or create fix task
                                                          │
                                                          ▼
                                           ┌─────────────────────────────┐
                                           │         MAIN BRANCH         │
                                           │    (merged, clean code)     │
                                           └─────────────────────────────┘

Flow summary
- Human sends a goal to the Director via the Director Panel
- Director creates a plan with tasks, priorities, and dependencies
- Dispatch Daemon continuously polls for ready work in multiple loops
- Ephemeral Workers execute tasks in isolated worktrees, commit, push, and close
- Stewards handle merge review, run tests, and squash-merge passing branches
- Completed work lands on the main branch
The daemon is the connective tissue. Without it, tasks would sit unassigned and messages would go undelivered.
The Dispatch Daemon
The daemon is a single process that runs multiple polling loops on a configurable interval (default: 5 seconds). Each cycle runs the loops in a fixed order:
┌─────────────────────────────────────────────────┐
│                DAEMON POLL CYCLE                │
│                                                 │
│  1. Orphan Recovery Polling                     │
│     (recover workers from server restart)       │
│              │                                  │
│              ▼                                  │
│  2. Message Routing                             │
│     (route inbox messages, spawn triage)        │
│              │                                  │
│              ▼                                  │
│  3. Worker Availability Polling                 │
│     (assign tasks to idle workers)              │
│              │                                  │
│              ▼                                  │
│  4. Steward Trigger Polling                     │
│     (activate playbooks on events)              │
│              │                                  │
│              ▼                                  │
│  5. Workflow Task Polling                       │
│     (assign workflow tasks to stewards)         │
│              │                                  │
│              ▼                                  │
│  6. Closed-Unmerged Reconciliation              │
│     (recover stuck tasks)                       │
│              │                                  │
│              ▼                                  │
│  7. Stuck Merge Recovery                        │
│     (reset stalled merging/testing tasks)       │
│              │                                  │
│              ▼                                  │
│  8. Plan Auto-Completion                        │
│     (complete plans when all tasks closed)      │
│                                                 │
│       ─── sleep(interval) ─── repeat ───        │
└─────────────────────────────────────────────────┘

The ordering matters. Orphan recovery runs first to handle any workers that were interrupted. Message routing runs before worker availability so that agents with unread triage messages are excluded from new task assignment.
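The fixed ordering of the poll cycle can be illustrated with a minimal Python sketch. This is not Stoneforge's implementation: the names run_cycle and run_daemon, the log format, and the choice to let one failing phase not block the later phases are all assumptions made for illustration. Only the phase order and the 5-second default interval come from the description above.

```python
import time
from typing import Callable, List, Optional, Tuple

# A phase is a (name, callable) pair; the list order is the execution order.
Phase = Tuple[str, Callable[[], None]]


def run_cycle(phases: List[Phase], log: List[str]) -> None:
    """Run every phase once, in list order.

    Assumption: a failure in one phase is logged and does not stop
    the phases that follow it.
    """
    for name, phase in phases:
        try:
            phase()
            log.append(f"ok:{name}")
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")


def run_daemon(
    phases: List[Phase],
    interval_seconds: float = 5.0,       # default interval from the docs
    max_cycles: Optional[int] = None,    # hypothetical knob so the sketch can stop
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Repeat the cycle, sleeping between cycles, and return the log."""
    log: List[str] = []
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        run_cycle(phases, log)
        cycles += 1
        sleep(interval_seconds)
    return log
```

In use, the phases list would hold the eight loops in the order shown in the diagram, with orphan recovery first and plan auto-completion last.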
Worker availability polling
This is the core task-assignment loop. On each cycle:
- Find all registered ephemeral workers without an active session
- Skip workers with unread inbox messages (they need triage first)
- For each available worker, query for the highest-priority unassigned task that isn’t blocked
- Assign the task to the worker
- Send a dispatch message to the worker’s inbox with the full task context
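The selection steps above can be sketched in Python as follows. The Worker and Task shapes, the function name assign_ready_tasks, and the rule that a lower priority number means higher priority are assumptions for illustration, not Stoneforge's actual data model. Sending the dispatch message is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Worker:
    id: str
    has_active_session: bool = False
    unread_messages: int = 0


@dataclass
class Task:
    id: str
    priority: int                 # assumption: lower number = higher priority
    assignee: Optional[str] = None
    blocked: bool = False


def assign_ready_tasks(workers: List[Worker], tasks: List[Task]) -> Dict[str, str]:
    """Return {worker_id: task_id} for every assignment made in this cycle."""
    assignments: Dict[str, str] = {}
    for worker in workers:
        # Busy workers and workers that need triage first are skipped.
        if worker.has_active_session or worker.unread_messages > 0:
            continue
        ready = [t for t in tasks if t.assignee is None and not t.blocked]
        if not ready:
            break  # nothing left to hand out
        task = min(ready, key=lambda t: t.priority)
        task.assignee = worker.id
        assignments[worker.id] = task.id
    return assignments
```

Recomputing the ready list for each worker keeps the sketch simple and guarantees a task is never handed to two workers in the same cycle.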
The dispatch message includes everything the worker needs to start:
## Task Assignment

Worker ID: e-worker-1
Director ID: director
Task ID: el-3a8f
Title: Add OAuth login
Priority: 1

### Description
{full description content, including any handoff notes from previous workers}

### Acceptance Criteria
{acceptance criteria if present}

### Instructions
1. Read the task title and acceptance criteria carefully.
2. Complete the task: make changes, commit, push, then run: sf task complete el-3a8f

Message routing
The daemon routes messages differently based on agent role:
| Agent type | Has active session | Behavior |
|---|---|---|
| Director | Yes | Forward as user input (with idle debounce) |
| Director | No | Messages accumulate until session starts |
| Ephemeral Worker | Yes | Leave unread — active session will handle it |
| Ephemeral Worker | No | Accumulate for triage batch |
| Persistent Worker | Yes | Forward as user input in real-time |
| Persistent Worker | No | Wait until session starts |
| Steward | Yes | Leave unread — active session will handle it |
| Steward | No | Accumulate for triage batch |
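The table reads naturally as a lookup keyed on agent role and session state. The sketch below encodes it that way in Python; the enum names and the route function are hypothetical and only mirror the rows of the table.

```python
from enum import Enum


class Role(Enum):
    DIRECTOR = "director"
    EPHEMERAL_WORKER = "ephemeral_worker"
    PERSISTENT_WORKER = "persistent_worker"
    STEWARD = "steward"


class Action(Enum):
    FORWARD_DEBOUNCED = "forward as user input (with idle debounce)"
    FORWARD_REALTIME = "forward as user input in real time"
    LEAVE_UNREAD = "leave unread for the active session"
    ACCUMULATE_FOR_TRIAGE = "accumulate for triage batch"
    WAIT_FOR_SESSION = "hold until a session starts"


# One entry per table row: (role, has_active_session) -> behavior.
ROUTING = {
    (Role.DIRECTOR, True): Action.FORWARD_DEBOUNCED,
    (Role.DIRECTOR, False): Action.WAIT_FOR_SESSION,
    (Role.EPHEMERAL_WORKER, True): Action.LEAVE_UNREAD,
    (Role.EPHEMERAL_WORKER, False): Action.ACCUMULATE_FOR_TRIAGE,
    (Role.PERSISTENT_WORKER, True): Action.FORWARD_REALTIME,
    (Role.PERSISTENT_WORKER, False): Action.WAIT_FOR_SESSION,
    (Role.STEWARD, True): Action.LEAVE_UNREAD,
    (Role.STEWARD, False): Action.ACCUMULATE_FOR_TRIAGE,
}


def route(role: Role, has_active_session: bool) -> Action:
    """Look up the routing behavior for one incoming message."""
    return ROUTING[(role, has_active_session)]
```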
Message triage
When an idle agent (no active session) has accumulated unread non-dispatch messages, the daemon spawns a triage session to process them. This ensures agents don’t miss messages that arrive while they’re between tasks.
Key rules:
- Messages are grouped by originating channel
- Only one triage session per agent per poll cycle
- Triage sessions operate in a temporary detached worktree
- Triage takes priority over new task assignment — an agent with pending triage won’t receive a new task until the triage is complete
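As a rough illustration of the grouping rules, the Python sketch below builds at most one triage batch per idle agent, grouped by channel, and ignores dispatch messages and messages already read. The Message fields and the function name build_triage_batches are assumptions, and the detached worktree is out of scope here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Message:
    id: str
    recipient: str
    channel: str
    is_dispatch: bool = False
    read: bool = False


def build_triage_batches(
    messages: List[Message], idle_agents: Set[str]
) -> Dict[str, Dict[str, List[str]]]:
    """Return {agent_id: {channel: [message_id, ...]}}.

    One dictionary entry per agent corresponds to one triage session
    for that agent in this poll cycle.
    """
    batches: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for message in messages:
        if message.read or message.is_dispatch:
            continue  # only unread non-dispatch messages need triage
        if message.recipient not in idle_agents:
            continue  # agents with an active session handle their own inbox
        batches[message.recipient][message.channel].append(message.id)
    return {agent: dict(by_channel) for agent, by_channel in batches.items()}
```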
Steward triggers and workflows
Stewards are activated by triggers, not by direct task assignment. The trigger polling loop watches for conditions that should activate a steward:
- Check for triggered conditions (e.g., a task moved to REVIEW status)
- For each triggered condition, create a new workflow from the associated playbook template
- The workflow task is picked up by workflow task polling, which assigns it to an available steward
This two-step process (trigger → workflow → assignment) means steward work goes through the same dispatch machinery as everything else. It also means workflows are durable — if a steward fails mid-workflow, the workflow can be resumed by another steward.
Orphan recovery
When the orchestrator server restarts while ephemeral workers are mid-task, those workers lose their processes but their task assignments persist. Without recovery, these tasks would be stuck forever — assigned to workers that no longer exist.
The problem:
Before restart: e-worker-1 → session running → task "el-3a8f" in progress
After restart: e-worker-1 → session idle → task "el-3a8f" still assigned, status IN_PROGRESS
(worker availability polling skips this worker because it already has an assigned task)

The solution:
Orphan recovery runs at the start of each poll cycle. For each worker with an assigned task but no active session:
- Try resume first: if a previous sessionId exists in the task metadata, attempt to resume that session with a prompt explaining the restart
- Fall back to fresh spawn: if there is no session ID or the resume fails, spawn a fresh session with the full task prompt
- Reuse existing worktree: if the original worktree still exists, the worker continues from the existing code state
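The resume-then-spawn fallback might look like the Python sketch below. The OrphanedTask shape and the resume_session and spawn_session callables are hypothetical stand-ins for whatever the daemon actually calls, and the sketch assumes a failed resume surfaces as a RuntimeError.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OrphanedTask:
    task_id: str
    worker_id: str
    session_id: Optional[str] = None   # mirrors sessionId in the task metadata
    worktree_exists: bool = False


def recover(
    task: OrphanedTask,
    resume_session: Callable[[str], None],
    spawn_session: Callable[..., None],
) -> str:
    """Recover one orphaned task and report which path was taken."""
    if task.session_id is not None:
        try:
            resume_session(task.session_id)
            return "resumed"
        except RuntimeError:
            pass  # assumption: resume failure raises; fall through to a fresh spawn
    # Reuse the original worktree when it survived the restart.
    spawn_session(task.task_id, reuse_worktree=task.worktree_exists)
    return "spawned"
```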
Closed-unmerged reconciliation
This is a safety net for tasks that end up in an inconsistent state. A task can reach CLOSED status without being merged — for example, if sf task close is run on a REVIEW task, or due to race conditions between CLI commands and steward processing.
These tasks appear in the “Awaiting Merge” section of the web dashboard but are invisible to merge stewards, which only query for status: REVIEW.
The reconciliation loop:
- Query for tasks with status: CLOSED and a non-merged mergeStatus
- Skip tasks closed within the grace period (default: 120 seconds) to avoid racing with in-progress close+merge sequences
- Skip tasks with reconciliationCount >= 3 as a safety valve against infinite loops
- Move the task back to REVIEW status and increment the reconciliation counter
This ensures stuck tasks eventually get picked up by a merge steward.
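A minimal Python sketch of the reconciliation filter, under stated assumptions: the Task fields are snake_case stand-ins for mergeStatus and reconciliationCount, the closed_at timestamp is a hypothetical field in seconds, and "merged" is assumed to be the mergeStatus value that marks a task as already merged. The 120-second grace period and the limit of 3 come from the list above.

```python
from dataclasses import dataclass
from typing import List

GRACE_PERIOD_SECONDS = 120   # default grace period from the docs
MAX_RECONCILIATIONS = 3      # safety valve from the docs


@dataclass
class Task:
    id: str
    status: str
    merge_status: str
    closed_at: float             # hypothetical: seconds since some epoch
    reconciliation_count: int = 0


def reconcile(tasks: List[Task], now: float) -> List[str]:
    """Move eligible closed-but-unmerged tasks back to REVIEW; return their IDs."""
    moved: List[str] = []
    for task in tasks:
        if task.status != "CLOSED" or task.merge_status == "merged":
            continue
        if now - task.closed_at < GRACE_PERIOD_SECONDS:
            continue  # a close+merge sequence may still be in flight
        if task.reconciliation_count >= MAX_RECONCILIATIONS:
            continue  # stop retrying to avoid an infinite loop
        task.status = "REVIEW"
        task.reconciliation_count += 1
        moved.append(task.id)
    return moved
```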
Stuck merge recovery
Merge operations can stall due to crashes, timeouts, or race conditions — leaving tasks with a merging or testing mergeStatus indefinitely. These tasks are invisible to the merge steward because it only picks up tasks with pending mergeStatus.
The stuck merge recovery loop:
- Query for tasks with a mergeStatus of merging or testing
- Skip tasks within the grace period (default: 10 minutes) to avoid racing with in-progress merge operations
- Skip tasks with a recovery count >= 3 as a safety valve
- Reset the task’s mergeStatus back to pending and increment the recovery counter
This ensures stalled merges don’t block the pipeline permanently.
Plan auto-completion
When all tasks in an active plan reach CLOSED status, there’s no need for a human to manually close the plan. The plan auto-completion loop handles this automatically:
- Query for all plans with active status
- For each plan, check if every non-tombstone child task is CLOSED
- If all tasks are closed, mark the plan as completed
This uses the core canAutoComplete() function to determine eligibility, ensuring consistent behavior between the daemon and manual plan management.
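The eligibility check can be approximated with the Python sketch below. It is a stand-in written for illustration, not the real canAutoComplete(): the Plan and Task shapes, the tombstone flag, and the decision that a plan with no live tasks is not auto-completed are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    status: str
    tombstone: bool = False      # assumption: deleted tasks are flagged like this


@dataclass
class Plan:
    id: str
    status: str
    tasks: List[Task] = field(default_factory=list)


def is_eligible(plan: Plan) -> bool:
    """True when the plan is active and every non-tombstone task is CLOSED."""
    live = [t for t in plan.tasks if not t.tombstone]
    # Assumption: an active plan with no live tasks is left alone.
    return plan.status == "active" and bool(live) and all(
        t.status == "CLOSED" for t in live
    )


def auto_complete(plans: List[Plan]) -> List[str]:
    """Mark every eligible plan as completed and return the affected IDs."""
    completed: List[str] = []
    for plan in plans:
        if is_eligible(plan):
            plan.status = "completed"
            completed.append(plan.id)
    return completed
```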