trail-boss/docs/notes/decisions.md
jedarden 675c1e531d docs: gap-review round 1 — reconcile supporting docs + close spec gaps
Fix 10 gaps found by fresh-eyes review:
- CRITICAL: mechanics doc still documented send-keys-relay/Agent-SDK delivery
  and the Notification hook; rewrote DELIVER to the navigator model, dropped
  Notification throughout
- HIGH: decisions.md + related-work.md still referenced priority "ranking",
  /reply dispatch, send-keys/SDK delivery, and the dead plan/question/idle
  reason taxonomy; aligned to FIFO + navigation + permission/stopped
- collector and daemon stated as one process; pull-not-broadcast presentation
- added trust boundary (loopback-only ingest, single-host trust assumption)
- resolved auto-advance focus-steal hazard (operator-initiated jump) and
  specified skip re-ordering (move to tail + cooldown) to avoid livelock
- downgraded SessionEnd to "not yet probed" (probe only saw SessionStart/Stop)
- dropped forked collector's WebSocket layer; most-stuck → oldest-stuck (FIFO)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 15:18:56 -04:00

5.3 KiB

Decisions & rationale

Naming

Trail Boss — on a cattle drive, the trail boss is in overall command: sets the direction, makes the calls, and rides in when a steer bogs down or strays. The product runs a herd of agent sessions; when one gets stuck it reports in, and you — the trail boss — ride over and set it right. The metaphor maps cleanly onto the mechanism:

  • the herd grazing the range → sessions working autonomously
  • a steer bogs down or strays → a Stop / PermissionRequest hook fires; the collector flags the session stuck
  • the trail boss rides over and sets it right → you read the context and give the order (reply) or wave it on (skip); the queue surfaces stuck sessions oldest-first (flat FIFO, no priority ranking)

Names considered and rejected

  • agent-inbox — clearest literal description, but collides head-on with langchain-ai/agent-inbox, an existing human-in-the-loop inbox for LangGraph agents. Would read as derivative and lose every search.
  • agent-attention — names the value prop (your attention is the scarce resource being routed), but risks reading as the ML "attention" mechanism.
  • agent-central — self-explanatory but generic, and "central" reads like a passive dashboard/hub rather than an act-on-the-stuck-one tool.

Trail Boss keeps a memorable, distinctive identity; the tagline carries the legibility for newcomers.

Design decisions

Hooks, not polling

Detection is event-driven via Claude Code hooks. A session emits a signal the moment control returns to a human: while actively working it emits PreToolUse/PostToolUse, never Stop. A session counts as waiting only once Stop or PermissionRequest has fired and no UserPromptSubmit has come since. Confirmed by probe (2026-05-25): both SessionStart and Stop fire in interactive and -p modes, the Stop payload carries last_assistant_message (queue context for free), and hook commands inherit the ambient environment.

Stuck = needs attention, and stuck is stuck

A session that has stopped or is waiting at a permission prompt cannot progress until the human responds — so it needs intervention by definition. Two collapses follow: there is no "finished but fine" state (every stop is a queue item), and there is no permission-vs-stopped priority (it doesn't matter why it's stuck). Stop and PermissionRequest are both required detection triggers — a permission-blocked session is mid-turn and emits no Stop — but they're treated identically; reason is display-only and the queue is a flat FIFO dead-letter queue. Notification is dropped (it adds nothing those two miss). The operator simply depletes the queue, and the next stuck session auto-loads.

Navigator, not relay (the delivery model)

Trail Boss routes attention, it does not inject input. Sessions stay as long-running live CLIs in tmux panes (Model A), and delivery happens by navigating the operator to the live pane (switch-client/select-window/select-pane, optionally link-window to co-display) where they interact with the real prompt directly. This dissolves the send-keys fidelity problem, makes "edit before allow" native (you just type), and means no synthesized input ever reaches a session.

Rejected delivery alternatives:

  • Resume-to-deliver (claude --resume <id> in a second process): a live interactive CLI holds in-memory state and does not re-read its transcript, so a resumed process's reply never reaches the original pane; concurrent attach risks transcript divergence. --fork-session confirms plain --resume reuses the session. Only viable in a no-resident-process model (Model B), which we rejected for v1.
  • send-keys relay as the primary path: retained only as a secondary plain-text option (basic submission confirmed working); native interaction is preferred.
  • claude --remote-control: routes to the claude.ai / desktop / mobile surface, not a local channel — useless for a same-host tool.
  • Agent SDK canUseTool (Model B): programmatic permission gating with updatedInput is attractive, but requires running sessions under the SDK instead of the terminal — deferred; the tmux-navigator model fits the existing workflow and the durability requirement.

Same-host daemon, durable via tmux

Trail Boss does not need to live inside tmux to drive it — tmux is client/server, so any same-user process issues tmux commands to the server (pane ids are server-global). The control plane is an always-on daemon; presentation is transient (display-popup + keybinding). But for durability across SSH disconnect the daemon must survive SIGHUP, so it runs in its own tmux window (simplest) or under systemd --user (also survives reboot; tmux does not). Agents already persist because the tmux server is host-side. While disconnected, the daemon and hooks keep running, so the queue accumulates the backlog and disconnecting becomes a non-event.

The transcript is ground truth

Hooks are a low-latency notification; the transcript JSONL is authoritative. A reconcile loop corrects dropped hook POSTs, daemon restarts, and "answered directly in the pane" by checking whether a session's transcript has advanced past its last Stop.