Fix 10 gaps found by fresh-eyes review: - CRITICAL: mechanics doc still documented send-keys-relay/Agent-SDK delivery and the Notification hook; rewrote DELIVER to the navigator model, dropped Notification throughout - HIGH: decisions.md + related-work.md still referenced priority "ranking", /reply dispatch, send-keys/SDK delivery, and the dead plan/question/idle reason taxonomy; aligned to FIFO + navigation + permission/stopped - collector and daemon stated as one process; pull-not-broadcast presentation - added trust boundary (loopback-only ingest, single-host trust assumption) - resolved auto-advance focus-steal hazard (operator-initiated jump) and specified skip re-ordering (move to tail + cooldown) to avoid livelock - downgraded SessionEnd to "not yet probed" (probe only saw SessionStart/Stop) - dropped forked collector's WebSocket layer; most-stuck → oldest-stuck (FIFO) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
90 lines
5.3 KiB
Markdown
90 lines
5.3 KiB
Markdown
# Decisions & rationale
|
|
|
|
## Naming
|
|
|
|
**Trail Boss** — on a cattle drive, the trail boss is in overall command: sets the direction,
|
|
makes the calls, and rides in when a steer bogs down or strays. The product runs a *herd* of
|
|
agent sessions; when one gets stuck it reports in, and you — the trail boss — ride over and set
|
|
it right. The metaphor maps cleanly onto the mechanism:
|
|
|
|
- **the herd grazing the range** → sessions working autonomously
|
|
- **a steer bogs down or strays** → a `Stop` / `PermissionRequest` hook fires; the collector
|
|
flags the session stuck
|
|
- **the trail boss rides over and sets it right** → you read the context and give the order
|
|
(reply) or wave it on (skip); the queue surfaces stuck sessions oldest-first (flat FIFO, no
|
|
priority ranking)
|
|
|
|
### Names considered and rejected
|
|
|
|
- **`agent-inbox`** — clearest literal description, but collides head-on with
|
|
[`langchain-ai/agent-inbox`](https://github.com/langchain-ai/agent-inbox), an existing
|
|
human-in-the-loop inbox for LangGraph agents. Would read as derivative and lose every search.
|
|
- **`agent-attention`** — names the value prop (your attention is the scarce resource being
|
|
routed), but risks reading as the ML "attention" mechanism.
|
|
- **`agent-central`** — self-explanatory but generic, and "central" reads like a passive
|
|
dashboard/hub rather than an act-on-the-stuck-one tool.
|
|
|
|
Trail Boss keeps a memorable, distinctive identity; the tagline carries the legibility for
|
|
newcomers.
|
|
|
|
## Design decisions
|
|
|
|
### Hooks, not polling
|
|
|
|
Detection is event-driven via Claude Code hooks. A session emits a signal the moment control
|
|
returns to a human: while actively working it emits `PreToolUse`/`PostToolUse`, never `Stop`.
|
|
A session counts as waiting only once `Stop` or `PermissionRequest` has fired and no
|
|
`UserPromptSubmit` has come since. **Confirmed by probe (2026-05-25):** both `SessionStart` and
|
|
`Stop` fire in interactive and `-p` modes, the `Stop` payload carries `last_assistant_message`
|
|
(queue context for free), and hook commands inherit the ambient environment.
|
|
|
|
### Stuck = needs attention, and stuck is stuck
|
|
|
|
A session that has stopped or is waiting at a permission prompt cannot progress until the human
|
|
responds — so it needs intervention by definition. Two collapses follow: there is no
|
|
"finished but fine" state (every stop is a queue item), and there is no permission-vs-stopped
|
|
priority (it doesn't matter *why* it's stuck). `Stop` and `PermissionRequest` are both required
|
|
detection triggers — a permission-blocked session is mid-turn and emits no `Stop` — but they're
|
|
treated identically; `reason` is display-only and the queue is a flat FIFO dead-letter queue.
|
|
`Notification` is dropped (it adds nothing those two miss). The operator simply depletes the
|
|
queue, and the next stuck session auto-loads.
|
|
|
|
### Navigator, not relay (the delivery model)
|
|
|
|
Trail Boss routes *attention*, it does not inject *input*. Sessions stay as long-running live
|
|
CLIs in tmux panes (Model A), and delivery happens by navigating the operator to the live pane
|
|
(`switch-client`/`select-window`/`select-pane`, optionally `link-window` to co-display) where
|
|
they interact with the real prompt directly. This dissolves the send-keys fidelity problem,
|
|
makes "edit before allow" native (you just type), and means no synthesized input ever reaches a
|
|
session.
|
|
|
|
Rejected delivery alternatives:
|
|
|
|
- **Resume-to-deliver** (`claude --resume <id>` in a second process): a live interactive CLI
|
|
holds in-memory state and does not re-read its transcript, so a resumed process's reply never
|
|
reaches the original pane; concurrent attach risks transcript divergence. `--fork-session`
|
|
confirms plain `--resume` reuses the session. Only viable in a no-resident-process model
|
|
(Model B), which we rejected for v1.
|
|
- **`send-keys` relay as the primary path:** retained only as a secondary plain-text option
|
|
(basic submission confirmed working); native interaction is preferred.
|
|
- **`claude --remote-control`:** routes to the claude.ai / desktop / mobile surface, not a local
|
|
channel — useless for a same-host tool.
|
|
- **Agent SDK `canUseTool` (Model B):** programmatic permission gating with `updatedInput` is
|
|
attractive, but requires running sessions under the SDK instead of the terminal — deferred;
|
|
the tmux-navigator model fits the existing workflow and the durability requirement.
|
|
|
|
### Same-host daemon, durable via tmux
|
|
|
|
Trail Boss does not need to live *inside* tmux to drive it — tmux is client/server, so any
|
|
same-user process issues `tmux` commands to the server (pane ids are server-global). The
|
|
control plane is an always-on daemon; presentation is transient (`display-popup` + keybinding).
|
|
But for durability across SSH disconnect the daemon must survive SIGHUP, so it runs **in its own
|
|
tmux window** (simplest) or under **`systemd --user`** (also survives reboot; tmux does not).
|
|
Agents already persist because the tmux server is host-side. While disconnected, the daemon and
|
|
hooks keep running, so the queue accumulates the backlog and disconnecting becomes a non-event.
|
|
|
|
### The transcript is ground truth
|
|
|
|
Hooks are a low-latency notification; the transcript JSONL is authoritative. A reconcile loop
|
|
corrects dropped hook POSTs, daemon restarts, and "answered directly in the pane" by checking
|
|
whether a session's transcript has advanced past its last `Stop`.
|