diff --git a/notes/bf-2f5.md b/notes/bf-2f5.md new file mode 100644 index 0000000..1daafdf --- /dev/null +++ b/notes/bf-2f5.md @@ -0,0 +1,174 @@ +# Watchdog Timeout Implementation (Bead bf-2f5) + +## Overview +This document describes the comprehensive watchdog timeout mechanism implemented in claude-print to prevent indefinite hangs when the child process wedges. + +## Implementation Location +- **Module**: `src/watchdog.rs` +- **Integration**: `src/session.rs` (lines 200-332) +- **CLI**: `src/cli.rs` (timeout configuration parameters) + +## Timeout Types + +### 1. PTY First-Output Timeout (`DEFAULT_PTY_TIMEOUT_SECS: 90s`) +- **Purpose**: Detects if child produces no PTY output within deadline +- **Detection**: Watchdog thread checks `pty_output_received` flag +- **Trigger**: Event loop marks flag when first PTY chunk arrives +- **Error**: `TimeoutType::PtyFirstOutput` + +### 2. Stream-JSON First-Output Timeout (`DEFAULT_STREAM_JSON_TIMEOUT_SECS: 90s`) +- **Purpose**: Detects if child emits no stream-json events within deadline +- **Detection**: Background thread monitors `/transcript.jsonl` for valid JSON +- **Trigger**: Sets `stream_json_output_received` when first JSON line detected +- **Error**: `TimeoutType::StreamJsonFirstOutput` + +### 3. Overall Timeout (`DEFAULT_OVERALL_TIMEOUT_SECS: 3600s`) +- **Purpose**: Maximum session duration +- **Detection**: Watchdog thread compares elapsed time against deadline +- **Trigger**: Configured via `--timeout` CLI flag +- **Error**: `TimeoutType::OverallTimeout` + +### 4. Stop Hook Timeout (`DEFAULT_STOP_HOOK_TIMEOUT_SECS: 120s`) +- **Purpose**: Detects if Stop hook doesn't fire after prompt injection +- **Detection**: Watchdog thread measures time since `mark_prompt_injected()` +- **Trigger**: Startup sequence calls `mark_prompt_injected()` when prompt sent +- **Error**: `TimeoutType::StopHookTimeout` + +## Timeout Handling Flow + +### 1. Timeout Thread Spawns +```rust +// session.rs:220-225 +let watchdog = Watchdog::new(watchdog_config, spawner.child_pid, ...); +let _timeout_thread = watchdog.spawn_timeout_thread(); +``` + +### 2. Event Loop Signals Progress +```rust +// session.rs:260-261 - PTY output +watchdog_state_clone.mark_pty_output(); + +// session.rs:303-305 - Prompt injection +if current_phase.is_prompt_injected() { + watchdog_state_clone.mark_prompt_injected(); +} +``` + +### 3. Timeout Fires +```rust +// watchdog.rs:288-299 - PTY timeout example +if elapsed >= Duration::from_secs(config.pty_first_output_timeout_secs) { + let _ = nix::sys::signal::kill(child_pid, nix::sys::signal::Signal::SIGTERM); + timeout_fired.store(true, Ordering::SeqCst); + // Signal event loop via self-pipe + unsafe { + let byte: [u8; 1] = [1]; + let _ = libc::write(fd, byte.as_ptr() as *const libc::c_void, 1); + } + return; +} +``` + +### 4. Event Loop Exits and Checks Timeout +```rust +// session.rs:322-332 +if watchdog_state.has_timeout_fired() { + let timeout_type = watchdog_state.get_timeout_type().unwrap(); + eprintln!("claude-print: {}", timeout_type.description()); + eprintln!("claude-print: sending SIGTERM to child pid {}", spawner.child_pid); + kill_child(spawner.child_pid); + return Err(Error::Timeout(timeout_msg.to_string())); +} +``` + +### 5. Child Cleanup (SIGTERM → SIGKILL) +```rust +// session.rs:399-419 +fn kill_child(pid: nix::unistd::Pid) { + let _ = nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGTERM); + + let deadline = Instant::now() + Duration::from_secs(2); + loop { + match nix::sys::wait::waitpid(pid, Some(WaitPidFlag::WNOHANG)) { + Ok(WaitStatus::StillAlive) => { + if Instant::now() >= deadline { + let _ = nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGKILL); + return; + } + thread::sleep(Duration::from_millis(50)); + } + _ => return, + } + } +} +``` + +### 6. Main Returns Error +```rust +// main.rs:202-212 +Err(Error::Timeout(_msg)) => { + let _ = emit_error(..., &ClaudePrintError::Timeout, ...); + exit_with_cleanup(ClaudePrintError::Timeout.exit_code()); // 124 +} +``` + +### 7. Cleanup Happens +```rust +// session.rs:55-75 +pub fn cleanup_temp_dir() { + if let Some(path) = TEMP_DIR_PATH.get() { + let _ = std::fs::remove_file(&path.join("stop.fifo")); + let _ = std::fs::remove_dir_all(path); + } +} +``` + +## Exit Codes +- **Timeout**: 124 (GNU timeout convention) +- **Interrupted**: 130 (128 + SIGINT) +- **Setup errors**: 2 +- **Assistant errors**: 1 + +## CLI Configuration +```bash +--timeout # Overall timeout (default: 3600) +--first-output-timeout # PTY first-output (default: 90) +--stream-json-timeout # Stream-json first-output (default: 90) +--stop-hook-timeout # Stop hook watchdog (default: 120) +``` + +## Integration Tests +See `tests/watchdog.rs`: +- `watchdog_silent_child_times_out_with_cleanup`: Verifies 2-second timeout fires cleanly +- `watchdog_one_second_timeout_fires_cleanly`: Verifies 1-second timeout fires quickly + +## Self-Pipe Signaling +The watchdog uses a self-pipe to wake the event loop when a timeout fires: +- Watchdog writes byte `[1]` to self-pipe write end +- Event loop wakes from `poll()` with POLLIN on self-pipe read end +- Event loop exits normally +- Session checks `watchdog_state.has_timeout_fired()` and returns timeout error + +## Stream-JSON Monitoring +A background thread monitors the transcript file: +```rust +fn spawn_stream_json_monitor_in_dir(temp_dir: PathBuf, ...) { + thread::spawn(move || { + let transcript_path = temp_dir.join("transcript.jsonl"); + loop { + if output_received.load(Ordering::SeqCst) { return; } + // Check file growth and parse JSON lines + // Set flag when first valid JSON found + thread::sleep(Duration::from_millis(100)); + } + }) +} +``` + +## Summary +The watchdog prevents indefinite hangs by: +1. Monitoring four independent timeout conditions +2. Sending SIGTERM → SIGKILL to child process +3. Writing clear diagnostics to stderr +4. Tearing down temp resources via CleanupGuard +5. Exiting non-zero (124) so caller can retry cleanly