Add comprehensive documentation for the watchdog timeout mechanism that prevents indefinite hangs in claude-print. The implementation includes:
- PTY first-output timeout (90s default)
- Stream-json first-output timeout (90s default)
- Overall timeout (3600s default)
- Stop hook timeout (120s default)
Each timeout type triggers SIGTERM→SIGKILL child cleanup, writes clear diagnostics to stderr, and exits with code 124.
Co-Authored-By: Claude <noreply@anthropic.com>
5.9 KiB
Watchdog Timeout Implementation (Bead bf-2f5)
Overview
This document describes the comprehensive watchdog timeout mechanism implemented in claude-print to prevent indefinite hangs when the child process wedges.
Implementation Location
- Module: src/watchdog.rs
- Integration: src/session.rs (lines 200-332)
- CLI: src/cli.rs (timeout configuration parameters)
Timeout Types
1. PTY First-Output Timeout (DEFAULT_PTY_TIMEOUT_SECS: 90s)
- Purpose: Detects if child produces no PTY output within deadline
- Detection: Watchdog thread checks pty_output_received flag
- Trigger: Event loop marks flag when first PTY chunk arrives
- Error: TimeoutType::PtyFirstOutput
2. Stream-JSON First-Output Timeout (DEFAULT_STREAM_JSON_TIMEOUT_SECS: 90s)
- Purpose: Detects if child emits no stream-json events within deadline
- Detection: Background thread monitors <temp_dir>/transcript.jsonl for valid JSON
- Trigger: Sets stream_json_output_received when first JSON line detected
- Error: TimeoutType::StreamJsonFirstOutput
3. Overall Timeout (DEFAULT_OVERALL_TIMEOUT_SECS: 3600s)
- Purpose: Maximum session duration
- Detection: Watchdog thread compares elapsed time against deadline
- Trigger: Configured via --timeout CLI flag
- Error: TimeoutType::OverallTimeout
4. Stop Hook Timeout (DEFAULT_STOP_HOOK_TIMEOUT_SECS: 120s)
- Purpose: Detects if Stop hook doesn't fire after prompt injection
- Detection: Watchdog thread measures time since mark_prompt_injected()
- Trigger: Startup sequence calls mark_prompt_injected() when prompt sent
- Error: TimeoutType::StopHookTimeout
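The four conditions above can be read as one decision function over a snapshot of session progress. The following is a minimal sketch under stated assumptions, not the code in src/watchdog.rs: only the `TimeoutType` variant names, the default values, and the field `pty_first_output_timeout_secs` come from this document. The other config field names, the `Progress` struct, the `first_expired` function, and the order in which the conditions are checked are hypothetical.

```rust
use std::time::Duration;

/// Which watchdog condition expired (variant names as listed above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeoutType {
    PtyFirstOutput,
    StreamJsonFirstOutput,
    OverallTimeout,
    StopHookTimeout,
}

/// Deadlines in seconds. Only `pty_first_output_timeout_secs` is a name
/// taken from the document; the other field names are assumptions.
#[derive(Debug, Clone, Copy)]
pub struct WatchdogConfig {
    pub pty_first_output_timeout_secs: u64,
    pub stream_json_timeout_secs: u64,
    pub overall_timeout_secs: u64,
    pub stop_hook_timeout_secs: u64,
}

impl Default for WatchdogConfig {
    fn default() -> Self {
        Self {
            pty_first_output_timeout_secs: 90,
            stream_json_timeout_secs: 90,
            overall_timeout_secs: 3600,
            stop_hook_timeout_secs: 120,
        }
    }
}

/// Hypothetical snapshot of what the watchdog thread would observe.
#[derive(Debug, Clone, Copy)]
pub struct Progress {
    pub elapsed: Duration,
    pub pty_output_received: bool,
    pub stream_json_output_received: bool,
    /// Time since the prompt was injected, if it has been.
    pub since_prompt_injected: Option<Duration>,
    pub stop_hook_fired: bool,
}

/// Returns the first expired condition, or `None` if the session is healthy.
/// The check order here is an assumption made for illustration.
pub fn first_expired(cfg: &WatchdogConfig, p: &Progress) -> Option<TimeoutType> {
    let secs = Duration::from_secs;
    if p.elapsed >= secs(cfg.overall_timeout_secs) {
        return Some(TimeoutType::OverallTimeout);
    }
    if !p.pty_output_received && p.elapsed >= secs(cfg.pty_first_output_timeout_secs) {
        return Some(TimeoutType::PtyFirstOutput);
    }
    if !p.stream_json_output_received && p.elapsed >= secs(cfg.stream_json_timeout_secs) {
        return Some(TimeoutType::StreamJsonFirstOutput);
    }
    if let Some(since) = p.since_prompt_injected {
        if !p.stop_hook_fired && since >= secs(cfg.stop_hook_timeout_secs) {
            return Some(TimeoutType::StopHookTimeout);
        }
    }
    None
}
```

Keeping the decision pure (no clocks or signals inside it) makes each timeout type testable without spawning a child process.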
Timeout Handling Flow
1. Timeout Thread Spawns
// session.rs:220-225
let watchdog = Watchdog::new(watchdog_config, spawner.child_pid, ...);
let _timeout_thread = watchdog.spawn_timeout_thread();
2. Event Loop Signals Progress
// session.rs:260-261 - PTY output
watchdog_state_clone.mark_pty_output();
// session.rs:303-305 - Prompt injection
if current_phase.is_prompt_injected() {
    watchdog_state_clone.mark_prompt_injected();
}
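The progress signals above imply a state object shared between the event loop and the watchdog thread. A minimal sketch of such a state follows, assuming it is shared through an `Arc` (suggested by the name `watchdog_state_clone`, but not confirmed here). The method names `mark_pty_output` and `mark_prompt_injected` come from the excerpt; the struct layout, the getters, and the choice to keep only the first injection timestamp are assumptions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical shared state: the event loop writes, the watchdog reads.
#[derive(Default)]
pub struct WatchdogState {
    pty_output_received: AtomicBool,
    prompt_injected_at: Mutex<Option<Instant>>,
}

impl WatchdogState {
    /// Called by the event loop when the first PTY chunk arrives.
    pub fn mark_pty_output(&self) {
        self.pty_output_received.store(true, Ordering::SeqCst);
    }

    /// Called once the prompt has been sent; later calls keep the
    /// original timestamp so the stop-hook clock is not reset.
    pub fn mark_prompt_injected(&self) {
        let mut slot = self.prompt_injected_at.lock().unwrap();
        if slot.is_none() {
            *slot = Some(Instant::now());
        }
    }

    /// Read by the watchdog thread for the PTY first-output check.
    pub fn pty_output_received(&self) -> bool {
        self.pty_output_received.load(Ordering::SeqCst)
    }

    /// Read by the watchdog thread for the stop-hook check.
    pub fn since_prompt_injected(&self) -> Option<Duration> {
        let slot = *self.prompt_injected_at.lock().unwrap();
        slot.map(|t| t.elapsed())
    }
}
```

In use, the session would wrap one `WatchdogState` in `std::sync::Arc`, hand a clone to the watchdog thread, and call the `mark_*` methods from the event loop.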
3. Timeout Fires
// watchdog.rs:288-299 - PTY timeout example
if elapsed >= Duration::from_secs(config.pty_first_output_timeout_secs) {
    let _ = nix::sys::signal::kill(child_pid, nix::sys::signal::Signal::SIGTERM);
    timeout_fired.store(true, Ordering::SeqCst);
    // Signal event loop via self-pipe
    unsafe {
        let byte: [u8; 1] = [1];
        let _ = libc::write(fd, byte.as_ptr() as *const libc::c_void, 1);
    }
    return;
}
4. Event Loop Exits and Checks Timeout
// session.rs:322-332
if watchdog_state.has_timeout_fired() {
    let timeout_type = watchdog_state.get_timeout_type().unwrap();
    eprintln!("claude-print: {}", timeout_type.description());
    eprintln!("claude-print: sending SIGTERM to child pid {}", spawner.child_pid);
    kill_child(spawner.child_pid);
    // `timeout_msg` is not defined in this excerpt; presumably it is set
    // in lines of session.rs that are elided here.
    return Err(Error::Timeout(timeout_msg.to_string()));
}
5. Child Cleanup (SIGTERM → SIGKILL)
// session.rs:399-419
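fn kill_child(pid: nix::unistd::Pid) {
    // Ask the child to exit, then allow a 2-second grace period.
    let _ = nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGTERM);
    let deadline = Instant::now() + Duration::from_secs(2);
    loop {
        // Non-blocking wait: reaps the child if it has already exited.
        match nix::sys::wait::waitpid(pid, Some(WaitPidFlag::WNOHANG)) {
            Ok(WaitStatus::StillAlive) => {
                if Instant::now() >= deadline {
                    // Grace period over: escalate to SIGKILL.
                    let _ = nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGKILL);
                    return;
                }
                thread::sleep(Duration::from_millis(50));
            }
            // Child exited (or waitpid failed): nothing left to do.
            _ => return,
        }
    }
}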
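The escalation pattern in `kill_child` can be separated from the signal calls so it is testable without a real process. The sketch below is illustrative and is not the code in session.rs: `terminate_with_grace`, the `Shutdown` enum, and the closure parameters are hypothetical names standing in for the `nix` calls shown above.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// How the child ended up going away (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Shutdown {
    ExitedAfterTerm,
    Killed,
}

/// Polite-then-forceful termination: send the soft signal, poll until the
/// grace period ends, then send the hard signal. The closures stand in for
/// SIGTERM, a non-blocking waitpid, and SIGKILL respectively.
pub fn terminate_with_grace(
    mut send_term: impl FnMut(),
    mut has_exited: impl FnMut() -> bool,
    mut send_kill: impl FnMut(),
    grace: Duration,
    poll_interval: Duration,
) -> Shutdown {
    send_term();
    let deadline = Instant::now() + grace;
    loop {
        if has_exited() {
            return Shutdown::ExitedAfterTerm;
        }
        if Instant::now() >= deadline {
            send_kill();
            return Shutdown::Killed;
        }
        thread::sleep(poll_interval);
    }
}
```

With the values from the excerpt, `grace` would be 2 seconds and `poll_interval` 50 milliseconds.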
6. Main Returns Error
// main.rs:202-212
Err(Error::Timeout(_msg)) => {
    let _ = emit_error(..., &ClaudePrintError::Timeout, ...);
    exit_with_cleanup(ClaudePrintError::Timeout.exit_code()); // 124
}
7. Cleanup Happens
// session.rs:55-75
pub fn cleanup_temp_dir() {
    if let Some(path) = TEMP_DIR_PATH.get() {
        let _ = std::fs::remove_file(&path.join("stop.fifo"));
        let _ = std::fs::remove_dir_all(path);
    }
}
Exit Codes
- Timeout: 124 (GNU timeout convention)
- Interrupted: 130 (128 + SIGINT)
- Setup errors: 2
- Assistant errors: 1
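The mapping above could be expressed as a single `exit_code()` method. In this sketch, `ClaudePrintError::Timeout` and `exit_code()` are names that appear in the main.rs excerpt; the other variant names are assumptions chosen to match the list.

```rust
/// Error categories and their process exit codes. Only `Timeout` is a
/// variant name confirmed by the document; the rest are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ClaudePrintError {
    Timeout,
    Interrupted,
    Setup,
    Assistant,
}

impl ClaudePrintError {
    pub fn exit_code(self) -> i32 {
        match self {
            ClaudePrintError::Timeout => 124,     // GNU timeout convention
            ClaudePrintError::Interrupted => 130, // 128 + SIGINT (signal 2)
            ClaudePrintError::Setup => 2,
            ClaudePrintError::Assistant => 1,
        }
    }
}
```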
CLI Configuration
--timeout <seconds> # Overall timeout (default: 3600)
--first-output-timeout <seconds> # PTY first-output (default: 90)
--stream-json-timeout <seconds> # Stream-json first-output (default: 90)
--stop-hook-timeout <seconds> # Stop hook watchdog (default: 120)
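To illustrate how these flags map onto the defaults, here is a small hand-rolled parser. It is a stand-in for illustration only: how src/cli.rs actually parses arguments is not shown in this document, and `TimeoutArgs` and `parse_timeout_args` are hypothetical names. Only the flag names and default values are taken from the list above.

```rust
/// Timeout values in seconds, one per CLI flag (hypothetical struct).
#[derive(Debug, PartialEq, Eq)]
pub struct TimeoutArgs {
    pub overall: u64,
    pub first_output: u64,
    pub stream_json: u64,
    pub stop_hook: u64,
}

impl Default for TimeoutArgs {
    fn default() -> Self {
        Self { overall: 3600, first_output: 90, stream_json: 90, stop_hook: 120 }
    }
}

/// Parses `--flag <seconds>` pairs, starting from the documented defaults.
pub fn parse_timeout_args<I, S>(args: I) -> Result<TimeoutArgs, String>
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    let mut out = TimeoutArgs::default();
    let mut iter = args.into_iter();
    while let Some(flag) = iter.next() {
        // Pick the field this flag overrides.
        let slot = match flag.as_ref() {
            "--timeout" => &mut out.overall,
            "--first-output-timeout" => &mut out.first_output,
            "--stream-json-timeout" => &mut out.stream_json,
            "--stop-hook-timeout" => &mut out.stop_hook,
            other => return Err(format!("unknown flag: {}", other)),
        };
        let value = iter
            .next()
            .ok_or_else(|| format!("missing value for {}", flag.as_ref()))?;
        *slot = value
            .as_ref()
            .parse::<u64>()
            .map_err(|e| format!("invalid value for {}: {}", flag.as_ref(), e))?;
    }
    Ok(out)
}
```

For example, passing `["--timeout", "600"]` would lower the overall deadline to 10 minutes while leaving the other three at their defaults.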
Integration Tests
See tests/watchdog.rs:
- watchdog_silent_child_times_out_with_cleanup: Verifies 2-second timeout fires cleanly
- watchdog_one_second_timeout_fires_cleanly: Verifies 1-second timeout fires quickly
Self-Pipe Signaling
The watchdog uses a self-pipe to wake the event loop when a timeout fires:
- Watchdog writes byte [1] to self-pipe write end
- Event loop wakes from poll() with POLLIN on self-pipe read end
- Event loop exits normally
- Session checks watchdog_state.has_timeout_fired() and returns timeout error
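The wake-up sequence above can be sketched with the standard library alone. Note the substitutions: this example uses `UnixStream::pair()` in place of a raw pipe and a blocking `read_exact` in place of `poll()`, so it shows the ordering (set flag, write byte, reader wakes, reader checks flag) rather than the actual event loop. It is Unix-only, and `wait_for_watchdog_wakeup` is a hypothetical name.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawns a stand-in watchdog that "fires" after `delay`, then blocks the
/// caller until the wake-up byte arrives. Returns whether the wake-up byte
/// was 1 and the timeout flag was set by the time the reader woke.
pub fn wait_for_watchdog_wakeup(delay: Duration) -> std::io::Result<bool> {
    let (mut read_end, mut write_end) = UnixStream::pair()?;
    let timeout_fired = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&timeout_fired);

    let watchdog = thread::spawn(move || {
        thread::sleep(delay);
        // Set the flag before writing, so the reader always sees it.
        flag.store(true, Ordering::SeqCst);
        let _ = write_end.write_all(&[1]);
    });

    // Stand-in for the event loop waking up on the read end.
    let mut byte = [0u8; 1];
    read_end.read_exact(&mut byte)?;
    watchdog.join().expect("watchdog thread panicked");

    Ok(byte[0] == 1 && timeout_fired.load(Ordering::SeqCst))
}
```

Storing the flag before writing the byte mirrors the order in the watchdog.rs excerpt, where `timeout_fired.store(...)` precedes the `libc::write` call.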
Stream-JSON Monitoring
A background thread monitors the transcript file:
fn spawn_stream_json_monitor_in_dir(temp_dir: PathBuf, ...) {
    thread::spawn(move || {
        let transcript_path = temp_dir.join("transcript.jsonl");
        loop {
            if output_received.load(Ordering::SeqCst) { return; }
            // Check file growth and parse JSON lines
            // Set flag when first valid JSON found
            thread::sleep(Duration::from_millis(100));
        }
    })
}
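The elided "check file growth and parse JSON lines" step might look roughly like the function below, which a monitor thread could call on each 100 ms tick. This is a sketch under assumptions: the real monitor presumably uses a proper JSON parser, whereas `looks_like_json_object` is a crude brace heuristic used here only to keep the example dependency-free. Both function names are hypothetical.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

/// Crude stand-in for real JSON validation: a non-empty line that starts
/// with '{' and ends with '}'. A real monitor would parse the line.
fn looks_like_json_object(line: &str) -> bool {
    let t = line.trim();
    t.len() >= 2 && t.starts_with('{') && t.ends_with('}')
}

/// Returns true once the transcript contains at least one line that passes
/// the heuristic. A missing or unreadable file counts as "no output yet".
pub fn transcript_has_json_line(path: &Path) -> bool {
    let file = match File::open(path) {
        Ok(f) => f,
        Err(_) => return false,
    };
    BufReader::new(file)
        .lines()
        .map_while(Result::ok) // stop at the first read error
        .any(|line| looks_like_json_object(&line))
}
```

Re-reading the whole file on every tick is simple but wasteful for large transcripts; tracking the last read offset would avoid that at the cost of more state.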
Summary
The watchdog prevents indefinite hangs by:
- Monitoring four independent timeout conditions
- Sending SIGTERM → SIGKILL to child process
- Writing clear diagnostics to stderr
- Tearing down temp resources via CleanupGuard
- Exiting non-zero (124) so caller can retry cleanly