docs(bf-2f5): document watchdog timeout implementation
Add comprehensive documentation for the watchdog timeout mechanism that prevents indefinite hangs in claude-print. The implementation includes: - PTY first-output timeout (90s default) - Stream-json first-output timeout (90s default) - Overall timeout (3600s default) - Stop hook timeout (120s default) Each timeout type triggers SIGTERM→SIGKILL child cleanup, writes clear diagnostics to stderr, and exits with code 124. Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
parent
a19e2b0aed
commit
11e9b72967
1 changed files with 174 additions and 0 deletions
174
notes/bf-2f5.md
Normal file
174
notes/bf-2f5.md
Normal file
|
|
@ -0,0 +1,174 @@
|
|||
# Watchdog Timeout Implementation (Bead bf-2f5)
|
||||
|
||||
## Overview
|
||||
This document describes the comprehensive watchdog timeout mechanism implemented in claude-print to prevent indefinite hangs when the child process wedges.
|
||||
|
||||
## Implementation Location
|
||||
- **Module**: `src/watchdog.rs`
|
||||
- **Integration**: `src/session.rs` (lines 200-332)
|
||||
- **CLI**: `src/cli.rs` (timeout configuration parameters)
|
||||
|
||||
## Timeout Types
|
||||
|
||||
### 1. PTY First-Output Timeout (`DEFAULT_PTY_TIMEOUT_SECS: 90s`)
|
||||
- **Purpose**: Detects if child produces no PTY output within deadline
|
||||
- **Detection**: Watchdog thread checks `pty_output_received` flag
|
||||
- **Trigger**: Event loop marks flag when first PTY chunk arrives
|
||||
- **Error**: `TimeoutType::PtyFirstOutput`
|
||||
|
||||
### 2. Stream-JSON First-Output Timeout (`DEFAULT_STREAM_JSON_TIMEOUT_SECS: 90s`)
|
||||
- **Purpose**: Detects if child emits no stream-json events within deadline
|
||||
- **Detection**: Background thread monitors `<temp_dir>/transcript.jsonl` for valid JSON
|
||||
- **Trigger**: Sets `stream_json_output_received` when first JSON line detected
|
||||
- **Error**: `TimeoutType::StreamJsonFirstOutput`
|
||||
|
||||
### 3. Overall Timeout (`DEFAULT_OVERALL_TIMEOUT_SECS: 3600s`)
|
||||
- **Purpose**: Maximum session duration
|
||||
- **Detection**: Watchdog thread compares elapsed time against deadline
|
||||
- **Trigger**: Configured via `--timeout` CLI flag
|
||||
- **Error**: `TimeoutType::OverallTimeout`
|
||||
|
||||
### 4. Stop Hook Timeout (`DEFAULT_STOP_HOOK_TIMEOUT_SECS: 120s`)
|
||||
- **Purpose**: Detects if Stop hook doesn't fire after prompt injection
|
||||
- **Detection**: Watchdog thread measures time since `mark_prompt_injected()`
|
||||
- **Trigger**: Startup sequence calls `mark_prompt_injected()` when prompt sent
|
||||
- **Error**: `TimeoutType::StopHookTimeout`
|
||||
|
||||
## Timeout Handling Flow
|
||||
|
||||
### 1. Timeout Thread Spawns
|
||||
```rust
|
||||
// session.rs:220-225
|
||||
let watchdog = Watchdog::new(watchdog_config, spawner.child_pid, ...);
|
||||
let _timeout_thread = watchdog.spawn_timeout_thread();
|
||||
```
|
||||
|
||||
### 2. Event Loop Signals Progress
|
||||
```rust
|
||||
// session.rs:260-261 - PTY output
|
||||
watchdog_state_clone.mark_pty_output();
|
||||
|
||||
// session.rs:303-305 - Prompt injection
|
||||
if current_phase.is_prompt_injected() {
|
||||
watchdog_state_clone.mark_prompt_injected();
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Timeout Fires
|
||||
```rust
|
||||
// watchdog.rs:288-299 - PTY timeout example
|
||||
if elapsed >= Duration::from_secs(config.pty_first_output_timeout_secs) {
|
||||
let _ = nix::sys::signal::kill(child_pid, nix::sys::signal::Signal::SIGTERM);
|
||||
timeout_fired.store(true, Ordering::SeqCst);
|
||||
// Signal event loop via self-pipe
|
||||
unsafe {
|
||||
let byte: [u8; 1] = [1];
|
||||
let _ = libc::write(fd, byte.as_ptr() as *const libc::c_void, 1);
|
||||
}
|
||||
return;
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Event Loop Exits and Checks Timeout
|
||||
```rust
|
||||
// session.rs:322-332
|
||||
if watchdog_state.has_timeout_fired() {
|
||||
let timeout_type = watchdog_state.get_timeout_type().unwrap();
|
||||
eprintln!("claude-print: {}", timeout_type.description());
|
||||
eprintln!("claude-print: sending SIGTERM to child pid {}", spawner.child_pid);
|
||||
kill_child(spawner.child_pid);
|
||||
return Err(Error::Timeout(timeout_msg.to_string()));
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Child Cleanup (SIGTERM → SIGKILL)
|
||||
```rust
|
||||
// session.rs:399-419
|
||||
fn kill_child(pid: nix::unistd::Pid) {
|
||||
let _ = nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGTERM);
|
||||
|
||||
let deadline = Instant::now() + Duration::from_secs(2);
|
||||
loop {
|
||||
match nix::sys::wait::waitpid(pid, Some(WaitPidFlag::WNOHANG)) {
|
||||
Ok(WaitStatus::StillAlive) => {
|
||||
if Instant::now() >= deadline {
|
||||
let _ = nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGKILL);
|
||||
return;
|
||||
}
|
||||
thread::sleep(Duration::from_millis(50));
|
||||
}
|
||||
_ => return,
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 6. Main Returns Error
|
||||
```rust
|
||||
// main.rs:202-212
|
||||
Err(Error::Timeout(_msg)) => {
|
||||
let _ = emit_error(..., &ClaudePrintError::Timeout, ...);
|
||||
exit_with_cleanup(ClaudePrintError::Timeout.exit_code()); // 124
|
||||
}
|
||||
```
|
||||
|
||||
### 7. Cleanup Happens
|
||||
```rust
|
||||
// session.rs:55-75
|
||||
pub fn cleanup_temp_dir() {
|
||||
if let Some(path) = TEMP_DIR_PATH.get() {
|
||||
let _ = std::fs::remove_file(&path.join("stop.fifo"));
|
||||
let _ = std::fs::remove_dir_all(path);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Exit Codes
|
||||
- **Timeout**: 124 (GNU timeout convention)
|
||||
- **Interrupted**: 130 (128 + SIGINT)
|
||||
- **Setup errors**: 2
|
||||
- **Assistant errors**: 1
|
||||
|
||||
## CLI Configuration
|
||||
```bash
|
||||
--timeout <seconds> # Overall timeout (default: 3600)
|
||||
--first-output-timeout <seconds> # PTY first-output (default: 90)
|
||||
--stream-json-timeout <seconds> # Stream-json first-output (default: 90)
|
||||
--stop-hook-timeout <seconds> # Stop hook watchdog (default: 120)
|
||||
```
|
||||
|
||||
## Integration Tests
|
||||
See `tests/watchdog.rs`:
|
||||
- `watchdog_silent_child_times_out_with_cleanup`: Verifies 2-second timeout fires cleanly
|
||||
- `watchdog_one_second_timeout_fires_cleanly`: Verifies 1-second timeout fires quickly
|
||||
|
||||
## Self-Pipe Signaling
|
||||
The watchdog uses a self-pipe to wake the event loop when a timeout fires:
|
||||
- Watchdog writes byte `[1]` to self-pipe write end
|
||||
- Event loop wakes from `poll()` with POLLIN on self-pipe read end
|
||||
- Event loop exits normally
|
||||
- Session checks `watchdog_state.has_timeout_fired()` and returns timeout error
|
||||
|
||||
## Stream-JSON Monitoring
|
||||
A background thread monitors the transcript file:
|
||||
```rust
|
||||
fn spawn_stream_json_monitor_in_dir(temp_dir: PathBuf, ...) {
|
||||
thread::spawn(move || {
|
||||
let transcript_path = temp_dir.join("transcript.jsonl");
|
||||
loop {
|
||||
if output_received.load(Ordering::SeqCst) { return; }
|
||||
// Check file growth and parse JSON lines
|
||||
// Set flag when first valid JSON found
|
||||
thread::sleep(Duration::from_millis(100));
|
||||
}
|
||||
})
|
||||
}
|
||||
```
|
||||
|
||||
## Summary
|
||||
The watchdog prevents indefinite hangs by:
|
||||
1. Monitoring four independent timeout conditions
|
||||
2. Sending SIGTERM → SIGKILL to child process
|
||||
3. Writing clear diagnostics to stderr
|
||||
4. Tearing down temp resources via CleanupGuard
|
||||
5. Exiting non-zero (124) so caller can retry cleanly
|
||||
Loading…
Add table
Reference in a new issue