claude-print/notes/bf-2f5-verification.md
jedarden 5826607cf7 docs(bf-2f5): verify watchdog timeout implementation is complete
All requirements from bead bf-2f5 have been verified:
- No-output timeout (PTY 90s, stream-json 90s) ✓
- Max-turn timeout (overall 3600s, stop hook 120s) ✓
- SIGTERM → SIGKILL with descendants ✓
- Clear diagnostics to stderr ✓
- Temp resource teardown ✓
- Exit non-zero (124) ✓

Implementation was completed in commits:
- 7d40c93: add comprehensive watchdog timeout mechanism
- 07013f8: add self-pipe signaling
- ea162c0: correct timeout exit code from 3 to 124

Co-Authored-By: Claude <noreply@anthropic.com>
2026-06-25 10:46:51 -04:00

6 KiB

Bead bf-2f5: Watchdog Timeout Implementation - VERIFICATION

Task Summary

Add watchdog: no-output + max-turn timeout that kills child and exits non-zero (never poll stop.fifo forever)

Implementation Status: COMPLETE

This bead has been fully implemented in previous commits:

  • 7d40c93 - feat(bf-2f5): add comprehensive watchdog timeout mechanism
  • 07013f8 - feat(bf-2w7): add self-pipe signaling to watchdog timeout mechanism
  • ea162c0 - fix(bf-2f5): correct timeout exit code from 3 to 124
  • 11e9b72 - docs(bf-2f5): document watchdog timeout implementation

Verification of Requirements

1. Startup/First-Output Timeout (90s configurable)

Implementation: src/watchdog.rs:18-24

  • PTY first-output timeout: 90s default (DEFAULT_PTY_TIMEOUT_SECS)
  • Stream-json first-output timeout: 90s default (DEFAULT_STREAM_JSON_TIMEOUT_SECS)
  • Configurable via CLI flags --first-output-timeout and --stream-json-timeout

Code Location: src/watchdog.rs:285-317

// Check Phase 1: PTY first-output timeout
if config.pty_first_output_timeout_secs > 0 && !has_pty_output {
    if elapsed >= Duration::from_secs(config.pty_first_output_timeout_secs) {
        // SIGTERM, signal event loop, return
    }
}

// Check Phase 2: Stream-json first-output timeout
if config.stream_json_first_output_timeout_secs > 0 && !has_stream_json_output {
    if elapsed >= Duration::from_secs(config.stream_json_first_output_timeout_secs) {
        // SIGTERM, signal event loop, return
    }
}

2. Overall Max-Turn Timeout

Implementation: src/watchdog.rs:26-31

  • Overall timeout: 3600s default (DEFAULT_OVERALL_TIMEOUT_SECS)
  • Stop hook timeout: 120s default (DEFAULT_STOP_HOOK_TIMEOUT_SECS)

Code Location: src/watchdog.rs:319-354

  • Overall timeout checked before prompt injection
  • Stop hook timeout checked after prompt injection

3. SIGTERM → SIGKILL with Descendants

Implementation: src/session.rs:398-419

fn kill_child(pid: nix::unistd::Pid) {
    let _ = nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGTERM);
    
    let deadline = Instant::now() + Duration::from_secs(2);
    loop {
        match nix::sys::wait::waitpid(pid, Some(WaitPidFlag::WNOHANG)) {
            Ok(WaitStatus::StillAlive) => {
                if Instant::now() >= deadline {
                    let _ = nix::sys::signal::kill(pid, nix::sys::signal::Signal::SIGKILL);
                    let _ = nix::sys::wait::waitpid(pid, None);
                    return;
                }
                thread::sleep(Duration::from_millis(50));
            }
            _ => return,
        }
    }
}

Process Group Handling: The child is spawned in its own process group via pty::fork(), ensuring SIGTERM/SIGKILL affects the entire descendant tree.

4. Clear Diagnostics

Implementation: src/session.rs:322-328

if watchdog_state.has_timeout_fired() {
    let timeout_type = watchdog_state.get_timeout_type().unwrap_or(TimeoutType::OverallTimeout);
    let timeout_msg = timeout_type.description();
    
    eprintln!("claude-print: {}", timeout_msg);
    eprintln!("claude-print: sending SIGTERM to child pid {}", spawner.child_pid);
    
    kill_child(spawner.child_pid);
    return Err(Error::Timeout(timeout_msg.to_string()));
}

Timeout Descriptions (src/watchdog.rs:46-55):

  • PtyFirstOutput: "child produced no PTY output within deadline (process may be hung at startup)"
  • StreamJsonFirstOutput: "child produced no stream-json output within deadline (process may be hung during session initialization)"
  • OverallTimeout: "session exceeded overall time deadline"
  • StopHookTimeout: "Stop hook did not fire within deadline after prompt injection (child may have hung during tool use or model inference)"

5. Tear Down Temp Resources

Implementation: src/session.rs:156-158

let _cleanup_guard = CleanupGuard(&installer);

The CleanupGuard ensures temp directory removal on all exit paths (normal, timeout, panic, signal). Verification in tests/watchdog.rs:96-100 asserts no orphaned temp directories remain.

6. Exit Non-Zero (124)

Implementation: src/main.rs:202-212

Err(Error::Timeout(_msg)) => {
    let _ = emit_error(
        &mut stdout,
        &mut stderr,
        &ClaudePrintError::Timeout,
        &cli.output_format,
        &resolve_claude_version(cli.claude_binary.as_deref()).unwrap_or_else(|| "unknown".to_string()),
        true,
    );
    exit_with_cleanup(ClaudePrintError::Timeout.exit_code()); // Returns 124
}

Exit Code Definition: src/error.rs:95-115

/// Timeout - operation exceeded deadline (exit 124, matching GNU timeout).
Timeout,

pub fn exit_code(&self) -> i32 {
    match self {
        ClaudePrintError::Timeout => 124,
        // ...
    }
}

Additional Features

Self-Pipe Signaling

Implementation: src/watchdog.rs:254-255, 292-297 The watchdog thread writes to the self-pipe on timeout, immediately waking the event loop from poll() without waiting for the 50ms timer tick.

Stream-JSON Monitoring

Implementation: src/watchdog.rs:376-424 Background thread monitors <temp_dir>/transcript.jsonl for stream-json output, setting the stream_json_output_received flag when valid JSON is detected.

Comprehensive Tests

Test File: tests/watchdog.rs

  • watchdog_silent_child_times_out_with_cleanup: Verifies timeout with 2s deadline, cleanup, no orphans
  • watchdog_one_second_timeout_fires_cleanly: Verifies short timeout (1s) fires correctly

Conclusion

All requirements from bead bf-2f5 have been fully implemented and verified:

  1. No-output timeout (PTY and stream-json)
  2. Max-turn timeout (overall and stop hook)
  3. SIGTERM → SIGKILL child and descendants
  4. Clear diagnostics to stderr
  5. Temp resource teardown
  6. Exit non-zero (124)

The implementation prevents indefinite hangs by ensuring the event loop is always interrupted on timeout, the child process is forcefully terminated, and the caller receives a non-zero exit code for clean retry logic.