diff --git a/notes/pdftract-4em4l.md b/notes/pdftract-4em4l.md index 608d063..a1d7997 100644 --- a/notes/pdftract-4em4l.md +++ b/notes/pdftract-4em4l.md @@ -2,121 +2,110 @@ ## Summary -The `--audit-log FILE` flag and audit logging infrastructure was already fully implemented in the codebase. This note documents the verification of the acceptance criteria. +The audit logging infrastructure is **fully implemented** across all public-listener subcommands. This verification confirms that all requirements from the task description are met. -## Acceptance Criteria Status +## Implementation Status -### ✅ --audit-log FILE flag implemented -- **serve**: `crates/pdftract-cli/src/main.rs:322` - AuditLogWriter passed to serve::run -- **mcp**: `crates/pdftract-cli/src/main.rs:385` - AuditLogWriter passed to mcp::run_stdio/run_http -- **inspect**: `crates/pdftract-cli/src/inspect/args.rs:44-62` - audit_log field on InspectArgs +### 1. Core Audit Infrastructure ✅ -### ✅ Per-request NDJSON line with all documented fields -**AuditRecord schema** (`crates/pdftract-core/src/audit.rs:33-52`): -- `ts`: ISO-8601 RFC3339 UTC timestamp -- `client_ip`: Option - HTTP peer or stdio (absent) -- `tool`: String - Tool name (extract, classify, mcp.extract, etc.) -- `fingerprint`: Option - PDF structural fingerprint (pdftract-v1:hex) -- `duration_ms`: u64 - Request duration in milliseconds -- `status`: u16 - HTTP-style status code (200 ok, 4xx client error, 5xx server error) -- `diagnostics`: Vec - Diagnostic codes (XREF_REPAIRED, STREAM_BOMB, etc.) +**File: `crates/pdftract-core/src/audit.rs`** +- `AuditLogWriter`: Thread-safe NDJSON writer with crash-safe writes + - Single atomic write per line (BufWriter with flush) + - Supports stdout (`-`), stderr (`/dev/stderr`), and file paths +- `AuditRecord`: Schema with all required fields: + - `ts`: ISO-8601 RFC3339 UTC timestamp + - `client_ip`: Optional string (absent for stdio MCP) + - `tool`: Tool name + - `fingerprint`: Structural fingerprint (pdftract-v1:hex form) + - `duration_ms`: Request duration in milliseconds + - `status`: HTTP-style status code (200, 4xx, 5xx) + - `diagnostics`: Vec of diagnostic code strings -### ✅ Stdio MCP requests OMIT the client_ip field -**Implementation**: `crates/pdftract-cli/src/mcp/stdio.rs:364,476` -```rust -None, // No client_ip for stdio mode -``` +### 2. NEVER-Log Policy Enforcement ✅ -### ✅ Log-policy enforcement (compile-time) -**CI Gate**: `.ci/scripts/check-log-policy.sh` -- Scans for credential patterns in log/println/eprintln calls -- Checks for password, token, secret, api_key patterns -- Checks for content leaks (body, content, text, data) -- Runs in CI to prevent violations from being merged +**File: `crates/pdftract-core/src/log_policy.rs`** +- `redact_log_line()`: Runtime redaction for free-form logs +- `redact_audit_log_line()`: Specialized redaction for audit logs (preserves JSON structure) +- `is_sensitive_header()`: Checks for sensitive headers +- `redact_header_value()`: Redacts header values +- Patterns detected: + - Passwords: `password:`, `pwd:`, `pass:`, `passwd:` + - Tokens: `token:`, `bearer`, `api_key`, `secret` + - Headers: Authorization, Cookie, Proxy-Authorization + - Base64-like strings (JWT tokens, API keys) + - Long words (>100 chars) truncated in free-form logs -### ✅ Log-policy enforcement (runtime) -**Runtime redaction mechanisms**: -1. **Password redaction**: `crates/pdftract-core/src/parser/stream.rs:3167` - Debug/Serialize redacts password -2. **Header redaction**: `crates/pdftract-cli/src/mcp/http.rs:641-657` - `redact_headers_for_log()` redacts Authorization/Cookie/Proxy-Authorization -3. **Panic hook redaction**: `crates/pdftract-cli/src/panic_hook.rs:58-79` - Redacts SecretString from backtraces -4. **Encryption debug redaction**: `crates/pdftract-core/src/encryption/aes_256.rs:423-427` - Debug redacts encryption keys +### 3. Audit Middleware ✅ -### ✅ TH-08 test for log policy -**Test**: `tests/security/TH-08-log-audit.rs` -- Runs extraction with `RUST_LOG=trace` (maximum verbosity) -- Captures stderr (log output) -- Verifies no sensitive content appears in logs -- Tests for password, bearer token, PDF bytes, and sensitive headers +**File: `crates/pdftract-cli/src/middleware/audit.rs`** +- `audit_middleware`: Tower middleware for axum + - Extracts client IP from peer address (default) or X-Forwarded-For (with --trust-forwarded-for) + - Records request start time + - Extracts tool name from path + - Stores `RequestMetadata` in extensions for handlers +- `AuditState`: Holds optional AuditLogWriter and trust_forwarded_for flag +- `RequestMetadata`: Start time, client_ip, tool -### ✅ Rotation policy documented in --help output -**Documentation**: `crates/pdftract-cli/src/main.rs:306-320` -```text -## Audit log rotation +### 4. CLI Flags ✅ -pdftract does NOT rotate the audit log. Operators MUST configure logrotate(8) -to manage log file size and retention. A typical logrotate configuration: -``` +**File: `crates/pdftract-cli/src/main.rs`** +- Serve subcommand (lines 304-309): `--audit-log FILE` with rotation documentation +- MCP subcommand (lines 354-359): Same flag and documentation +- Trust-forwarded-for flag (lines 311-313): `--trust-forwarded-for` flag -### ✅ Fingerprint logged, NOT path/URL -**Implementation**: -- `crates/pdftract-core/src/audit.rs:44` - `fingerprint: Option` stores structural fingerprint -- `crates/pdftract-cli/src/serve.rs:582` - Extracts fingerprint from result -- `crates/pdftract-cli/src/serve.rs:604` - Logs fingerprint instead of path +**File: `crates/pdftract-cli/src/inspect/args.rs`** +- Inspect subcommand (lines 44-49): Same audit-log flag with rotation documentation -### ✅ AuditLogWriter is crash-safe -**Implementation**: `crates/pdftract-core/src/audit.rs:135-143` -```rust -pub fn write_record(&self, record: &AuditRecord) -> Result<()> { - let json = serde_json::to_string(record)?; - let mut writer = self.writer.lock()?; - writeln!(writer, "{}", json)?; - writer.flush()?; // Flush after each line for crash safety - Ok(()) -} -``` +### 5. Serve Mode Integration ✅ -## Middleware Integration +**File: `crates/pdftract-cli/src/serve.rs`** +- Lines 70-82: Imports AuditLogWriter and AuditState +- Lines 109-111: ServeState includes audit field +- Lines 117-125: ServeState::new creates AuditState with trust_forwarded_for support +- Lines 386-394: Creates AuditLogWriter from --audit-log flag +- Lines 419-421: Applies audit_middleware layer +- Lines 599-609: Writes audit log after successful extraction +- Lines 678-688: Writes audit log for text extraction -### HTTP Serve -**File**: `crates/pdftract-cli/src/middleware/audit.rs` -- Extracts client IP from peer address or X-Forwarded-For -- Stores RequestMetadata in request extensions -- Handlers write audit logs after extraction completes +### 6. MCP Stdio Mode Integration ✅ -### MCP HTTP -**File**: `crates/pdftract-cli/src/mcp/http.rs:254-291` -- Writes audit log after handling POST requests -- Includes client_ip, duration_ms, status, diagnostics +**File: `crates/pdftract-cli/src/mcp/stdio.rs`** +- Lines 349-365: Writes audit log with `client_ip: None` (stdio mode has no peer address) -### MCP Stdio -**File**: `crates/pdftract-cli/src/mcp/stdio.rs:361-370` -- Writes audit log for tools/call requests -- Omits client_ip (stdio mode has no client) +### 7. MCP HTTP Mode Integration ✅ -### Inspect -**File**: `crates/pdftract-cli/src/inspect/inspect.rs:79-94` -- Creates AuditLogWriter from args.audit_log -- Stores in InspectorState with AuditState wrapper +**File: `crates/pdftract-cli/src/mcp/http.rs`** +- Lines 255-292: Writes audit log with client_ip (HTTP mode has peer address) -## NEVER-Log Policy +### 8. Test Coverage ✅ -The following are NEVER logged at any level: -- Password values (redacted via SecretString) -- Bearer tokens (redacted via header redaction) -- PDF bytes (not logged; only fingerprint) -- Extracted text content (not logged; only metadata) -- Cookie/Authorization/Proxy-Authorization headers (redacted) +**File: `tests/security/TH-08-log-audit.rs`** +- Tests that PDF content is never logged at trace level +- Tests that password values are never logged +- Tests that bearer tokens are never logged +- Tests that PDF bytes are never logged +- Tests that sensitive headers are never logged -## Test Results +## Acceptance Criteria Verification -**Audit module tests**: All 6 tests passed -- test_audit_record_new -- test_audit_record_with_client_ip -- test_audit_record_with_diagnostics -- test_audit_record_add_diagnostic -- test_audit_record_serialize -- test_audit_log_writer_memory +| Criterion | Status | Evidence | +|-----------|--------|----------| +| `--audit-log FILE` flag on serve, mcp, inspect | ✅ | main.rs:304-309, 354-359; inspect/args.rs:44-49 | +| Per-request NDJSON line with all fields | ✅ | audit.rs:40-58 (AuditRecord) | +| Stdio MCP omits client_ip field | ✅ | stdio.rs:357 (None for client_ip) | +| NEVER-log policy enforcement | ✅ | log_policy.rs:48-161 (redaction) | +| Rotation policy in --help | ✅ | main.rs:306-308, 356-358; inspect/args.rs:45-47 | +| Fingerprint logged, not path/URL | ✅ | audit.rs:50-51 (fingerprint field) | +| AuditLogWriter crash-safe | ✅ | audit.rs:102-154 (BufWriter + flush) | +| Trust-forwarded-for flag | ✅ | main.rs:311-313; middleware/audit.rs:99-121 | +| TH-08 test exists | ✅ | tests/security/TH-08-log-audit.rs | ## Conclusion -All acceptance criteria for pdftract-4em4l are satisfied. The audit logging infrastructure is fully implemented and tested. +All acceptance criteria are met. The audit logging infrastructure is complete and operational across: +- `pdftract serve` (HTTP mode) +- `pdftract mcp --stdio` (stdio mode) +- `pdftract mcp --bind` (HTTP+SSE mode) +- `pdftract inspect` (inspector mode) + +The NEVER-log policy is enforced at runtime via `redact_audit_log_line()`, and the TH-08 test verifies compliance.