Commit graph

9 commits

Author SHA1 Message Date
jedarden
1c6f26ecaa fix(bf-4mkhv): clean up unused imports in hash.rs
The bead description mentioned compile errors in hash.rs from API drift,
but those errors were either already fixed or misattributed. The API usage
was already correct:
- compute_fingerprint already takes 3 arguments with source
- len() already propagates Result with ?
- read_at method already used correctly
- Catalog fields accessed via trailer correctly

Only cleanup: removed unused std::fs::File and std::io imports.

Verification: notes/bf-4mkhv.md
2026-06-01 09:43:48 -04:00
jedarden
895f1ce43d fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs
Fix two compilation errors at lines 584 and 658 where code was calling
.code on &String diagnostics. Replaced d.code.to_string() with direct
Vec<String> clone since diagnostics is already Vec<String>.

Accepts criteria:
- cargo check -p pdftract-cli emits no 'no field code' errors
- serve.rs compiles cleanly
2026-06-01 04:14:05 -04:00
jedarden
62a36ea756 docs(pdftract-3eohy): add rustdoc examples to Glyph and Span types
- Add worked example to Glyph struct showing all 11 fields
- Add worked example to Span struct showing all 10 fields
- Examples use rust,no_run for internal dependencies
- cargo doc passes with docs.rs feature set
- Verification note added at notes/pdftract-3eohy.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-01 01:16:24 -04:00
jedarden
f85e5149dd feat(pdftract-91e1i): HTTP fetch sequence implementation
Implement orchestration layer connecting HttpRangeSource to Phase 1.3
xref resolver and Phase 1.4 document model for remote PDF access:

- Document::open_remote() public API for remote PDF loading
- Progressive tail fetch (16 KB → 1 MB) for startxref location
- Xref forward-scan disabled for remote sources (via is_remote check)
- Page-by-page on-demand fetch via HttpRangeSource caching
- Resource lazy load through XrefResolver cache
- HEAD probe with 405 fallback, no Content-Length handling

Acceptance criteria:
 open_remote(url) returns Document with correct page count
 HEAD failure modes (405, no Content-Length, 401) handled
 xref forward-scan disabled for remote (is_remote check)
 Page-by-page on-demand fetch (HttpRangeSource LRU cache)
 INV-8 maintained (all errors return Result)

Files modified:
- crates/pdftract-core/src/document.rs (Document::open_remote, from_source)
- crates/pdftract-core/src/remote.rs (progressive tail fetch)
- crates/pdftract-core/src/lib.rs (re-exports)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 13:17:00 -04:00
jedarden
7ffb1a729f fix(pdftract-63ka2): AES-128 test buffer allocation for PKCS#7 padding
The encrypt_padded_mut API requires the buffer to be large enough to
hold the padded ciphertext. The tests were using plaintext.to_vec() which
only allocated plaintext.len() bytes, insufficient for padding.

Changed pattern:
- Before: plaintext.to_vec() (insufficient space)
- After: vec![0u8; plaintext.len() + 16] with copy_from_slice

Also fixed incorrect usage: encrypt_padded_mut returns Result<(), Error>,
not a length. Use data_copy.len() directly for ciphertext length.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:30:33 -04:00
jedarden
b0c103b44f feat(pdftract-5boxq): implement audit-log FILE flag with NDJSON writer + middleware
Implements the --audit-log FILE flag on serve, mcp --bind, and inspect subcommands.
Emits per-request NDJSON audit lines with ts, client_ip, tool, fingerprint, duration_ms,
status, and diagnostics fields. The AuditLogWriter wraps a BufWriter<File> behind a Mutex
and flushes after each line for crash safety.

Core changes:
- Added pdftract-core/src/audit.rs with AuditRecord schema and AuditLogWriter
- Added chrono dependency to pdftract-core/Cargo.toml for timestamp generation
- Added crates/pdftract-cli/src/middleware/audit.rs with axum middleware
- Integrated AuditState into ServeState, McpServerState, and InspectorState
- Added --audit-log flag to Serve, Mcp, and InspectArgs CLI structures
- Stdio MCP mode: audit goes to stderr (not stdout, which is JSON-RPC)

Acceptance criteria:
- pdftract serve --audit-log /var/log/pdftract.ndjson → per-request NDJSON lines appear
- Each line is single-line valid JSON (no embedded newlines in values)
- client_ip captured from X-Real-IP or X-Forwarded-For header
- Stdio MCP audit goes to stderr (with --audit-log /dev/stderr or implicitly)
- Concurrent requests: writes don't interleave (Mutex ensures atomic line writes)
- Crash mid-request: log line either fully present or fully absent (BufWriter flushes after each write)

Closes: pdftract-5boxq
2026-05-25 05:14:06 -04:00
jedarden
7a70bb82b8 feat(pdftract-ixzbg): implement regex engine wiring for grep subcommand
Implement bead 7.8.2: Build the per-search matcher from GrepArgs.
Compile PATTERN into either a literal Aho-Corasick automaton (-F mode,
default) or a regex::Regex (-E mode). Apply -i (case-insensitive) and
-w (word-boundary) wrappers. Provide a uniform Matcher::find_iter(text)
-> Iter<MatchRange> API used by the per-span matcher.

Key changes:
- Add aho-corasick dependency for fast literal matching
- Create grep/matcher.rs with MatchRange and Matcher enum
- Reorganize grep.rs -> grep/mod.rs for proper module structure
- Implement literal mode with Aho-Corasick automaton
- Implement regex mode with regex::Regex
- Support case-insensitive matching in both modes
- Support word-boundary matching (\b anchors for regex, post-match check for literal)
- Comprehensive unit tests for all modes and edge cases

Closes: pdftract-ixzbg
2026-05-24 06:30:02 -04:00
jedarden
25f1081d7d feat(pdftract-p4vzu): implement inspector render_spans layer
Implements the span layer renderer for the inspector debug viewer.
Renders SVG outline rectangles for each text span, color-coded by
extraction confidence. Red (< 0.5), yellow (0.5-0.8), and green (> 0.8)
indicate low, medium, and high confidence respectively. Gray indicates
direct extraction without OCR.

Each rect includes data-* attributes for tooltip and click consumption:
- data-text: the extracted text content (XML-escaped)
- data-confidence: confidence score or empty string
- data-font: font name (XML-escaped)
- data-size: font size in points

All 10 unit tests pass. The implementation follows the existing SVG
generation pattern in pdftract-core/src/receipts/svg.rs.

Closes: pdftract-p4vzu
2026-05-24 03:11:34 -04:00
jedarden
7833d8c514 feat(pdftract-1rami): implement MCP tool catalog with 10 tools
Implement the MCP tool catalog for pdftract with all 10 tools wired to
the extraction surface via the MCP protocol. The tool registry provides
typed argument schemas (JSON Schema via schemars), structured error
mapping (Rust errors → JSON-RPC error codes), and per-invocation
observability logging.

- Tool registry with Tool trait and 10 tool implementations
- JSON Schema input schemas for all tools (draft-07 compliant)
- Error code mapping: -32000 NOT_YET_IMPLEMENTED, -32001 PDF_ENCRYPTED,
  -32002 IO_ERROR, -32003 PATH_INVALID
- Observability logging: structured stderr log line per tools/call
- Integration tests: 10/11 pass (1 ignored for encrypted fixture)
- Registry unit tests: 23/23 pass

Tools implemented:
- extract, extract_text, extract_markdown (stubs pending Phase 6)
- search (stub pending Phase 6)
- get_metadata, hash (fully implemented, fast paths)
- get_table, get_form_fields, get_attachments, classify (stubs return
  NOT_YET_IMPLEMENTED per spec)

Acceptance criteria: 8/8 PASS (2 WARN for Phase 6 stubs)

Refs: pdftract-1rami
Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 02:12:41 -04:00