jedarden/pdftract

Author	SHA1	Message	Date
jedarden	e2c1e2817b	feat(pdftract-2i6rt): implement cache CLI subcommand and HTTP integration This commit implements Phase 6.9.6: surfacing the cache as user-visible CLI and HTTP affordances. ## Changes - Add `pdftract cache` subcommand with stats/clear/purge actions - `stats DIR`: show entry count, size, hit ratio, age distribution - `stats DIR --json`: emit JSON with same fields - `clear DIR`: delete all entries (preserves index.json/sentinel) - `purge DIR --older-than 30d`: delete entries older than duration - `purge DIR --version '<1.0.0'`: version constraint purge (stub) - Add global flags to extract-style subcommands - `--cache-dir DIR`: enable cache at directory - `--cache-size SIZE`: set LRU size limit (default 1 GiB) - `--no-cache`: disable cache for this call - Add `X-Pdftract-Cache: hit|miss|skipped` HTTP header on /extract endpoints - Set in response headers before body streaming - Add JSON metadata fields - `metadata.cache_status`: "hit" | "miss" | "skipped" - `metadata.cache_age_seconds`: integer seconds (present only on hit) ## Acceptance Criteria - ✅ pdftract cache stats on empty dir: "Entries: 0" - ✅ pdftract cache stats on populated dir: correct counts and ratios - ✅ pdftract cache clear -y: deletes entries, preserves index/sentinel - ✅ pdftract cache purge --older-than: deletes old entries - ✅ extract --cache-dir: metadata.cache_status populated - ✅ extract second run: cache_status "hit" with age - ✅ extract --no-cache: cache_status "skipped" - ✅ HTTP serve: X-Pdftract-Cache header present - ✅ --cache-size parsing: 4GiB → 4 * 1024^3 bytes ## Modules - crates/pdftract-cli/src/cache_cmd.rs: subcommand implementation - crates/pdftract-cli/src/serve.rs: HTTP handler integration - crates/pdftract-cli/src/main.rs: CLI flag definitions - crates/pdftract-core/src/cache/mod.rs: extract_with_cache() integration - crates/pdftract-core/src/extract.rs: cache_status metadata fields Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 06:33:43 -04:00
jedarden	8c9a940159	feat(pdftract-15pz8): implement multi-process safe cache operations Implements Phase 6.9.5: atomic file writes and concurrent access safety for multiple pdftract processes sharing the same cache directory. ## Changes - Add `multi_process.rs` module with atomic write/read primitives - Atomic write protocol: temp file + fsync + rename - Reader protocol with corruption handling (deletes corrupt entries) - Startup cleanup of stale temp files (> 1 hour old) - fsync control via PDFTRACT_CACHE_NO_FSYNC env var - No distributed locks - tolerates duplicated work on first-miss races ## Module structure - `Writer`: Atomic cache entry writes via temp + rename - `Reader`: Safe reads with decompression and corruption detection - `cleanup_stale_temp_files()`: Startup cleanup for crash-recovered temp files ## Acceptance criteria met - [x] Concurrent extractors on same fingerprint: both succeed; no deadlock - [x] Reader sees fully-decompressable entry always (never torn write) - [x] 8 concurrent writers writing 8 different keys: all materialize correctly - [x] Corrupt entry on disk: treated as miss; entry deleted - [x] Stale temp file > 1 hour old: cleaned up at startup - [x] Stress test: 4 processes × 100 iterations → no errors ## Tests - 18 tests in `multi_process.rs` - 92 total cache module tests pass Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 05:31:11 -04:00
jedarden	0a83ef9d93	fix(pdftract-15prh): fix LRU eviction test with valid 64-char opts hashes The test_eviction_sweep_performance test was using opts hashes with a ":<i>" suffix (e.g., "9b21c0ff...:<i>"), which exceeded the 64-character limit. This caused parse_opts_hash_from_filename to skip these entries during enumeration, resulting in zero cache size and no eviction. Fixed by generating valid 64-character hex opts hashes using the last 4 characters for the counter (format: "{}{:04x}", base_hash[:60], i). All 17 LRU tests now pass, including: - test_eviction_sweep_performance: evicts 1000 entries (100 MB) down to 40 MB (80% of 50 MB limit) - test_concurrent_touches: 100 threads, no garbled records - test_touch_performance: 1000 touches in < 100 ms - test_current_size_performance: enumerate 1000 entries in < 1 s - test_sentinel_rotation: rotates at 10 MB threshold Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 05:25:07 -04:00
jedarden	8ec8a8c271	test(pdftract-2xql8): add bomb protection detection test Adds test_bomb_protection_detection to verify the take() adapter correctly truncates decoded output at the size limit, preventing decompression bomb attacks. All acceptance criteria for pdftract-2xql8 remain PASS: - Round-trip, compression ratio, error handling all verified - Benchmarks exceed performance targets (encode/decode < 0.02s) Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-05-23 04:57:32 -04:00
jedarden	d873136439	feat(pdftract-2xql8): implement zstd compression encode/decode Phase 6.9.3: zstd compression for cache entries. - encode(): compress data with zstd level 3 (configurable via PDFTRACT_CACHE_ZSTD_LEVEL) - decode(): decompress with 256 MB bomb protection and magic-byte validation - encode_from_reader(): streaming variant for large inputs - decode_into_writer(): streaming variant with incremental bomb protection Acceptance criteria: - Round-trip: decode(encode(bytes)) == bytes (PASS) - Compression ratio: 5 MB -> <= 1.5 MB (PASS, ~4-5x achieved) - Decode of truncated frame -> Err (PASS) - Decode of >256 MB output -> Err (PASS) - Decode of empty input -> Err (PASS) - Decode of non-zstd magic bytes -> Err (PASS) - Benchmark: encode 1 MB < 5 ms (PASS) - Benchmark: decode 1 MB < 2 ms (PASS) See notes/pdftract-2xql8.md for details. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 04:54:16 -04:00
jedarden	6cf2d603ca	feat(pdftract-375xa): implement cache key construction Implement Phase 6.9.2: cache key construction from (PDF fingerprint, extraction options) pairs. The key is (fingerprint, opts_hash) where opts_hash is SHA-256 of canonical JSON serialization. Key features: - BTreeMap-based canonicalization for sorted keys - Float canonicalization (preserves integers, canonicalizes floats) - extraction_version included for cache invalidation on upgrades - Forward-compatible with future ExtractionOptions fields Acceptance criteria: - Same effective values → same hash - Toggle receipts off→lite → hash differs - Different version → hash differs - Sorted-key canonical JSON - Float canonical (0.5 == 0.500) - Documented invariant Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 04:50:33 -04:00
jedarden	624fc49290	feat(pdftract-172kr): implement filesystem layout for cache directory Implements Phase 6.9.1: the two-byte-prefix directory scheme that keeps any single directory under 65K entries even at millions of cached entries. Changes: - Add zstd dependency to Cargo.toml - Create cache module with layout.rs implementing path construction - Add CacheIndex struct for index.json metadata (schema version, timestamps) - Implement entry_path(), fingerprint_dir(), parse helpers - Add load_index()/save_index() for cache metadata persistence - Ensure mkdir -p semantics with ensure_fingerprint_dir() - 18 tests covering all acceptance criteria Acceptance criteria verified: ✓ entry_path produces correct two-level prefix layout ✓ Different opts_hashes for same fingerprint share fp_dir ✓ Different fingerprints with same prefix share first-level dir ✓ index.json round-trips with schema version check ✓ Future schema version rejects cache with clear error ✓ mkdir -p creates prefix dirs; idempotent on concurrent writes ✓ Unicode-correct path handling via std::path::PathBuf ✓ Path length stays under 4096 bytes Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-05-23 04:40:25 -04:00
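
The prefix layout from commit 624fc49290 might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `layout.rs`: the function names `sketch_fingerprint_dir` and `sketch_entry_path`, their signatures, the choice of four hex characters (two bytes) as a single first-level directory, and the `.zst` file extension are all hypothetical. Only the 64-character hex opts hash and the idea of a shared per-fingerprint directory come from the commit messages.

```rust
use std::path::{Path, PathBuf};

/// Assumption: the first-level directory is named after the first two bytes
/// (four hex characters) of the fingerprint, giving at most 65,536 such dirs.
const PREFIX_HEX_LEN: usize = 4;

/// True if `s` is non-empty and made only of lowercase hex digits.
fn is_lower_hex(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Hypothetical helper: `<root>/<first 4 hex chars>/<full fingerprint>`.
/// Returns None for fingerprints that are too short or not lowercase hex,
/// which also rules out path separators and `..` components.
pub fn sketch_fingerprint_dir(root: &Path, fingerprint: &str) -> Option<PathBuf> {
    if fingerprint.len() <= PREFIX_HEX_LEN || !is_lower_hex(fingerprint) {
        return None;
    }
    Some(root.join(&fingerprint[..PREFIX_HEX_LEN]).join(fingerprint))
}

/// Hypothetical helper: one file per opts hash inside the fingerprint dir.
/// The opts hash must be exactly 64 lowercase hex characters.
pub fn sketch_entry_path(root: &Path, fingerprint: &str, opts_hash: &str) -> Option<PathBuf> {
    if opts_hash.len() != 64 || !is_lower_hex(opts_hash) {
        return None;
    }
    let dir = sketch_fingerprint_dir(root, fingerprint)?;
    Some(dir.join(format!("{}.zst", opts_hash)))
}
```

With this shape, two opts hashes for the same fingerprint land in the same fingerprint directory, and two fingerprints that start with the same four hex characters share a first-level directory. An opts hash with a `:<i>` suffix, like the one fixed in 0a83ef9d93, is rejected rather than turned into a path.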

7 commits
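
The atomic write protocol named in commit 8c9a940159 (temp file + fsync + rename) can be sketched with the standard library alone. This is a minimal illustration, not the `Writer` in `multi_process.rs`: the names `sketch_atomic_write`, `write_then_rename`, and `TEMP_COUNTER`, the temp-file naming scheme, and the `fsync` boolean parameter are all assumptions made for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical per-process counter so two threads never pick the same temp name.
static TEMP_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Sketch: publish `bytes` at `dest` so that a reader sees no entry, the old
/// entry, or the complete new entry, but never a partially written file.
pub fn sketch_atomic_write(dest: &Path, bytes: &[u8], fsync: bool) -> io::Result<()> {
    let dir = dest.parent().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "destination has no parent")
    })?;
    let name = dest.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "destination has no file name")
    })?;
    fs::create_dir_all(dir)?;

    // The temp file lives next to the destination so the rename never has to
    // cross a filesystem boundary.
    let mut tmp_name = name.to_os_string();
    tmp_name.push(format!(
        ".tmp.{}.{}",
        std::process::id(),
        TEMP_COUNTER.fetch_add(1, Ordering::Relaxed)
    ));
    let tmp = dir.join(tmp_name);

    let result = write_then_rename(&tmp, dest, bytes, fsync);
    if result.is_err() {
        // Best-effort cleanup; a crash here would leave a stale temp file behind.
        let _ = fs::remove_file(&tmp);
    }
    result
}

fn write_then_rename(tmp: &Path, dest: &Path, bytes: &[u8], fsync: bool) -> io::Result<()> {
    let mut file = File::create(tmp)?;
    file.write_all(bytes)?;
    if fsync {
        // Flush file contents and metadata to disk before the entry becomes visible.
        file.sync_all()?;
    }
    drop(file);
    // The rename replaces any existing destination in a single step.
    fs::rename(tmp, dest)
}
```

Two writers racing on the same key each write their own temp file and the last rename wins, which matches the "no distributed locks, tolerate duplicated work" stance in the commit message. A caller could derive the `fsync` argument from an environment variable such as `PDFTRACT_CACHE_NO_FSYNC`; how the real code interprets that variable is not shown in the log.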