Commit graph

7 commits

Author SHA1 Message Date
jedarden
e2c1e2817b feat(pdftract-2i6rt): implement cache CLI subcommand and HTTP integration
This commit implements Phase 6.9.6: surfacing the cache as user-visible
CLI and HTTP affordances.

## Changes

- Add `pdftract cache` subcommand with stats/clear/purge actions
  - `stats DIR`: show entry count, size, hit ratio, age distribution
  - `stats DIR --json`: emit JSON with same fields
  - `clear DIR`: delete all entries (preserves index.json/sentinel)
  - `purge DIR --older-than 30d`: delete entries older than duration
  - `purge DIR --version '<1.0.0'`: version constraint purge (stub)

- Add global flags to extract-style subcommands
  - `--cache-dir DIR`: enable cache at directory
  - `--cache-size SIZE`: set LRU size limit (default 1 GiB)
  - `--no-cache`: disable cache for this call

- Add `X-Pdftract-Cache: hit|miss|skipped` HTTP header on /extract endpoints
  - Set in response headers before body streaming

- Add JSON metadata fields
  - `metadata.cache_status`: "hit" | "miss" | "skipped"
  - `metadata.cache_age_seconds`: integer seconds (present only on hit)
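
The header value and the two metadata fields listed above could all be derived from one small type. A minimal sketch, assuming a hypothetical `CacheStatus` enum (the commit does not show the actual type or its method names):

```rust
/// Hypothetical model of the three cache outcomes named in this commit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CacheStatus {
    /// Entry was served from the cache; carries its age in whole seconds.
    Hit { age_seconds: u64 },
    /// Cache was consulted but held no usable entry.
    Miss,
    /// Cache was not consulted (for example, `--no-cache`).
    Skipped,
}

impl CacheStatus {
    /// Value for the `X-Pdftract-Cache` header and `metadata.cache_status`.
    pub fn as_str(&self) -> &'static str {
        match self {
            CacheStatus::Hit { .. } => "hit",
            CacheStatus::Miss => "miss",
            CacheStatus::Skipped => "skipped",
        }
    }

    /// Value for `metadata.cache_age_seconds`; `None` unless this is a hit.
    pub fn age_seconds(&self) -> Option<u64> {
        match self {
            CacheStatus::Hit { age_seconds } => Some(*age_seconds),
            _ => None,
        }
    }
}
```

Carrying the age inside the `Hit` variant makes it impossible to report an age for a miss or a skipped lookup, which matches the "present only on hit" rule.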

## Acceptance Criteria

- `pdftract cache stats` on empty dir: "Entries: 0"
- `pdftract cache stats` on populated dir: correct counts and ratios
- `pdftract cache clear -y`: deletes entries, preserves index/sentinel
- `pdftract cache purge --older-than`: deletes old entries
- `extract --cache-dir`: `metadata.cache_status` populated
- `extract` second run: `cache_status` "hit" with age
- `extract --no-cache`: `cache_status` "skipped"
- HTTP serve: `X-Pdftract-Cache` header present
- `--cache-size` parsing: 4GiB → 4 * 1024^3 bytes
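
The `--cache-size` criterion (4GiB → 4 * 1024^3 bytes) can be illustrated with a small parser. This is a sketch only: `parse_size` is a hypothetical helper, and the set of accepted suffixes is an assumption rather than the CLI's documented grammar.

```rust
/// Parse a size such as "4GiB", "512MiB", or "1024" into bytes.
/// Hypothetical helper: the suffix set accepted by pdftract is an assumption.
pub fn parse_size(input: &str) -> Result<u64, String> {
    let s = input.trim();
    // Split at the first character that is not an ASCII digit.
    let split = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (digits, suffix) = s.split_at(split);
    let value: u64 = digits
        .parse()
        .map_err(|_| format!("invalid size number in {input:?}"))?;
    // Binary (power-of-two) multipliers only.
    let multiplier: u64 = match suffix.trim() {
        "" | "B" => 1,
        "KiB" => 1 << 10,
        "MiB" => 1 << 20,
        "GiB" => 1 << 30,
        "TiB" => 1 << 40,
        other => return Err(format!("unknown size suffix {other:?}")),
    };
    // Reject values that do not fit in a u64 instead of wrapping.
    value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("size {input:?} overflows u64"))
}
```

With this sketch, `parse_size("4GiB")` yields `Ok(4294967296)`, and an input with no digits or an unknown suffix is rejected.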

## Modules

- crates/pdftract-cli/src/cache_cmd.rs: subcommand implementation
- crates/pdftract-cli/src/serve.rs: HTTP handler integration
- crates/pdftract-cli/src/main.rs: CLI flag definitions
- crates/pdftract-core/src/cache/mod.rs: extract_with_cache() integration
- crates/pdftract-core/src/extract.rs: cache_status metadata fields

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 06:33:43 -04:00
jedarden
8c9a940159 feat(pdftract-15pz8): implement multi-process safe cache operations
Implements Phase 6.9.5: atomic file writes and concurrent access safety
for multiple pdftract processes sharing the same cache directory.

## Changes

- Add `multi_process.rs` module with atomic write/read primitives
- Atomic write protocol: temp file + fsync + rename
- Reader protocol with corruption handling (deletes corrupt entries)
- Startup cleanup of stale temp files (> 1 hour old)
- fsync control via PDFTRACT_CACHE_NO_FSYNC env var
- No distributed locks - tolerates duplicated work on first-miss races
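
The temp file + fsync + rename protocol can be sketched with the standard library alone. `atomic_write` and the `.tmp.<pid>` naming scheme are assumptions made for illustration; the real `Writer` may differ.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `bytes` to `final_path` so that readers never observe a partial file.
/// Hypothetical helper; the temp-file naming scheme is an assumption.
pub fn atomic_write(final_path: &Path, bytes: &[u8], do_fsync: bool) -> io::Result<()> {
    // The temp file lives in the same directory so the rename stays on one filesystem.
    let file_name = final_path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(format!(".tmp.{}", std::process::id()));
    let tmp_path = final_path.with_file_name(tmp_name);

    let mut file = File::create(&tmp_path)?;
    file.write_all(bytes)?;
    if do_fsync {
        // Flush file contents to disk before the entry becomes visible.
        file.sync_all()?;
    }
    drop(file);

    // rename() replaces the destination atomically on POSIX filesystems.
    fs::rename(&tmp_path, final_path)
}
```

Because two processes racing on the same key each write their own temp file, the last rename wins and both copies are complete, which is why no lock is needed. A caller might derive `do_fsync` from whether `PDFTRACT_CACHE_NO_FSYNC` is set; the exact semantics of that variable are not stated here.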

## Module structure

- `Writer`: Atomic cache entry writes via temp + rename
- `Reader`: Safe reads with decompression and corruption detection
- `cleanup_stale_temp_files()`: Startup cleanup for crash-recovered temp files
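
A startup sweep in the spirit of `cleanup_stale_temp_files()` might look like the following. The function name, the `.tmp.` marker in file names, and the flat (non-recursive) scan are all assumptions; only the age-based rule comes from this commit.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Remove files in `dir` whose name contains ".tmp." and whose modification
/// time is older than `max_age`. Returns the number of files removed.
/// Hypothetical sketch: the marker and the single-directory scan are assumptions.
pub fn remove_stale_temp_files(dir: &Path, max_age: Duration) -> io::Result<usize> {
    let now = SystemTime::now();
    let mut removed = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name();
        if !name.to_string_lossy().contains(".tmp.") {
            continue; // not a temp file, leave it alone
        }
        let modified = entry.metadata()?.modified()?;
        // duration_since fails if the mtime is in the future; treat that as fresh.
        let age = now.duration_since(modified).unwrap_or(Duration::ZERO);
        if age > max_age {
            fs::remove_file(entry.path())?;
            removed += 1;
        }
    }
    Ok(removed)
}
```

Calling it with `Duration::from_secs(3600)` corresponds to the one-hour threshold; the generous threshold keeps the sweep from deleting a temp file that another live process is still writing.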

## Acceptance criteria met

- [x] Concurrent extractors on same fingerprint: both succeed; no deadlock
- [x] Reader always sees a fully decompressible entry (never a torn write)
- [x] 8 concurrent writers writing 8 different keys: all materialize correctly
- [x] Corrupt entry on disk: treated as miss; entry deleted
- [x] Stale temp file > 1 hour old: cleaned up at startup
- [x] Stress test: 4 processes × 100 iterations → no errors

## Tests

- 18 tests in `multi_process.rs`
- 92 total cache module tests pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:31:11 -04:00
jedarden
0a83ef9d93 fix(pdftract-15prh): fix LRU eviction test with valid 64-char opts hashes
The test_eviction_sweep_performance test was using opts hashes with
a ":<i>" suffix (e.g., "9b21c0ff...:<i>"), which exceeded the 64-character
limit. This caused parse_opts_hash_from_filename to skip these entries
during enumeration, resulting in zero cache size and no eviction.

Fixed by generating valid 64-character hex opts hashes that use the last
4 characters for the counter: format!("{}{:04x}", &base_hash[..60], i).
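
As an illustration of the fixed hash format, here is a standalone sketch. `counter_hash` is a hypothetical name; it assumes `base_hash` is ASCII hex with at least 60 characters and panics otherwise.

```rust
/// Build a 64-character hex string by keeping the first 60 characters of
/// `base_hash` and encoding `i` in the last four hex digits.
/// Panics if `base_hash` is shorter than 60 bytes.
pub fn counter_hash(base_hash: &str, i: u16) -> String {
    // A u16 never needs more than four hex digits, so the length stays at 64.
    format!("{}{:04x}", &base_hash[..60], i)
}
```

Taking the counter as `u16` keeps every generated name at exactly 64 characters, so a filename parser that enforces the 64-character limit accepts all of them.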

All 17 LRU tests now pass, including:
- test_eviction_sweep_performance: evicts 1000 entries (100 MB) down to 40 MB (80% of 50 MB limit)
- test_concurrent_touches: 100 threads, no garbled records
- test_touch_performance: 1000 touches in < 100 ms
- test_current_size_performance: enumerate 1000 entries in < 1 s
- test_sentinel_rotation: rotates at 10 MB threshold
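
The eviction figures above (100 MB down to 40 MB, 80% of a 50 MB limit) suggest a sweep that removes the least recently used entries until a low-water mark is reached. A minimal sketch of that planning step, assuming a hypothetical `Entry` record and assuming the sweep only starts once the total exceeds the limit:

```rust
/// A cache entry as seen by the sweep; both fields are hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub size_bytes: u64,
    /// Last-access time as seconds since some epoch; smaller means older.
    pub last_access: u64,
}

/// Return the indices of entries to evict, oldest first, so that the
/// remaining total is at most 80% of `limit_bytes`. Evicts nothing while
/// the total is within `limit_bytes`.
pub fn plan_eviction(entries: &[Entry], limit_bytes: u64) -> Vec<usize> {
    let mut total: u64 = entries.iter().map(|e| e.size_bytes).sum();
    if total <= limit_bytes {
        return Vec::new();
    }
    let target = limit_bytes / 5 * 4; // 80% low-water mark
    let mut order: Vec<usize> = (0..entries.len()).collect();
    order.sort_by_key(|&i| entries[i].last_access);

    let mut evict = Vec::new();
    for i in order {
        if total <= target {
            break;
        }
        total -= entries[i].size_bytes;
        evict.push(i);
    }
    evict
}
```

Sweeping below the limit rather than exactly to it means the next few writes do not immediately trigger another sweep.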

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 05:25:07 -04:00
jedarden
8ec8a8c271 test(pdftract-2xql8): add bomb protection detection test
Adds test_bomb_protection_detection to verify the take() adapter
correctly truncates decoded output at the size limit, preventing
decompression bomb attacks.
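
The `take()`-based guard can be shown without a decoder, since `std::io::Read::take` works on any reader. In this sketch `read_capped` is a hypothetical helper, and `reader` stands in for a streaming zstd decoder:

```rust
use std::io::{self, Read};

/// Read at most `limit` bytes from `reader`, returning an error if the
/// stream holds more than that.
pub fn read_capped<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // Allow one byte past the limit so "exactly at the limit" and
    // "over the limit" can be told apart.
    reader.take(limit.saturating_add(1)).read_to_end(&mut out)?;
    if out.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decoded output exceeds size limit",
        ));
    }
    Ok(out)
}
```

Memory use is bounded by `limit + 1` bytes no matter how large the decoded stream claims to be, so even an endless stream is rejected after a fixed amount of work.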

All acceptance criteria for pdftract-2xql8 remain PASS:
- Round-trip, compression ratio, error handling all verified
- Benchmarks exceed performance targets (encode/decode < 0.02s)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 04:57:32 -04:00
jedarden
d873136439 feat(pdftract-2xql8): implement zstd compression encode/decode
Phase 6.9.3: zstd compression for cache entries.

- encode(): compress data with zstd level 3 (configurable via PDFTRACT_CACHE_ZSTD_LEVEL)
- decode(): decompress with 256 MB bomb protection and magic-byte validation
- encode_from_reader(): streaming variant for large inputs
- decode_into_writer(): streaming variant with incremental bomb protection
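
Magic-byte validation can be done before any decompression work. Every zstd frame starts with the magic number 0xFD2FB528, stored little-endian. The helper below is a hypothetical sketch, not the commit's `decode()`, and it does not recognise skippable frames:

```rust
/// Magic number that opens every zstd frame, as it appears on disk
/// (0xFD2FB528 stored little-endian).
pub const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

/// Cheap pre-check to run before handing bytes to a zstd decoder.
pub fn check_zstd_magic(bytes: &[u8]) -> Result<(), &'static str> {
    if bytes.is_empty() {
        return Err("empty input");
    }
    if !bytes.starts_with(&ZSTD_MAGIC) {
        return Err("not a zstd frame");
    }
    Ok(())
}
```

Failing fast here gives distinct errors for empty and non-zstd input without involving the decoder at all.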

Acceptance criteria:
- Round-trip: decode(encode(bytes)) == bytes (PASS)
- Compression ratio: 5 MB -> <= 1.5 MB (PASS, ~4-5x achieved)
- Decode of truncated frame -> Err (PASS)
- Decode of >256 MB output -> Err (PASS)
- Decode of empty input -> Err (PASS)
- Decode of non-zstd magic bytes -> Err (PASS)
- Benchmark: encode 1 MB < 5 ms (PASS)
- Benchmark: decode 1 MB < 2 ms (PASS)

See notes/pdftract-2xql8.md for details.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 04:54:16 -04:00
jedarden
6cf2d603ca feat(pdftract-375xa): implement cache key construction
Implement Phase 6.9.2: cache key construction from (PDF fingerprint,
extraction options) pairs. The key is (fingerprint, opts_hash) where
opts_hash is SHA-256 of canonical JSON serialization.

Key features:
- BTreeMap-based canonicalization for sorted keys
- Float canonicalization (preserves integers, canonicalizes floats)
- extraction_version included for cache invalidation on upgrades
- Forward-compatible with future ExtractionOptions fields
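
The canonicalization step can be sketched with the standard library: a `BTreeMap` yields keys in sorted order, and formatting an `f64` with `{}` prints the shortest digits that round-trip. The `Value` enum and option names below are hypothetical, and the SHA-256 step is omitted because it needs an external crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical option value; the real ExtractionOptions fields are not shown here.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bool(bool),
    Int(i64),
    /// Must be finite; NaN and infinities have no JSON representation.
    Float(f64),
    Str(String),
}

/// Escape a string for JSON output.
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Serialize options with sorted keys and no insignificant whitespace.
pub fn canonical_json(opts: &BTreeMap<String, Value>) -> String {
    let fields: Vec<String> = opts
        .iter() // BTreeMap iterates in ascending key order
        .map(|(k, v)| {
            let rendered = match v {
                Value::Bool(b) => b.to_string(),
                Value::Int(i) => i.to_string(),
                // Values parsed from "0.5" and "0.500" render identically.
                Value::Float(f) => format!("{}", f),
                Value::Str(s) => json_string(s),
            };
            format!("{}:{}", json_string(k), rendered)
        })
        .collect();
    format!("{{{}}}", fields.join(","))
}
```

Hashing the returned string would then give an `opts_hash` that is independent of insertion order and of how a float was originally written.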

Acceptance criteria:
- Same effective values → same hash
- Toggle receipts off→lite → hash differs
- Different version → hash differs
- Sorted-key canonical JSON
- Float canonical (0.5 == 0.500)
- Documented invariant

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 04:50:33 -04:00
jedarden
624fc49290 feat(pdftract-172kr): implement filesystem layout for cache directory
Implements Phase 6.9.1: the two-byte-prefix directory scheme that keeps
any single directory under 65K entries even at millions of cached entries.

Changes:
- Add zstd dependency to Cargo.toml
- Create cache module with layout.rs implementing path construction
- Add CacheIndex struct for index.json metadata (schema version, timestamps)
- Implement entry_path(), fingerprint_dir(), parse helpers
- Add load_index()/save_index() for cache metadata persistence
- Ensure mkdir -p semantics with ensure_fingerprint_dir()
- 18 tests covering all acceptance criteria

Acceptance criteria verified:
✓ entry_path produces correct two-level prefix layout
✓ Different opts_hashes for same fingerprint share fp_dir
✓ Different fingerprints with same prefix share first-level dir
✓ index.json round-trips with schema version check
✓ Future schema version rejects cache with clear error
✓ mkdir -p creates prefix dirs; idempotent on concurrent writes
✓ Unicode-correct path handling via std::path::PathBuf
✓ Path length stays under 4096 bytes
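
For illustration, a two-level prefix layout could be built as below. The functions reuse the names `fingerprint_dir` and `entry_path`, but the signatures, the exact directory shape (`<root>/<fp[0..2]>/<fp[2..4]>/<fp>/`), and the `.zst` extension are assumptions; the real `layout.rs` is not shown in this commit.

```rust
use std::path::{Path, PathBuf};

/// Directory that holds every entry for one fingerprint.
/// Assumed layout: <root>/<fp[0..2]>/<fp[2..4]>/<fp>/
pub fn fingerprint_dir(root: &Path, fingerprint: &str) -> Option<PathBuf> {
    // Require at least four ASCII hex characters so slicing is safe.
    if fingerprint.len() < 4 || !fingerprint.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    Some(
        root.join(&fingerprint[0..2])
            .join(&fingerprint[2..4])
            .join(fingerprint),
    )
}

/// Full path of one cache entry. The ".zst" extension is an assumption.
pub fn entry_path(root: &Path, fingerprint: &str, opts_hash: &str) -> Option<PathBuf> {
    Some(fingerprint_dir(root, fingerprint)?.join(format!("{opts_hash}.zst")))
}
```

Under these assumptions, entries for the same fingerprint land in one directory, and fingerprints that share their first two hex characters share the first-level directory, mirroring the criteria listed above.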

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 04:40:25 -04:00