pdftract/notes/pdftract-375xa.md
jedarden 6cf2d603ca feat(pdftract-375xa): implement cache key construction
Implement Phase 6.9.2: cache key construction from (PDF fingerprint,
extraction options) pairs. The key is (fingerprint, opts_hash) where
opts_hash is SHA-256 of canonical JSON serialization.

Key features:
- BTreeMap-based canonicalization for sorted keys
- Float canonicalization (preserves integers, canonicalizes floats)
- extraction_version included for cache invalidation on upgrades
- Forward-compatible with future ExtractionOptions fields

Acceptance criteria:
- Same effective values → same hash
- Toggle receipts off→lite → hash differs
- Different version → hash differs
- Sorted-key canonical JSON
- Float canonical (0.5 == 0.500)
- Documented invariant

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 04:50:33 -04:00


# pdftract-375xa: Cache Key Construction
## Summary
Implemented Phase 6.9.2: Cache key construction for (PDF fingerprint, extraction options) pairs. The key is a tuple (fingerprint, opts_hash) where opts_hash is the SHA-256 of the canonical JSON serialization of ExtractionOptions.
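
The shape of the key can be sketched as follows. This is an illustration under stated assumptions, not the code in `key.rs`: to stay dependency-free it uses FNV-1a as a stand-in for SHA-256, it takes option values as pre-encoded JSON fragments, and `CacheKey`, `cache_key`, `options`, and `fnv1a_64` are hypothetical names (the `canonical_json` signature shown here is also assumed).

```rust
use std::collections::BTreeMap;

/// Hypothetical cache key mirroring the (fingerprint, opts_hash) tuple.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct CacheKey {
    pub fingerprint: String,
    pub opts_hash: String,
}

/// FNV-1a (64-bit). Used here ONLY as a dependency-free stand-in for SHA-256.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Render options as a compact JSON object. `BTreeMap` iterates in key order,
/// so the output does not depend on insertion order.
/// Assumes keys need no escaping and values are already valid JSON fragments.
fn canonical_json(opts: &BTreeMap<String, String>) -> String {
    let body: Vec<String> = opts
        .iter()
        .map(|(k, v)| format!("\"{}\":{}", k, v))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// Combine the PDF fingerprint with the hash of the canonical options.
pub fn cache_key(fingerprint: &str, opts: &BTreeMap<String, String>) -> CacheKey {
    let json = canonical_json(opts);
    CacheKey {
        fingerprint: fingerprint.to_string(),
        opts_hash: format!("{:016x}", fnv1a_64(json.as_bytes())),
    }
}

/// Build an options map from (key, JSON value) pairs given in any order.
pub fn options(pairs: &[(&str, &str)]) -> BTreeMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

Two maps built from the same pairs in a different order yield equal keys, while changing `receipts` or `extraction_version` changes `opts_hash`.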
## Changes Made
### File: `crates/pdftract-core/src/cache/key.rs`
1. **Enhanced canonicalization implementation**:
   - Replaced struct-based serialization with a `BTreeMap`-based approach
   - Added `canonical_json()` helper for testing sorted-key canonicalization
   - Added `canonical_json_value()` for recursive canonicalization
2. **Key invariants implemented**:
   - Keys are sorted lexicographically using `BTreeMap`
   - Numbers have a canonical representation: integers stay integers, and floats serialize to their shortest decimal form, so `0.5` and `0.500` hash identically
   - Booleans are always `true`/`false` (handled by serde_json)
   - `extraction_version` is included for cache invalidation on upgrades
3. **Added comprehensive tests**:
   - `test_acceptance_same_effective_values_same_hash` - AC for identical values
   - `test_acceptance_receipts_off_to_lite_changes_hash` - AC for receipts toggle
   - `test_acceptance_different_version_changes_hash` - AC for version pinning
   - `test_acceptance_sorted_key_canonical` - AC for sorted keys
   - `test_acceptance_float_canonical` - AC for float canonicalization
   - `test_acceptance_float_canonical_edge_cases` - Edge cases for floats
   - `test_invariant_documented` - Meta-test documenting the invariant
   - `test_canonical_json_nested_objects` - Nested object sorting
   - `test_canonical_json_arrays` - Array handling
   - `test_canonical_json_float_arrays` - Float array handling
   - `test_canonical_json_mixed` - Mixed nested structures
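
The recursive canonicalization exercised by the nested-object, array, and mixed tests might look roughly like the sketch below. It assumes a small hand-rolled `Json` enum in place of `serde_json::Value` so that it compiles without dependencies; the enum and the exact signature of `canonical_json_value` are illustrative, not taken from `key.rs`.

```rust
use std::collections::BTreeMap;

/// Minimal JSON value for illustration (hypothetical; stands in for `serde_json::Value`).
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<Json>),
    /// Key/value pairs in arbitrary order, as they might arrive from a struct.
    Object(Vec<(String, Json)>),
}

/// Recursively render a value as compact JSON with object keys sorted.
/// Non-finite floats and control characters in strings are not handled here.
pub fn canonical_json_value(value: &Json) -> String {
    match value {
        Json::Bool(b) => b.to_string(),
        Json::Int(i) => i.to_string(),
        // `{:?}` prints the shortest round-trip form and keeps a trailing `.0`.
        Json::Float(f) => format!("{:?}", f),
        Json::Str(s) => format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\"")),
        // Arrays keep their element order; only the elements are canonicalized.
        Json::Array(items) => {
            let parts: Vec<String> = items.iter().map(canonical_json_value).collect();
            format!("[{}]", parts.join(","))
        }
        // Objects are re-collected into a BTreeMap, which iterates in key order.
        Json::Object(pairs) => {
            let sorted: BTreeMap<&str, &Json> =
                pairs.iter().map(|(k, v)| (k.as_str(), v)).collect();
            let parts: Vec<String> = sorted
                .iter()
                .map(|(k, v)| format!("\"{}\":{}", k, canonical_json_value(v)))
                .collect();
            format!("{{{}}}", parts.join(","))
        }
    }
}
```

Sorting happens at every object level, while array order is treated as significant and left untouched.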
## Acceptance Criteria Status
| Criterion | Status | Notes |
|-----------|--------|-------|
| Same effective values → same hash | ✅ PASS | `test_acceptance_same_effective_values_same_hash` |
| Toggle receipts off→lite → hash differs | ✅ PASS | `test_acceptance_receipts_off_to_lite_changes_hash` |
| Different version → hash differs | ✅ PASS | `test_acceptance_different_version_changes_hash` |
| Sorted-key canonical | ✅ PASS | `test_acceptance_sorted_key_canonical` |
| Float canonical (0.5 == 0.500) | ✅ PASS | `test_acceptance_float_canonical` |
| Documented invariant | ✅ PASS | `test_invariant_documented` |
## Future Considerations
1. **OCR field** - When an `ocr` field is added to `ExtractionOptions`, it will automatically be included in the canonical JSON
2. **Password field** - When a password field is added, it should contribute a stable token (e.g., `password_set: bool`) instead of the literal password, to avoid leaking sensitive data into cache directory entries
3. **Option\<T\> fields** - No canonicalization changes are needed: `None` and `Some(default)` produce the same hash as long as defaults are filled in before canonicalization
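
Items 2 and 3 can be illustrated with a small sketch. Everything in it is an assumption made for the example: the two-field `ExtractionOptions`, the `"off"` default for `receipts`, and the `canonical_fields` helper are hypothetical and do not describe the current struct.

```rust
use std::collections::BTreeMap;

/// Hypothetical options struct; the field names are illustrative only.
#[derive(Debug, Clone, Default)]
pub struct ExtractionOptions {
    pub receipts: Option<String>,
    pub password: Option<String>,
}

/// Assumed default for `receipts` when the caller leaves it unset.
const DEFAULT_RECEIPTS: &str = "off";

/// Build the sorted field map that would be fed to canonicalization.
pub fn canonical_fields(opts: &ExtractionOptions) -> BTreeMap<&'static str, String> {
    let mut fields = BTreeMap::new();
    // Fill the default first so `None` and `Some(default)` are indistinguishable.
    let receipts = opts.receipts.as_deref().unwrap_or(DEFAULT_RECEIPTS);
    fields.insert("receipts", format!("\"{}\"", receipts));
    // Record only whether a password was supplied, never its value.
    fields.insert("password_set", opts.password.is_some().to_string());
    fields
}
```

Because the default is applied before the map is built, an unset `receipts` and an explicit `Some("off")` yield identical fields, and the password text never reaches the canonical form.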
## Implementation Notes
- Used `BTreeMap` for guaranteed lexicographic key ordering
- Integer representation is preserved (not converted to float)
- Float canonicalization is handled by serde_json's default behavior (shortest decimal representation)
- The implementation is forward-compatible with new fields added to ExtractionOptions
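
The integer and float notes above can be demonstrated in isolation. In this sketch Rust's own `f64` formatting stands in for serde_json's number serialization (an assumption made to avoid dependencies), and `Number`, `parse_number`, and `canonical_number` are hypothetical names.

```rust
/// Hypothetical numeric value: integers and floats are kept as separate variants.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Number {
    Int(i64),
    Float(f64),
}

/// Parse a decimal literal, keeping integer literals as integers.
/// Non-finite floats are rejected because JSON cannot represent them.
pub fn parse_number(text: &str) -> Option<Number> {
    if let Ok(i) = text.parse::<i64>() {
        return Some(Number::Int(i));
    }
    text.parse::<f64>()
        .ok()
        .filter(|f| f.is_finite())
        .map(Number::Float)
}

/// Render the canonical text for a number.
pub fn canonical_number(n: Number) -> String {
    match n {
        // Integers are printed as-is and never gain a fractional part.
        Number::Int(i) => i.to_string(),
        // `{:?}` for f64 emits the shortest string that round-trips.
        Number::Float(f) => format!("{:?}", f),
    }
}
```

With this scheme `0.5` and `0.500` both render as `0.5`, while `300` stays `300` rather than becoming `300.0`.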