Implement Phase 6.9.2: cache key construction from (PDF fingerprint, extraction options) pairs.

The key is (fingerprint, opts_hash), where opts_hash is the SHA-256 of the canonical JSON serialization.

Key features:
- BTreeMap-based canonicalization for sorted keys
- Float canonicalization (preserves integers, canonicalizes floats)
- extraction_version included for cache invalidation on upgrades
- Forward-compatible with future ExtractionOptions fields

Acceptance criteria:
- Same effective values → same hash
- Toggle receipts off→lite → hash differs
- Different version → hash differs
- Sorted-key canonical JSON
- Float canonical (0.5 == 0.500)
- Documented invariant

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-375xa: Cache Key Construction
Summary
Implemented Phase 6.9.2: Cache key construction for (PDF fingerprint, extraction options) pairs. The key is a tuple (fingerprint, opts_hash) where opts_hash is the SHA-256 of the canonical JSON serialization of ExtractionOptions.
Changes Made
File: crates/pdftract-core/src/cache/key.rs
- Enhanced canonicalization implementation:
  - Replaced struct-based serialization with a BTreeMap-based approach
  - Added canonical_json() helper for testing sorted-key canonicalization
  - Added canonical_json_value() for recursive canonicalization
- Key invariants implemented:
  - Keys are sorted lexicographically using BTreeMap
  - Floats have a canonical representation (preserves integers, canonicalizes floats)
  - Booleans are always true/false (handled by serde_json)
  - extraction_version is included for cache invalidation on upgrades
- Added comprehensive tests:
  - test_acceptance_same_effective_values_same_hash - AC for identical values
  - test_acceptance_receipts_off_to_lite_changes_hash - AC for receipts toggle
  - test_acceptance_different_version_changes_hash - AC for version pinning
  - test_acceptance_sorted_key_canonical - AC for sorted keys
  - test_acceptance_float_canonical - AC for float canonicalization
  - test_acceptance_float_canonical_edge_cases - Edge cases for floats
  - test_invariant_documented - Meta-test documenting the invariant
  - test_canonical_json_nested_objects - Nested object sorting
  - test_canonical_json_arrays - Array handling
  - test_canonical_json_float_arrays - Float array handling
  - test_canonical_json_mixed - Mixed nested structures
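The sorted-key and float invariants listed above can be illustrated with a minimal, std-only sketch. It is not the code in key.rs: the Value enum here is a hand-rolled stand-in for serde_json's value type, the field name min_confidence is hypothetical, and string quoting uses Rust's Debug formatting rather than full JSON escaping.

```rust
use std::collections::BTreeMap;

/// Minimal JSON-like value (an illustrative stand-in, not serde_json::Value).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Int(i64),
    Float(f64),
    Str(String),
    Array(Vec<Value>),
    // BTreeMap iterates in sorted key order regardless of insertion order.
    Object(BTreeMap<String, Value>),
}

/// Recursively render a value as canonical text.
fn canonical_json(v: &Value) -> String {
    match v {
        Value::Bool(b) => b.to_string(),
        // Integers stay integers; they are never routed through f64.
        Value::Int(i) => i.to_string(),
        // Debug formatting of f64 prints a short round-trippable form,
        // so 0.5 and 0.500 (the same f64) render identically.
        Value::Float(f) => format!("{:?}", f),
        // Simplification: Debug quoting, not full JSON string escaping.
        Value::Str(s) => format!("{:?}", s),
        Value::Array(items) => {
            let parts: Vec<String> = items.iter().map(canonical_json).collect();
            format!("[{}]", parts.join(","))
        }
        Value::Object(map) => {
            let parts: Vec<String> = map
                .iter()
                .map(|(key, val)| format!("{:?}:{}", key, canonical_json(val)))
                .collect();
            format!("{{{}}}", parts.join(","))
        }
    }
}
```

Building the same object with keys inserted in a different order, or with 0.500 in place of 0.5, yields the same string, which is the property the hash depends on.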
Acceptance Criteria Status
| Criterion | Status | Notes |
|---|---|---|
| Same effective values → same hash | ✅ PASS | test_acceptance_same_effective_values_same_hash |
| Toggle receipts off→lite → hash differs | ✅ PASS | test_acceptance_receipts_off_to_lite_changes_hash |
| Different version → hash differs | ✅ PASS | test_acceptance_different_version_changes_hash |
| Sorted-key canonical | ✅ PASS | test_acceptance_sorted_key_canonical |
| Float canonical (0.5 == 0.500) | ✅ PASS | test_acceptance_float_canonical |
| Documented invariant | ✅ PASS | test_invariant_documented |
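
The first three criteria can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the standard library has no SHA-256, so a 64-bit FNV-1a hash is swapped in as a stand-in, and the CacheKey struct and cache_key function are hypothetical names. The canonical string is written by hand with keys already in sorted order.

```rust
/// 64-bit FNV-1a. A std-only stand-in for SHA-256, used for illustration only.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical key shape: (fingerprint, opts_hash).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct CacheKey {
    fingerprint: String,
    opts_hash: String,
}

/// Build a key from a fingerprint and two option values.
fn cache_key(fingerprint: &str, receipts: &str, extraction_version: &str) -> CacheKey {
    // Keys appear in lexicographic order: "extraction_version" < "receipts".
    let canonical = format!(
        "{{\"extraction_version\":{:?},\"receipts\":{:?}}}",
        extraction_version, receipts
    );
    CacheKey {
        fingerprint: fingerprint.to_string(),
        // Hex-encode the stand-in hash; real code would hex-encode a SHA-256 digest.
        opts_hash: format!("{:016x}", fnv1a64(canonical.as_bytes())),
    }
}
```

Because extraction_version is part of the hashed text, bumping it changes opts_hash and old cache entries are simply never looked up again.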
Future Considerations
- OCR field - When an ocr field is added to ExtractionOptions, it will automatically be included in the canonical JSON
- Password field - When added, it should use a stable token (e.g., password_set: bool) instead of the literal password, to avoid leaking sensitive data in cache directory entries
- Option<T> fields - None and Some(default) produce the same hash, provided default-filling is done before canonicalization; without that step the two would serialize differently
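
The password and Option<T> points can be sketched together. Everything here is hypothetical: the RawOptions struct, the canonical_options function, and the assumption that the default for receipts is "off" are illustrative choices, not taken from the actual ExtractionOptions.

```rust
/// Hypothetical caller-supplied options; None means "use the default".
#[derive(Debug, Clone, Default)]
struct RawOptions {
    receipts: Option<String>,
    password: Option<String>,
}

/// Assumed default, for illustration only.
const DEFAULT_RECEIPTS: &str = "off";

/// Fill defaults first, then serialize with keys in sorted order.
fn canonical_options(raw: &RawOptions) -> String {
    // Default-filling: None and Some("off") collapse to the same value.
    let receipts = raw.receipts.as_deref().unwrap_or(DEFAULT_RECEIPTS);
    // Only record whether a password was supplied, never its contents.
    let password_set = raw.password.is_some();
    format!(
        "{{\"password_set\":{},\"receipts\":{:?}}}",
        password_set, receipts
    )
}
```

With this ordering of steps, an unset receipts field and an explicit "off" canonicalize identically, and the literal password never reaches the text that is hashed or stored.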
Implementation Notes
- Used BTreeMap for guaranteed lexicographic key ordering
- Integer representation is preserved (not converted to float)
- Float canonicalization is handled by serde_json's default behavior (shortest decimal representation)
- The implementation is forward-compatible with new fields added to ExtractionOptions