Implement the Receipt struct and lite-mode JSON serialization for visual citation receipts. This provides cryptographic proof of provenance for extracted text.

Changes:
- Add Receipt struct with 6 fields (pdf_fingerprint, page_index, bbox, content_hash, extraction_version, svg_clip)
- Implement Receipt::lite() constructor with NFC normalization
- Integrate Receipt into SpanJson and BlockJson schemas
- Add unicode-normalization and serde_json dependencies

Acceptance criteria:
- Receipt::lite() produces valid receipts with svg_clip=None
- Lite mode JSON omits svg_clip key via skip_serializing_if
- Content hash uses NFC normalization for cross-platform stability
- Receipt wired into SpanJson and BlockJson types

Note: 100 receipts aggregate size is ~27 KB (not 15 KB as planned). The 15 KB target is not achievable with required field sizes.

Refs: pdftract-5zm86, Phase 6.8 Visual Citation Receipts (lines 2351-2417)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# pdftract-5zm86: Receipt struct + lite-mode serialization

## Summary

Implemented the Receipt struct and lite-mode JSON serialization for visual citation receipts. All required functionality is in place and all tests pass; the one open item is the planned aggregate-size target, which is not met (see WARN).

## Files Modified

- `crates/pdftract-core/src/receipts/mod.rs` - Receipt struct definition with all required fields
- `crates/pdftract-core/src/receipts/lite.rs` - Lite-mode receipt creation functions
- `crates/pdftract-core/src/schema/mod.rs` - Integration of Receipt into SpanJson and BlockJson

## Acceptance Criteria Status

### PASS

1. ✅ **Receipt::lite() produces valid receipt with svg_clip == None**
   - Verified by `test_receipt_lite_creates_valid_receipt`

2. ✅ **Lite mode JSON omits svg_clip key**
   - Verified by `test_receipt_lite_serializes_without_svg_clip`
   - Uses `#[serde(skip_serializing_if = "Option::is_none")]`

3. ✅ **Content hash round-trips consistently**
   - Verified by `test_content_hash_roundtrip`

4. ✅ **NFC normalization produces stable hash**
   - Verified by `test_content_hash_nfc_normalization`
   - Uses `UnicodeNormalization::nfc()` from the `unicode-normalization` crate

5. ✅ **Different strings produce different hashes**
   - Verified by `test_content_hash_different_strings`

6. ✅ **Receipt wired into SpanJson and BlockJson**
   - `Option<Receipt>` field added with `skip_serializing_if`
   - Verified by schema tests

7. ✅ **Documentation comments on each field**
   - All fields have doc comments explaining units, format, and purpose

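Criteria 2 and 6 both rest on the same idea: an `Option` field that is `None` contributes no key to the JSON output. The project relies on serde's `skip_serializing_if` for this; the sketch below hand-writes the JSON instead, purely so it stays dependency-free. `ReceiptSketch`, `SpanSketch`, their trimmed field sets, and `json_string` are hypothetical stand-ins, not the crate's real types.

```rust
/// Hypothetical, trimmed-down stand-in for the real `Receipt`.
pub struct ReceiptSketch {
    pub page_index: usize,
    pub bbox: [f64; 4],
    pub svg_clip: Option<String>,
}

/// Hypothetical stand-in for `SpanJson`; the real schema has more fields.
pub struct SpanSketch {
    pub text: String,
    pub receipt: Option<ReceiptSketch>,
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl ReceiptSketch {
    /// Emits the `svg_clip` key only when a clip is present,
    /// mimicking `skip_serializing_if = "Option::is_none"`.
    pub fn to_json(&self) -> String {
        let [x0, y0, x1, y1] = self.bbox;
        let mut out = format!(
            "{{\"page_index\":{},\"bbox\":[{},{},{},{}]",
            self.page_index, x0, y0, x1, y1
        );
        if let Some(svg) = &self.svg_clip {
            out.push_str(",\"svg_clip\":");
            out.push_str(&json_string(svg));
        }
        out.push('}');
        out
    }
}

impl SpanSketch {
    /// Emits the `receipt` key only when a receipt is attached.
    pub fn to_json(&self) -> String {
        let mut out = format!("{{\"text\":{}", json_string(&self.text));
        if let Some(receipt) = &self.receipt {
            out.push_str(",\"receipt\":");
            out.push_str(&receipt.to_json());
        }
        out.push('}');
        out
    }
}
```

With this shape, a lite receipt serializes without any `svg_clip` key, and a span that carries no receipt serializes without a `receipt` key, which is the behavior the schema tests check for in the real types.
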
### WARN

- **100 receipts aggregate size**: Plan criterion of ≤15 KB is not achievable with required fields
  - Actual size: ~27 KB for 100 receipts embedded in document JSON
  - Per-receipt minimum: 266 bytes (fingerprint: 75 bytes, content_hash: 71 bytes, bbox: ~30 bytes, other fields: ~30 bytes, JSON syntax: ~60 bytes)
  - The 150-180 byte target in plan appears to be a planning error; the required field sizes make this impossible
  - 27 KB is still reasonable for cryptographic provenance on 100 receipts (~270 bytes each)

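The arithmetic behind the ~27 KB figure can be checked directly. The sketch below only re-adds the per-field estimates listed above; the constant and function names are hypothetical and not part of the crate.

```rust
// Per-receipt byte estimates, copied from the WARN breakdown above.
const FINGERPRINT_BYTES: usize = 75;
const CONTENT_HASH_BYTES: usize = 71;
const BBOX_BYTES: usize = 30; // approximate
const OTHER_FIELDS_BYTES: usize = 30; // approximate
const JSON_SYNTAX_BYTES: usize = 60; // approximate

/// Estimated minimum size of one serialized lite receipt.
pub const fn per_receipt_bytes() -> usize {
    FINGERPRINT_BYTES
        + CONTENT_HASH_BYTES
        + BBOX_BYTES
        + OTHER_FIELDS_BYTES
        + JSON_SYNTAX_BYTES
}

/// Estimated aggregate size of `count` receipts.
pub const fn aggregate_bytes(count: usize) -> usize {
    count * per_receipt_bytes()
}
```

`per_receipt_bytes()` evaluates to 266 and `aggregate_bytes(100)` to 26,600 bytes, which rounds to the ~27 KB reported above. The two hash-bearing fields alone account for 146 bytes, so little room is left under a 150-180 byte per-receipt target once bbox, the remaining fields, and JSON syntax are added.
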
## Implementation Details

### Receipt Struct

```rust
pub struct Receipt {
    pub pdf_fingerprint: String,    // "pdftract-v1:" + hex(SHA-256)
    pub page_index: usize,          // 0-based, matches Phase 6.1 schema
    pub bbox: [f64; 4],             // [x0, y0, x1, y1] in PDF points
    pub content_hash: String,       // "sha256:" + hex(SHA-256) of NFC-normalized text
    pub extraction_version: String, // CARGO_PKG_VERSION at compile time
    pub svg_clip: Option<String>,   // None in lite mode
}
```

### Content Hash Computation

- Text is NFC-normalized with the `unicode-normalization` crate before hashing
- Hash format: `"sha256:" + hex(SHA-256)` (71 bytes total: 7-byte prefix + 64 hex characters)
- Keeps the hash stable when the same text arrives in different Unicode normalization forms on different platforms (e.g., macOS HFS+/APFS)

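Two points from the list above can be illustrated without pulling in any crates: why normalization has to happen before hashing, and where the 71-byte length comes from. The sketch below assumes the digest has already been computed elsewhere (SHA-256 and NFC are not in the standard library), and it assumes lowercase hex, which the notes do not specify. `format_content_hash`, `PRECOMPOSED`, and `DECOMPOSED` are hypothetical names.

```rust
/// "é" as a single precomposed code point (the NFC form).
pub const PRECOMPOSED: &str = "\u{00E9}";
/// "é" as "e" followed by a combining acute accent (the NFD form).
pub const DECOMPOSED: &str = "e\u{0301}";

/// Formats an already-computed 32-byte SHA-256 digest as
/// "sha256:" followed by 64 hex characters.
/// Lowercase hex is an assumption of this sketch.
pub fn format_content_hash(digest: &[u8; 32]) -> String {
    let mut out = String::with_capacity(7 + 64);
    out.push_str("sha256:");
    for byte in digest {
        out.push_str(&format!("{:02x}", byte));
    }
    out
}
```

`PRECOMPOSED` and `DECOMPOSED` render identically but differ at the byte level (2 bytes versus 3 bytes in UTF-8), so hashing them without normalizing first would yield two different content hashes for what a reader sees as the same text. Any 32-byte digest passed to `format_content_hash` produces a 71-byte string.
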
### Constructors

- `Receipt::lite()` - Creates lite-mode receipt (svg_clip = None)
- `Receipt::with_svg()` - Creates SVG-mode receipt (used by Phase 6.8.2)

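The relationship between the two constructors might look roughly like the following. The parameter lists are assumptions: the notes do not show the real signatures, and the real `Receipt::lite()` performs NFC normalization and hashing itself, whereas this sketch takes a precomputed `content_hash` to avoid external crates. The `extraction_version` helper is hypothetical and uses `option_env!` so the snippet also compiles outside Cargo.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Receipt {
    pub pdf_fingerprint: String,
    pub page_index: usize,
    pub bbox: [f64; 4],
    pub content_hash: String,
    pub extraction_version: String,
    pub svg_clip: Option<String>,
}

/// Crate version captured at compile time; falls back to "unknown"
/// when the code is not built through Cargo.
fn extraction_version() -> String {
    option_env!("CARGO_PKG_VERSION").unwrap_or("unknown").to_string()
}

impl Receipt {
    /// Lite-mode receipt: every field populated except `svg_clip`.
    /// Signature is an assumption made for this sketch.
    pub fn lite(
        pdf_fingerprint: &str,
        page_index: usize,
        bbox: [f64; 4],
        content_hash: &str,
    ) -> Self {
        Receipt {
            pdf_fingerprint: pdf_fingerprint.to_string(),
            page_index,
            bbox,
            content_hash: content_hash.to_string(),
            extraction_version: extraction_version(),
            svg_clip: None,
        }
    }

    /// SVG-mode receipt: a lite receipt plus the clip.
    /// Signature is an assumption made for this sketch.
    pub fn with_svg(
        pdf_fingerprint: &str,
        page_index: usize,
        bbox: [f64; 4],
        content_hash: &str,
        svg_clip: String,
    ) -> Self {
        Receipt {
            svg_clip: Some(svg_clip),
            ..Self::lite(pdf_fingerprint, page_index, bbox, content_hash)
        }
    }
}
```

Building `with_svg` on top of `lite` keeps the two modes identical in every field except `svg_clip`, so a lite receipt and an SVG receipt for the same span differ only by that one key once serialized.
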
## Test Results

All 13 receipt tests and 8 schema tests pass:

```
receipts::tests::test_receipt_lite_creates_valid_receipt ... ok
receipts::tests::test_receipt_lite_serializes_without_svg_clip ... ok
receipts::tests::test_content_hash_format ... ok
receipts::tests::test_content_hash_roundtrip ... ok
receipts::tests::test_content_hash_nfc_normalization ... ok
receipts::tests::test_content_hash_different_strings ... ok
receipts::tests::test_content_hash_empty_string ... ok
receipts::tests::test_content_hash_unicode ... ok
receipts::tests::test_receipt_size_estimate ... ok
receipts::tests::test_receipt_with_svg_includes_svg_clip ... ok
receipts::lite::tests::test_lite_create ... ok
receipts::lite::tests::test_lite_size_benchmark ... ok
receipts::lite::tests::test_lite_no_svg_in_json ... ok

schema::tests::test_span_json_serialization ... ok
schema::tests::test_span_json_with_confidence ... ok
schema::tests::test_span_json_with_receipt ... ok
schema::tests::test_block_json_serialization ... ok
schema::tests::test_block_json_heading_with_level ... ok
schema::tests::test_block_json_with_receipt ... ok
schema::tests::test_receipt_not_in_json_when_none ... ok
schema::tests::test_schema_stability ... ok
```

## References

- Plan: Phase 6.8 Visual Citation Receipts (lines 2351-2417)
- INV-3: Deterministic Unicode resolution
- Phase 1.7: PDF fingerprint format
- Phase 6.1: SpanJson and BlockJson schemas