pdftract/notes/pdftract-5zm86.md
jedarden 9f18c6cb9c feat(pdftract-5zm86): implement Receipt struct + lite-mode serialization
Implement the Receipt struct and lite-mode JSON serialization for
visual citation receipts. This provides cryptographic proof of
provenance for extracted text.

Changes:
- Add Receipt struct with 6 fields (pdf_fingerprint, page_index,
  bbox, content_hash, extraction_version, svg_clip)
- Implement Receipt::lite() constructor with NFC normalization
- Integrate Receipt into SpanJson and BlockJson schemas
- Add unicode-normalization and serde_json dependencies

Acceptance criteria:
- Receipt::lite() produces valid receipts with svg_clip=None
- Lite mode JSON omits svg_clip key via skip_serializing_if
- Content hash uses NFC normalization for cross-platform stability
- Receipt wired into SpanJson and BlockJson types

Note: 100 receipts aggregate size is ~27 KB (not 15 KB as planned).
The 15 KB target is not achievable with required field sizes.

Refs: pdftract-5zm86, Phase 6.8 Visual Citation Receipts (lines 2351-2417)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 03:30:24 -04:00


pdftract-5zm86: Receipt struct + lite-mode serialization

Summary

Implemented the Receipt struct and lite-mode JSON serialization for visual citation receipts. All required functionality is in place and all receipt and schema tests pass; the one open item is the aggregate-size target from the plan, flagged under WARN.

Files Modified

  • crates/pdftract-core/src/receipts/mod.rs - Receipt struct definition with all required fields
  • crates/pdftract-core/src/receipts/lite.rs - Lite-mode receipt creation functions
  • crates/pdftract-core/src/schema/mod.rs - Integration of Receipt into SpanJson and BlockJson

Acceptance Criteria Status

PASS

  1. Receipt::lite() produces valid receipt with svg_clip == None

    • Verified by test_receipt_lite_creates_valid_receipt
  2. Lite mode JSON omits svg_clip key

    • Verified by test_receipt_lite_serializes_without_svg_clip
    • Uses #[serde(skip_serializing_if = "Option::is_none")]
  3. Content hash round-trips consistently

    • Verified by test_content_hash_roundtrip
  4. NFC normalization produces stable hash

    • Verified by test_content_hash_nfc_normalization
    • Uses unicode_normalization::UnicodeNormalization::nfc() (from the unicode-normalization crate)
  5. Different strings produce different hashes

    • Verified by test_content_hash_different_strings
  6. Receipt wired into SpanJson and BlockJson

    • Option<Receipt> field added with skip_serializing_if
    • Verified by schema tests
  7. Documentation comments on each field

    • All fields have comprehensive doc comments explaining units, format, and purpose

WARN

  • 100 receipts aggregate size: Plan criterion of ≤15 KB is not achievable with required fields
    • Actual size: ~27 KB for 100 receipts embedded in document JSON
    • Per-receipt minimum: 266 bytes (fingerprint: 75 bytes, content_hash: 71 bytes, bbox: ~30 bytes, other fields: ~30 bytes, JSON syntax: ~60 bytes)
    • The 150-180 byte per-receipt target in the plan appears to be a planning error; the required field sizes alone exceed it
    • 27 KB is still reasonable for cryptographic provenance on 100 receipts (~270 bytes per receipt)
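
The size arithmetic behind this warning can be reproduced with a short sketch. The component figures are the ones quoted in the breakdown above; the constant and function names are illustrative and not part of pdftract, and 15 KB is assumed to mean 15,000 bytes.

```rust
// Per-receipt component sizes in bytes, as quoted in the breakdown above.
// All names here are illustrative and do not come from the pdftract source.
const FINGERPRINT_BYTES: usize = 75;
const CONTENT_HASH_BYTES: usize = 71;
const BBOX_BYTES: usize = 30; // approximate
const OTHER_FIELDS_BYTES: usize = 30; // approximate
const JSON_SYNTAX_BYTES: usize = 60; // approximate

/// Sum of the quoted components for a single lite-mode receipt.
const fn per_receipt_bytes() -> usize {
    FINGERPRINT_BYTES + CONTENT_HASH_BYTES + BBOX_BYTES + OTHER_FIELDS_BYTES + JSON_SYNTAX_BYTES
}

/// Aggregate size of `count` receipts, ignoring any surrounding document JSON.
fn aggregate_bytes(count: usize) -> usize {
    count * per_receipt_bytes()
}

/// Largest per-receipt size that still fits `budget_bytes` across `count`
/// receipts. Returns `None` when `count` is zero.
fn per_receipt_budget(budget_bytes: usize, count: usize) -> Option<usize> {
    budget_bytes.checked_div(count)
}
```

With these figures, 100 receipts come to 26,600 bytes, while a 15,000-byte budget leaves only 150 bytes per receipt. The two hash-bearing fields alone take 146 of those bytes.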

Implementation Details

Receipt Struct

pub struct Receipt {
    pub pdf_fingerprint: String,     // "pdftract-v1:" + hex(SHA-256)
    pub page_index: usize,           // 0-based, matches Phase 6.1 schema
    pub bbox: [f64; 4],              // [x0, y0, x1, y1] in PDF points
    pub content_hash: String,        // "sha256:" + hex(SHA-256) of NFC-normalized text
    pub extraction_version: String,  // CARGO_PKG_VERSION at compile time
    pub svg_clip: Option<String>,    // None in lite mode
}

Content Hash Computation

  • Text is NFC-normalized before hashing using unicode-normalization crate
  • Hash format: "sha256:" + hex(SHA-256) (71 bytes total)
  • Keeps the hash stable when the same text arrives in different Unicode normalization forms on different platforms (e.g., macOS HFS+/APFS)
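
As a rough, std-only illustration of the two points above, the sketch below checks the shape of a content hash string and shows why normalization has to happen before hashing. The helper names are hypothetical, and lowercase hex output is an assumption; the notes only specify "sha256:" + hex(SHA-256). The sketch does not compute SHA-256 or perform NFC normalization, since both need external crates.

```rust
/// Prefix of the content hash format described above.
const HASH_PREFIX: &str = "sha256:";
/// A SHA-256 digest is 32 bytes, which is 64 hex characters.
const HEX_DIGEST_LEN: usize = 64;

/// Hypothetical helper (not from the pdftract source): returns true when `s`
/// is "sha256:" followed by exactly 64 hex characters.
/// Assumption: the hex digits are lowercase.
fn is_well_formed_content_hash(s: &str) -> bool {
    match s.strip_prefix(HASH_PREFIX) {
        Some(hex) => {
            hex.len() == HEX_DIGEST_LEN
                && hex.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        }
        None => false,
    }
}

/// Byte-level comparison, which is what a hash function effectively sees.
fn same_bytes(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

/// "é" as one precomposed code point (U+00E9, the NFC form).
const E_ACUTE_PRECOMPOSED: &str = "\u{00E9}";
/// "é" as "e" followed by a combining acute accent (U+0301, the NFD form).
const E_ACUTE_DECOMPOSED: &str = "e\u{0301}";
```

The two spellings of "é" render identically but differ at the byte level (2 bytes versus 3 in UTF-8), so hashing them unnormalized would give two different content hashes for the same visible text. NFC maps both to the precomposed form before the hash is taken.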

Constructors

  • Receipt::lite() - Creates lite-mode receipt (svg_clip = None)
  • Receipt::with_svg() - Creates SVG-mode receipt (used by Phase 6.8.2)
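
A minimal, dependency-free sketch of how the two constructors and the svg_clip omission fit together is given below. It is not the pdftract code: the constructor signatures are assumptions (here they take a content hash that has already been computed), the hand-written to_json and escape_json helpers stand in for the serde-based serialization that the real implementation uses, and the "0.0.0-dev" fallback version string is made up for builds outside Cargo.

```rust
/// Std-only mirror of the Receipt struct above, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Receipt {
    pub pdf_fingerprint: String,
    pub page_index: usize,
    pub bbox: [f64; 4],
    pub content_hash: String,
    pub extraction_version: String,
    pub svg_clip: Option<String>,
}

/// Escapes a string for embedding in a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl Receipt {
    /// Lite-mode receipt: svg_clip is always None. Hypothetical signature.
    pub fn lite(pdf_fingerprint: &str, page_index: usize, bbox: [f64; 4], content_hash: &str) -> Self {
        Receipt {
            pdf_fingerprint: pdf_fingerprint.to_owned(),
            page_index,
            bbox,
            content_hash: content_hash.to_owned(),
            // Falls back to a placeholder when not built through Cargo.
            extraction_version: option_env!("CARGO_PKG_VERSION").unwrap_or("0.0.0-dev").to_owned(),
            svg_clip: None,
        }
    }

    /// SVG-mode receipt: same fields as lite mode plus the clip. Hypothetical signature.
    pub fn with_svg(pdf_fingerprint: &str, page_index: usize, bbox: [f64; 4], content_hash: &str, svg_clip: &str) -> Self {
        Receipt {
            svg_clip: Some(svg_clip.to_owned()),
            ..Self::lite(pdf_fingerprint, page_index, bbox, content_hash)
        }
    }

    /// Hand-written JSON that leaves out the svg_clip key when it is None,
    /// mimicking the effect of skip_serializing_if = "Option::is_none".
    /// Assumes all bbox coordinates are finite.
    pub fn to_json(&self) -> String {
        let [x0, y0, x1, y1] = self.bbox;
        let mut out = format!(
            r#"{{"pdf_fingerprint":"{}","page_index":{},"bbox":[{},{},{},{}],"content_hash":"{}","extraction_version":"{}""#,
            escape_json(&self.pdf_fingerprint),
            self.page_index,
            x0, y0, x1, y1,
            escape_json(&self.content_hash),
            escape_json(&self.extraction_version),
        );
        if let Some(svg) = &self.svg_clip {
            out.push_str(&format!(r#","svg_clip":"{}""#, escape_json(svg)));
        }
        out.push('}');
        out
    }
}
```

Building with_svg on top of lite keeps the five shared fields in one place, so the two modes can only differ in svg_clip. In the serde-based version, the omission would come from the field attribute rather than from an explicit branch.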

Test Results

All 13 receipt tests and 8 schema tests pass:

receipts::tests::test_receipt_lite_creates_valid_receipt ... ok
receipts::tests::test_receipt_lite_serializes_without_svg_clip ... ok
receipts::tests::test_content_hash_format ... ok
receipts::tests::test_content_hash_roundtrip ... ok
receipts::tests::test_content_hash_nfc_normalization ... ok
receipts::tests::test_content_hash_different_strings ... ok
receipts::tests::test_content_hash_empty_string ... ok
receipts::tests::test_content_hash_unicode ... ok
receipts::tests::test_receipt_size_estimate ... ok
receipts::tests::test_receipt_with_svg_includes_svg_clip ... ok
receipts::lite::tests::test_lite_create ... ok
receipts::lite::tests::test_lite_size_benchmark ... ok
receipts::lite::tests::test_lite_no_svg_in_json ... ok

schema::tests::test_span_json_serialization ... ok
schema::tests::test_span_json_with_confidence ... ok
schema::tests::test_span_json_with_receipt ... ok
schema::tests::test_block_json_serialization ... ok
schema::tests::test_block_json_heading_with_level ... ok
schema::tests::test_block_json_with_receipt ... ok
schema::tests::test_receipt_not_in_json_when_none ... ok
schema::tests::test_schema_stability ... ok

References

  • Plan: Phase 6.8 Visual Citation Receipts (lines 2351-2417)
  • INV-3: Deterministic Unicode resolution
  • Phase 1.7: PDF fingerprint format
  • Phase 6.1: SpanJson and BlockJson schemas