pdftract/notes/pdftract-22vzm.md
jedarden 492a2944ae docs(pdftract-22vzm): Add Phase 1.7 coordinator close summary
All 4 child beads closed:
- pdftract-q15sh: Fingerprint algorithm implementation
- pdftract-154mz: Per-page input canonicalization
- pdftract-3954u: pdftract hash CLI subcommand
- pdftract-ef6xz: Fingerprint reproducibility test corpus

Acceptance criteria verified:
- 73 fingerprint tests PASS (all 5 critical tests covered)
- INV-3 (100-invocation reproducibility): PASS
- INV-13 (version prefix format): PASS
- Algorithm documented: crates/pdftract-core/src/fingerprint/algorithm.md
- CLI functional: pdftract hash outputs pdftract-v1:<hex>

Closes pdftract-22vzm
2026-06-03 00:07:38 -04:00


Phase 1.7: PDF Structural Fingerprint — Coordinator Close Summary

Overview

Phase 1.7 (PDF Structural Fingerprint) is complete. All 4 child beads have been closed, implementing a reproducible 256-bit content hash that identifies the semantic content of a PDF independent of metadata churn, byte ordering, and producer-tool re-saves.

Child Beads Closed

  1. pdftract-q15sh — Fingerprint algorithm implementation (Merkle SHA-256 over canonicalized inputs)

    • Location: crates/pdftract-core/src/fingerprint/mod.rs
    • See: notes/pdftract-q15sh.md
  2. pdftract-154mz — Per-page input canonicalization (geometry rounding, content stream normalization)

    • Location: crates/pdftract-core/src/fingerprint/canonicalize.rs
    • See: notes/pdftract-154mz.md
  3. pdftract-3954u — pdftract hash CLI subcommand with exit codes 0/2/3/4/5/6

    • Location: crates/pdftract-cli/src/main.rs
    • See: notes/pdftract-3954u.md
  4. pdftract-ef6xz — Fingerprint reproducibility test corpus

    • Location: tests/fingerprint/fixtures/
    • See: notes/pdftract-ef6xz.md

Acceptance Criteria Verification

All 4 child beads closed

All child beads show Status: closed in bf show.

Algorithm v1 stable and documented

Documentation exists at crates/pdftract-core/src/fingerprint/algorithm.md (5.1 KB). Algorithm version prefix is pdftract-v1: (INV-13 compliant).

All 5 Critical tests pass (73 total fingerprint tests)

Test run on 2025-06-03:

cargo nextest run -j 4 --package pdftract-core fingerprint
Summary: 73 tests run: 73 passed, 3083 skipped

Critical tests verified:

  • test_fixture_acrobat_resave — Acrobat re-saves produce identical fingerprint
  • test_fixture_pdftk_resave — pdftk re-saves produce identical fingerprint
  • test_fixture_metadata_only — CreationDate-only changes preserve fingerprint
  • test_fixture_content_edit_one_glyph — Content edits change fingerprint
  • test_fixture_linearization_toggle — Linearization toggle preserves fingerprint (KU-7)

INV-3 (byte-stable across runs)

Test: test_inv3_reproducibility_100_invocations
Verifies that 100 invocations on the same PDF produce identical fingerprint output.
Result: PASS (0.011s runtime)
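
A minimal sketch of the kind of check INV-3 describes, assuming the fingerprint call can be wrapped in a closure that returns the emitted string. The helper name `check_reproducible` is hypothetical and is not taken from the pdftract test suite.

```rust
/// Calls `fingerprint` `runs` times (at least once) and compares every
/// output with the first one.
///
/// Returns `Ok(first_output)` when all runs agree, or `Err(i)` with the
/// zero-based index of the first run whose output differed.
/// Hypothetical helper for illustration only.
fn check_reproducible<F>(runs: usize, mut fingerprint: F) -> Result<String, usize>
where
    F: FnMut() -> String,
{
    let first = fingerprint();
    for i in 1..runs {
        if fingerprint() != first {
            return Err(i);
        }
    }
    Ok(first)
}
```

Comparing full output strings, rather than parsed digests, also catches differences in the prefix or trailing bytes.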

INV-13 (version prefix on every emission)

Tests: test_inv13_fingerprint_format, test_inv13_multiple_outputs_match_format
Verifies that output matches the regex ^pdftract-v1:[0-9a-f]{64}$
Result: PASS
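
The format rule can also be checked without a regex engine. The sketch below is a std-only equivalent of the pattern above. The function name `is_valid_fingerprint` is an assumption for illustration, and the caller is expected to strip the trailing newline first.

```rust
/// Version prefix quoted in the note (INV-13).
const PREFIX: &str = "pdftract-v1:";

/// Returns true when `line` consists of the version prefix followed by
/// exactly 64 lowercase hexadecimal characters and nothing else.
/// Hypothetical helper; the project's own tests may use a regex instead.
fn is_valid_fingerprint(line: &str) -> bool {
    match line.strip_prefix(PREFIX) {
        Some(hex) => {
            hex.len() == 64
                && hex.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        }
        None => false,
    }
}
```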

CLI command functional

$ ./target/release/pdftract hash tests/fingerprint/fixtures/acrobat_resave/v1.pdf
pdftract-v1:ab24a95f44ceca5d2aed4b6d056adddd8539f44c6cd6ca506534e830c82ea8a8

Implementation Summary

Algorithm (Merkle-style SHA-256)

Inputs in deterministic order:

  1. Page count (u32, big-endian)
  2. Per page (in page_index order):
     a. SHA-256 of decoded, token-normalized content streams
     b. SHA-256 of resolved resource dict (sorted by namespace/key)
     c. Page geometry: MediaBox, CropBox, Rotate (4-decimal fixed-point)
  3. Structure tree SHA-256 (if tagged; all-zero hash otherwise)
  4. Catalog feature flag byte (encrypted, javascript, XFA, OCG bits)
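
The input ordering above can be sketched as a byte layout. This is an illustration under stated assumptions, not the implementation in crates/pdftract-core/src/fingerprint/mod.rs: the struct and function names, the flag bit positions, the i64 big-endian geometry encoding, and the flat concatenation are all hypothetical. The note does not say whether each page is hashed into its own leaf first. The final SHA-256 over the assembled bytes is left out because the Rust standard library has no SHA-256; a crate such as `sha2` could supply it.

```rust
/// One page's contribution. Field names are hypothetical.
struct PageInput {
    content_sha256: [u8; 32],   // digest of decoded, token-normalized streams
    resources_sha256: [u8; 32], // digest of the resolved, sorted resource dict
    media_box: [f64; 4],
    crop_box: [f64; 4],
    rotate: i32,
}

// Assumed bit assignment for the catalog feature flag byte.
const FLAG_ENCRYPTED: u8 = 1 << 0;
const FLAG_JAVASCRIPT: u8 = 1 << 1;
const FLAG_XFA: u8 = 1 << 2;
const FLAG_OCG: u8 = 1 << 3;

/// Quantizes a coordinate to 4-decimal fixed point so that sub-0.0001
/// jitter from producer re-saves maps to the same integer.
/// Assumption: round to nearest, ties away from zero (f64::round).
fn fixed4(v: f64) -> i64 {
    (v * 10_000.0).round() as i64
}

/// Concatenates the inputs in the order listed in the note.
/// The caller would hash the returned bytes with SHA-256.
fn root_preimage(pages: &[PageInput], structure_tree: Option<[u8; 32]>, flags: u8) -> Vec<u8> {
    let mut out = Vec::new();
    // 1. Page count as u32, big-endian.
    out.extend_from_slice(&(pages.len() as u32).to_be_bytes());
    // 2. Per-page inputs, in page order.
    for p in pages {
        out.extend_from_slice(&p.content_sha256);
        out.extend_from_slice(&p.resources_sha256);
        for v in p.media_box.iter().chain(p.crop_box.iter()) {
            out.extend_from_slice(&fixed4(*v).to_be_bytes());
        }
        out.extend_from_slice(&p.rotate.to_be_bytes());
    }
    // 3. Structure tree digest, or 32 zero bytes for untagged PDFs.
    out.extend_from_slice(&structure_tree.unwrap_or([0u8; 32]));
    // 4. Catalog feature flag byte.
    out.push(flags);
    out
}
```

Fixing the width and byte order of every field keeps the pre-image unambiguous, which is what allows two structurally identical PDFs to hash to the same value regardless of how they were serialized.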

Deliberately excluded (per ADR-008):

  • /Producer, /Creator, /CreationDate, /ModDate, /Author, /Title, /Subject, /Keywords
  • /ID array, XMP /Metadata stream
  • xref byte layout, object number assignment
  • Inline whitespace (lexer-normalized before hashing)

CLI Interface

  • Command: pdftract hash <INPUT> (file path or URL)
  • Output: pdftract-v1:<64-hex-chars>\n
  • Exit codes: 0 success, 2 corrupt, 3 encrypted-no-password, 4 cannot read, 5 network failure, 6 TLS handshake failure
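
For callers that script against the CLI, the exit codes above can be modelled as an enum. The numeric values come from this note; the type and variant names are hypothetical and may not match the identifiers in crates/pdftract-cli/src/main.rs.

```rust
/// Exit codes of `pdftract hash` as listed in the note.
/// Variant names are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashExit {
    Success = 0,
    Corrupt = 2,
    EncryptedNoPassword = 3,
    CannotRead = 4,
    NetworkFailure = 5,
    TlsHandshakeFailure = 6,
}

impl HashExit {
    /// Maps a raw process exit code to a known outcome.
    /// Returns None for codes the note does not list (including 1).
    fn from_code(code: i32) -> Option<Self> {
        match code {
            0 => Some(Self::Success),
            2 => Some(Self::Corrupt),
            3 => Some(Self::EncryptedNoPassword),
            4 => Some(Self::CannotRead),
            5 => Some(Self::NetworkFailure),
            6 => Some(Self::TlsHandshakeFailure),
            _ => None,
        }
    }

    /// Returns the numeric exit code for this outcome.
    fn code(self) -> i32 {
        self as i32
    }
}
```

A wrapper script could pass the value of `std::process::ExitStatus::code()` to `from_code` and treat `None` as an unexpected failure.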

Downstream Dependencies

This fingerprint is now available for:

  • Phase 6.8 receipts — binding identity for receipt verification
  • Phase 6.9 cache — content-addressed cache key for extraction results
  • Users — identify PDFs across re-saves via pdftract hash

References

  • Plan: Phase 1.7 PDF Structural Fingerprint (lines 1182-1219)
  • ADR-008 (fingerprint excludes metadata)
  • INV-3, INV-13 (byte-stability, version prefix)
  • KU-7 (linearization toggle test)

Git Commits

This coordinator bead did not introduce new code commits; its child beads each produced their own commits. See child bead verification notes for individual commit details.