diff --git a/notes/pdftract-22vzm.md b/notes/pdftract-22vzm.md
new file mode 100644
index 0000000..ebd2c07
--- /dev/null
+++ b/notes/pdftract-22vzm.md
@@ -0,0 +1,101 @@
+# Phase 1.7: PDF Structural Fingerprint — Coordinator Close Summary
+
+## Overview
+Phase 1.7 (PDF Structural Fingerprint) is complete. All 4 child beads have been closed, implementing a reproducible 256-bit content hash that identifies the semantic content of a PDF independent of metadata churn, byte ordering, and producer-tool re-saves.
+
+## Child Beads Closed
+
+1. **pdftract-q15sh** — Fingerprint algorithm implementation (Merkle SHA-256 over canonicalized inputs)
+   - Location: `crates/pdftract-core/src/fingerprint/mod.rs`
+   - See: `notes/pdftract-q15sh.md`
+
+2. **pdftract-154mz** — Per-page input canonicalization (geometry rounding, content stream normalization)
+   - Location: `crates/pdftract-core/src/fingerprint/canonicalize.rs`
+   - See: `notes/pdftract-154mz.md`
+
+3. **pdftract-3954u** — `pdftract hash` CLI subcommand with exit codes 0/2/3/4/5/6
+   - Location: `crates/pdftract-cli/src/main.rs`
+   - See: `notes/pdftract-3954u.md`
+
+4. **pdftract-ef6xz** — Fingerprint reproducibility test corpus
+   - Location: `tests/fingerprint/fixtures/`
+   - See: `notes/pdftract-ef6xz.md`
+
+## Acceptance Criteria Verification
+
+### ✅ All 4 child beads closed
+All child beads show Status: closed in `bf show`.
+
+### ✅ Algorithm v1 stable and documented
+Documentation exists at `crates/pdftract-core/src/fingerprint/algorithm.md` (5.1 KB).
+Algorithm version prefix is `pdftract-v1:` (INV-13 compliant).
+
+### ✅ All 5 Critical tests pass (73 total fingerprint tests)
+Test run on 2025-06-03:
+```
+cargo nextest run -j 4 --package pdftract-core fingerprint
+Summary: 73 tests run: 73 passed, 3083 skipped
+```
+
+Critical tests verified:
+- `test_fixture_acrobat_resave` — Acrobat re-saves produce identical fingerprint ✅
+- `test_fixture_pdftk_resave` — pdftk re-saves produce identical fingerprint ✅
+- `test_fixture_metadata_only` — CreationDate-only changes preserve fingerprint ✅
+- `test_fixture_content_edit_one_glyph` — Content edits change fingerprint ✅
+- `test_fixture_linearization_toggle` — Linearization toggle preserves fingerprint (KU-7) ✅
+
+### ✅ INV-3 (byte-stable across runs)
+Test: `test_inv3_reproducibility_100_invocations`
+Verifies 100 invocations on the same PDF produce identical fingerprint output.
+Result: PASS (0.011s runtime)
+
+### ✅ INV-13 (version prefix on every emission)
+Tests: `test_inv13_fingerprint_format`, `test_inv13_multiple_outputs_match_format`
+Verifies output matches regex `^pdftract-v1:[0-9a-f]{64}$`
+Result: PASS
+
+### ✅ CLI command functional
+```bash
+$ ./target/release/pdftract hash tests/fingerprint/fixtures/acrobat_resave/v1.pdf
+pdftract-v1:ab24a95f44ceca5d2aed4b6d056adddd8539f44c6cd6ca506534e830c82ea8a8
+```
+
+## Implementation Summary
+
+### Algorithm (Merkle-style SHA-256)
+Inputs in deterministic order:
+1. Page count (u32, big-endian)
+2. Per page (in page_index order):
+   a. SHA-256 of decoded, token-normalized content streams
+   b. SHA-256 of resolved resource dict (sorted by namespace/key)
+   c. Page geometry: MediaBox, CropBox, Rotate (4-decimal fixed-point)
+3. Structure tree SHA-256 (if tagged; all-zero hash otherwise)
+4. Catalog feature flag byte (encrypted, javascript, XFA, OCG bits)
+
+Deliberately excluded (per ADR-008):
+- `/Producer`, `/Creator`, `/CreationDate`, `/ModDate`, `/Author`, `/Title`, `/Subject`, `/Keywords`
+- `/ID` array, XMP `/Metadata` stream
+- xref byte layout, object number assignment
+- Inline whitespace (lexer-normalized before hashing)
+
+### CLI Interface
+- Command: `pdftract hash ` (file path or URL)
+- Output: `pdftract-v1:<64-hex-chars>\n`
+- Exit codes: 0 success, 2 corrupt, 3 encrypted-no-password, 4 cannot read, 5 network failure, 6 TLS handshake failure
+
+## Downstream Dependencies
+
+This fingerprint is now available for:
+- **Phase 6.8 receipts** — binding identity for receipt verification
+- **Phase 6.9 cache** — content-addressed cache key for extraction results
+- **Users** — identify PDFs across re-saves via `pdftract hash`
+
+## Related Plan References
+- Plan: Phase 1.7 PDF Structural Fingerprint (lines 1182-1219)
+- ADR-008 (fingerprint excludes metadata)
+- INV-3, INV-13 (byte-stability, version prefix)
+- KU-7 (linearization toggle test)
+
+## Git Commits
+This coordinator bead did not introduce new code commits; its child beads each produced their own commits.
+See child bead verification notes for individual commit details.
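
Reviewer note on the "Algorithm (Merkle-style SHA-256)" input ordering in the summary above: the Rust sketch below shows one way that ordering could be serialized into a byte string before hashing, and how a 32-byte digest could be rendered in the `pdftract-v1:` form. It is not the code in `crates/pdftract-core/src/fingerprint/mod.rs`. The names `PageInputs`, `fixed4`, `fingerprint_preimage`, and `format_fingerprint` are hypothetical. The `i64` big-endian encoding of the rounded geometry, the `i32` encoding of Rotate, and the flat concatenation are assumptions made for illustration. The SHA-256 step itself is left out because the standard library does not provide it.

```rust
/// Hypothetical per-page inputs; field names are illustrative only.
pub struct PageInputs {
    /// SHA-256 of the decoded, token-normalized content streams.
    pub content_hash: [u8; 32],
    /// SHA-256 of the resolved resource dict.
    pub resources_hash: [u8; 32],
    /// MediaBox as [llx, lly, urx, ury].
    pub media_box: [f64; 4],
    /// CropBox as [llx, lly, urx, ury].
    pub crop_box: [f64; 4],
    /// Page rotation in degrees.
    pub rotate: i32,
}

/// Round to 4 decimal places and keep the result as a scaled integer.
/// Assumption: the summary only says "4-decimal fixed-point".
fn fixed4(value: f64) -> i64 {
    (value * 10_000.0).round() as i64
}

/// Concatenate the inputs in the order listed in the summary.
/// Assumption: a flat concatenation; the real tree shape may differ.
pub fn fingerprint_preimage(
    pages: &[PageInputs],
    structure_tree_hash: Option<[u8; 32]>,
    catalog_flags: u8,
) -> Vec<u8> {
    let mut buf = Vec::new();
    // 1. Page count (u32, big-endian).
    buf.extend_from_slice(&(pages.len() as u32).to_be_bytes());
    // 2. Per page, in page index order.
    for page in pages {
        buf.extend_from_slice(&page.content_hash);
        buf.extend_from_slice(&page.resources_hash);
        for v in page.media_box.iter().chain(page.crop_box.iter()) {
            buf.extend_from_slice(&fixed4(*v).to_be_bytes());
        }
        // Assumption: Rotate is stored as a big-endian i32.
        buf.extend_from_slice(&page.rotate.to_be_bytes());
    }
    // 3. Structure tree hash, or all zeros when the PDF is untagged.
    buf.extend_from_slice(&structure_tree_hash.unwrap_or([0u8; 32]));
    // 4. Catalog feature flag byte.
    buf.push(catalog_flags);
    buf
}

/// Render a 32-byte digest as `pdftract-v1:` followed by 64 lowercase hex chars.
pub fn format_fingerprint(digest: &[u8; 32]) -> String {
    let mut out = String::from("pdftract-v1:");
    for byte in digest {
        out.push_str(&format!("{:02x}", byte));
    }
    out
}
```

Under these assumptions, hashing the output of `fingerprint_preimage` with any SHA-256 implementation and passing the digest to `format_fingerprint` would give a string that matches the INV-13 regex. Rounding the geometry before serialization is what would let re-saves with slightly different float formatting map to the same bytes.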