docs(pdftract-22vzm): Add Phase 1.7 coordinator close summary

All 4 child beads closed:
- pdftract-q15sh: Fingerprint algorithm implementation
- pdftract-154mz: Per-page input canonicalization
- pdftract-3954u: pdftract hash CLI subcommand
- pdftract-ef6xz: Fingerprint reproducibility test corpus

Acceptance criteria verified:
- 73 fingerprint tests PASS (all 5 critical tests covered)
- INV-3 (100-invocation reproducibility): PASS
- INV-13 (version prefix format): PASS
- Algorithm documented: crates/pdftract-core/src/fingerprint/algorithm.md
- CLI functional: pdftract hash outputs pdftract-v1:<hex>

Closes pdftract-22vzm
jedarden 2026-06-03 00:07:38 -04:00
parent e10919018c
commit 492a2944ae

notes/pdftract-22vzm.md Normal file

@ -0,0 +1,101 @@
# Phase 1.7: PDF Structural Fingerprint — Coordinator Close Summary
## Overview
Phase 1.7 (PDF Structural Fingerprint) is complete. All 4 child beads are closed. Together they implement a reproducible 256-bit content hash that identifies the semantic content of a PDF independent of metadata churn, byte layout, and producer-tool re-saves.
## Child Beads Closed
1. **pdftract-q15sh** — Fingerprint algorithm implementation (Merkle SHA-256 over canonicalized inputs)
   - Location: `crates/pdftract-core/src/fingerprint/mod.rs`
   - See: `notes/pdftract-q15sh.md`
2. **pdftract-154mz** — Per-page input canonicalization (geometry rounding, content stream normalization)
   - Location: `crates/pdftract-core/src/fingerprint/canonicalize.rs`
   - See: `notes/pdftract-154mz.md`
3. **pdftract-3954u** — `pdftract hash` CLI subcommand with exit codes 0/2/3/4/5/6
   - Location: `crates/pdftract-cli/src/main.rs`
   - See: `notes/pdftract-3954u.md`
4. **pdftract-ef6xz** — Fingerprint reproducibility test corpus
   - Location: `tests/fingerprint/fixtures/`
   - See: `notes/pdftract-ef6xz.md`
## Acceptance Criteria Verification
### ✅ All 4 child beads closed
All child beads report `Status: closed` in `bf show`.
### ✅ Algorithm v1 stable and documented
Documentation exists at `crates/pdftract-core/src/fingerprint/algorithm.md` (5.1 KB).
Algorithm version prefix is `pdftract-v1:` (INV-13 compliant).
### ✅ All 5 Critical tests pass (73 total fingerprint tests)
Test run on 2026-06-03:
```
cargo nextest run -j 4 --package pdftract-core fingerprint
Summary: 73 tests run: 73 passed, 3083 skipped
```
Critical tests verified:
- `test_fixture_acrobat_resave` — Acrobat re-saves produce identical fingerprint ✅
- `test_fixture_pdftk_resave` — pdftk re-saves produce identical fingerprint ✅
- `test_fixture_metadata_only` — CreationDate-only changes preserve fingerprint ✅
- `test_fixture_content_edit_one_glyph` — Content edits change fingerprint ✅
- `test_fixture_linearization_toggle` — Linearization toggle preserves fingerprint (KU-7) ✅
### ✅ INV-3 (byte-stable across runs)
Test: `test_inv3_reproducibility_100_invocations`
Verifies 100 invocations on the same PDF produce identical fingerprint output.
Result: PASS (0.011s runtime)
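The shape of such a check can be sketched in a few lines. This is an illustration only, not the body of `test_inv3_reproducibility_100_invocations`: `is_reproducible` is a hypothetical helper, and the fingerprint function is passed in as a closure so the sketch stays independent of the real API.

```rust
/// Hypothetical helper: calls `fingerprint` `runs` times on the same input
/// and reports whether every result equals the first one.
fn is_reproducible<F>(input: &[u8], runs: usize, mut fingerprint: F) -> bool
where
    F: FnMut(&[u8]) -> String,
{
    // The first call sets the reference value.
    let first = fingerprint(input);
    // The remaining `runs - 1` calls must all reproduce it exactly.
    (1..runs).all(|_| fingerprint(input) == first)
}
```

A check of this kind tends to catch hidden state such as timestamps or randomized `HashMap` iteration order leaking into the hashed bytes.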
### ✅ INV-13 (version prefix on every emission)
Tests: `test_inv13_fingerprint_format`, `test_inv13_multiple_outputs_match_format`
Verifies output matches regex `^pdftract-v1:[0-9a-f]{64}$`
Result: PASS
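For consumers that want to validate a fingerprint without pulling in a regex dependency, the same pattern can be checked by hand. The sketch below assumes the input has already had its trailing newline trimmed; `is_valid_fingerprint` is a hypothetical name, not part of the crate's API.

```rust
/// Version prefix taken from the INV-13 pattern.
const PREFIX: &str = "pdftract-v1:";

/// Returns true when `s` matches `^pdftract-v1:[0-9a-f]{64}$`.
/// Hypothetical helper for illustration only.
fn is_valid_fingerprint(s: &str) -> bool {
    match s.strip_prefix(PREFIX) {
        // Exactly 64 lowercase hex digits must follow the prefix.
        Some(hex) => {
            hex.len() == 64
                && hex.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        }
        None => false,
    }
}
```

Uppercase hex digits, a different version prefix, or a digest of the wrong length are all rejected.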
### ✅ CLI command functional
```bash
$ ./target/release/pdftract hash tests/fingerprint/fixtures/acrobat_resave/v1.pdf
pdftract-v1:ab24a95f44ceca5d2aed4b6d056adddd8539f44c6cd6ca506534e830c82ea8a8
```
## Implementation Summary
### Algorithm (Merkle-style SHA-256)
Inputs in deterministic order:
1. Page count (u32, big-endian)
2. Per page (in page_index order):
   a. SHA-256 of decoded, token-normalized content streams
   b. SHA-256 of resolved resource dict (sorted by namespace/key)
   c. Page geometry: MediaBox, CropBox, Rotate (4-decimal fixed-point)
3. Structure tree SHA-256 (if tagged; all-zero hash otherwise)
4. Catalog feature flag byte (encrypted, javascript, XFA, OCG bits)
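The ordering above can be read as a byte layout. The sketch below only assembles the final preimage, that is, the bytes a SHA-256 implementation would then consume; it does no hashing itself, because the Rust standard library has no SHA-256. `PageInputs`, `to_fixed4`, `build_preimage`, the flag bit positions, and the big-endian `i64` encoding of geometry are all assumptions made for illustration. `algorithm.md` remains the authority on the real encoding.

```rust
/// Hypothetical per-page record; field names are illustrative only.
struct PageInputs {
    content_hash: [u8; 32],   // SHA-256 of decoded, token-normalized content streams
    resources_hash: [u8; 32], // SHA-256 of the resolved resource dict
    media_box: [f64; 4],
    crop_box: [f64; 4],
    rotate: i32,
}

/// Converts a coordinate to 4-decimal fixed point.
/// Assumption: ties round away from zero (the behaviour of `f64::round`).
fn to_fixed4(v: f64) -> i64 {
    (v * 10_000.0).round() as i64
}

// Assumed bit positions; the summary names the four flags but not their bits.
const FLAG_ENCRYPTED: u8 = 1 << 0;
const FLAG_JAVASCRIPT: u8 = 1 << 1;
const FLAG_XFA: u8 = 1 << 2;
const FLAG_OCG: u8 = 1 << 3;

/// Serializes the inputs in the listed order: page count, per-page data,
/// structure tree hash (all zeros when untagged), feature flag byte.
fn build_preimage(
    pages: &[PageInputs],
    struct_tree_hash: Option<[u8; 32]>,
    flags: u8,
) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(pages.len() as u32).to_be_bytes());
    for p in pages {
        out.extend_from_slice(&p.content_hash);
        out.extend_from_slice(&p.resources_hash);
        for v in p.media_box.iter().chain(p.crop_box.iter()) {
            out.extend_from_slice(&to_fixed4(*v).to_be_bytes());
        }
        out.extend_from_slice(&p.rotate.to_be_bytes());
    }
    out.extend_from_slice(&struct_tree_hash.unwrap_or([0u8; 32]));
    out.push(flags);
    out
}
```

Fixing the width of every field means no delimiter is needed between pages, and rounding geometry before serialization keeps sub-precision float noise from producer tools out of the hash.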
Deliberately excluded (per ADR-008):
- `/Producer`, `/Creator`, `/CreationDate`, `/ModDate`, `/Author`, `/Title`, `/Subject`, `/Keywords`
- `/ID` array, XMP `/Metadata` stream
- xref byte layout, object number assignment
- Inline whitespace (lexer-normalized before hashing)
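To see why whitespace differences between re-saves need not affect the hash, consider a deliberately naive normalizer. This is not the lexer-based normalization the summary refers to: it splits on the six PDF whitespace bytes and rejoins tokens with a single space, which is only safe for streams without string literals, comments, or inline image data. Both function names are hypothetical.

```rust
/// The six whitespace bytes of PDF syntax: NUL, HT, LF, FF, CR, SP.
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// Naive sketch: collapses every whitespace run to one space and trims the ends.
/// Assumption: the stream contains no strings, comments, or inline images,
/// where whitespace is significant and a real lexer would be required.
fn collapse_whitespace(stream: &[u8]) -> Vec<u8> {
    stream
        .split(|b| is_pdf_whitespace(*b))
        .filter(|tok| !tok.is_empty())
        .collect::<Vec<_>>()
        .join(&b' ')
}
```

Two content streams that differ only in line endings or spacing normalize to the same bytes, so they hash identically.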
### CLI Interface
- Command: `pdftract hash <INPUT>` (file path or URL)
- Output: `pdftract-v1:<64-hex-chars>\n`
- Exit codes: 0 success, 2 corrupt, 3 encrypted-no-password, 4 cannot read, 5 network failure, 6 TLS handshake failure
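A caller that shells out to `pdftract hash` could model the exit codes as an enum. The numeric values come from the list above; the `HashExit` type, its variant names, and its methods are hypothetical and not taken from `crates/pdftract-cli`.

```rust
/// Exit codes listed for `pdftract hash`. Type and variant names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HashExit {
    Success = 0,
    Corrupt = 2,
    EncryptedNoPassword = 3,
    CannotRead = 4,
    NetworkFailure = 5,
    TlsHandshakeFailure = 6,
}

impl HashExit {
    /// Numeric code as seen by the calling process.
    fn code(self) -> i32 {
        self as i32
    }

    /// Maps a raw exit code back to a variant; unlisted codes yield `None`.
    fn from_code(code: i32) -> Option<Self> {
        match code {
            0 => Some(Self::Success),
            2 => Some(Self::Corrupt),
            3 => Some(Self::EncryptedNoPassword),
            4 => Some(Self::CannotRead),
            5 => Some(Self::NetworkFailure),
            6 => Some(Self::TlsHandshakeFailure),
            _ => None,
        }
    }
}
```

The value returned by `std::process::ExitStatus::code()` can be passed to `from_code`, which lets a caller separate retryable failures (5, 6) from input problems (2, 3, 4).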
## Downstream Dependencies
This fingerprint is now available for:
- **Phase 6.8 receipts** — binding identity for receipt verification
- **Phase 6.9 cache** — content-addressed cache key for extraction results
- **Users** — identify PDFs across re-saves via `pdftract hash`
## Related Plan References
- Plan: Phase 1.7 PDF Structural Fingerprint (lines 1182-1219)
- ADR-008 (fingerprint excludes metadata)
- INV-3, INV-13 (byte-stability, version prefix)
- KU-7 (linearization toggle test)
## Git Commits
This coordinator bead did not introduce new code commits; its child beads each produced their own commits.
See child bead verification notes for individual commit details.