pdftract/tests/fingerprint/fixtures
jedarden 928a64ebc9 [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus
All 8 fixture pairs verified present:
- byte_identical/ (MATCH)
- acrobat_resave/ (MATCH)
- qpdf_resave/ (MATCH)
- pdftk_resave/ (MATCH)
- linearization_toggle/ (MATCH - KU-7)
- metadata_only/ (MATCH - ADR-008)
- content_edit_one_glyph/ (DIFFER)
- content_edit_one_paragraph/ (DIFFER)

Test file implements:
- INV-3: 100-invocation reproducibility test
- All 8 fixture pair tests
- INV-13: Format validation
- Cross-platform placeholder (CI integration pending)

All critical tests from Phase 1.7 (plan lines 1232-1237) implemented.

Closes pdftract-ef6xz
Verification: notes/pdftract-ef6xz.md

Refs:
- INV-3, INV-13, KU-7, ADR-008
- Plan Phase 1.7 lines 1214-1219, 1232-1237

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 13:32:26 -04:00
__pycache__ [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
acrobat_resave [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
byte_identical [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
content_edit_one_glyph [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
content_edit_one_paragraph [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
linearization_toggle [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
metadata_only [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
pdftk_resave [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
qpdf_resave [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
.clean_source.pdf [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
check_compression.py fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
check_trailer.py fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
create_fixtures.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
debug_content_streams.py [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
generate_fingerprint_fixtures.py fix(pdftract-25igv): fix emit! macro usage in codespace parser 2026-05-28 07:29:33 -04:00
generate_fingerprint_fixtures_pikepdf.py [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
inspect_fixtures.py feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
README.md fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00

Fingerprint Reproducibility Test Fixtures

This directory contains fixture pairs that verify the fingerprint algorithm's reproducibility and content-sensitivity properties.

Fixture Provenance

All fixtures are generated from a clean source PDF (.clean_source.pdf) created using pikepdf, a Python library for PDF manipulation. The source is a 3-page PDF with Lorem Ipsum text, created with minimal metadata.

Generation

Fixtures are generated using generate_fingerprint_fixtures.py, which requires:

  • Python 3.11+
  • pikepdf library (install via nix-shell or pip)

nix-shell --pure --packages python3 python3Packages.pikepdf --run \
  'python3 tests/fingerprint/fixtures/generate_fingerprint_fixtures.py'

Fixture Pairs

Each fixture pair contains:

  • v1.pdf - Original or first variant
  • v2.pdf - Second variant (modified copy or re-saved version)
  • expected.txt - Either "MATCH" (fingerprints should be identical) or "DIFFER" (fingerprints should differ)
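
As a rough illustration of the layout above, a test harness could load a pair along these lines. This is only a sketch: the helper names load_fixture_pair and discover_pairs are hypothetical, and it assumes expected.txt holds nothing but the verdict word plus optional surrounding whitespace.

```python
from pathlib import Path

VALID_VERDICTS = {"MATCH", "DIFFER"}


def load_fixture_pair(pair_dir):
    """Return (v1_path, v2_path, verdict) for one fixture directory."""
    pair_dir = Path(pair_dir)
    v1, v2 = pair_dir / "v1.pdf", pair_dir / "v2.pdf"
    for pdf in (v1, v2):
        if not pdf.is_file():
            raise FileNotFoundError(f"missing fixture file: {pdf}")
    # Assumption: the file contains only the verdict word.
    verdict = (pair_dir / "expected.txt").read_text(encoding="utf-8").strip()
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"{pair_dir.name}: unexpected verdict {verdict!r}")
    return v1, v2, verdict


def discover_pairs(fixtures_root):
    """List sub-directories that look like fixture pairs, sorted by name."""
    root = Path(fixtures_root)
    return sorted(
        d.name for d in root.iterdir()
        if d.is_dir() and (d / "expected.txt").is_file()
    )
```

Filtering on the presence of expected.txt keeps unrelated directories such as __pycache__ out of the result.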

1. byte_identical

Expected: MATCH

  • Same PDF copied twice (verifies fingerprint determinism)

2. acrobat_resave

Expected: MATCH

  • Simulates an Acrobat re-save using qpdf
  • Changes /CreationDate, /ID, and the xref byte layout
  • Leaves page content unchanged (per ADR-008, metadata-only changes must not affect the fingerprint)

3. pdftk_resave

Expected: MATCH

  • Simulates a pdftk re-save using qpdf
  • Changes object stream layout and compression
  • Content is unchanged, so the fingerprint should be identical

4. qpdf_resave

Expected: MATCH

  • Source PDF vs the same file re-saved by qpdf with --object-streams=preserve --normalize-content=y
  • Verifies that a qpdf re-save produces the same fingerprint

5. linearization_toggle

Expected: MATCH (KU-7)

  • Unlinearized PDF vs qpdf --linearize output
  • Different byte layouts but same content
  • Verifies linearization independence (KU-7 requirement)

6. metadata_only

Expected: MATCH (ADR-008)

  • Original vs copy with changed /Title, /Author, /Producer, /CreationDate
  • Verifies metadata independence per ADR-008

7. content_edit_one_glyph

Expected: DIFFER

  • "Hello World" vs "Hello Worl" (one character removed)
  • Verifies content-sensitivity: removing a single glyph changes fingerprint

8. content_edit_one_paragraph

Expected: DIFFER

  • Original paragraph vs variant with one word changed
  • Verifies content-sensitivity: paragraph edit changes fingerprint
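
The MATCH/DIFFER contract shared by all eight pairs can be sketched with a toy model. Everything here is illustrative: toy_fingerprint is not pdftract's algorithm (it simply hashes page text with SHA-256, which is an assumption), the dictionaries stand in for parsed PDFs, and pair_satisfies is a hypothetical helper.

```python
import hashlib
import json


def toy_fingerprint(doc):
    """Toy stand-in: hash only the page text and ignore metadata."""
    payload = json.dumps(doc["pages"], ensure_ascii=False).encode("utf-8")
    return "pdftract-v1:" + hashlib.sha256(payload).hexdigest()


def pair_satisfies(verdict, fp_v1, fp_v2):
    """Check two fingerprints against the verdict from expected.txt."""
    if verdict == "MATCH":
        return fp_v1 == fp_v2
    if verdict == "DIFFER":
        return fp_v1 != fp_v2
    raise ValueError(f"unknown verdict: {verdict!r}")


# Stand-ins for the metadata_only and content_edit_one_glyph pairs.
original = {"metadata": {"Title": "Draft"}, "pages": ["Hello World"]}
retitled = {"metadata": {"Title": "Final"}, "pages": ["Hello World"]}
one_glyph = {"metadata": {"Title": "Draft"}, "pages": ["Hello Worl"]}

metadata_ok = pair_satisfies(
    "MATCH", toy_fingerprint(original), toy_fingerprint(retitled)
)
glyph_ok = pair_satisfies(
    "DIFFER", toy_fingerprint(original), toy_fingerprint(one_glyph)
)
```

Both checks pass in this toy model because metadata never reaches the hash, while any change to the page text does.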

License

The fixture PDFs are generated using open-source tools (pikepdf, licensed under MPL-2.0, and qpdf, licensed under Apache-2.0) and contain public-domain text (Lorem Ipsum). The fixtures themselves are MIT-licensed.

References

  • ADR-008: Metadata independence
  • KU-7: Linearization independence
  • INV-3: Fingerprint reproducibility (100 invocations produce identical results)
  • INV-13: Fingerprint format (^pdftract-v1:[0-9a-f]{64}$)
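
The two invariants above can be expressed as small checks. This is a sketch under assumptions: the regular expression is the one quoted for INV-13, but is_valid_fingerprint, is_reproducible, and stand_in_fingerprint are hypothetical names, and the SHA-256 stand-in is used only so the example runs without pdftract.

```python
import hashlib
import re

# Pattern quoted for INV-13.
FINGERPRINT_RE = re.compile(r"^pdftract-v1:[0-9a-f]{64}$")


def is_valid_fingerprint(value):
    """INV-13: a fixed prefix followed by 64 lowercase hex digits."""
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    return FINGERPRINT_RE.fullmatch(value) is not None


def is_reproducible(fingerprint, data, runs=100):
    """INV-3: repeated invocations on the same input give one result."""
    return len({fingerprint(data) for _ in range(runs)}) == 1


def stand_in_fingerprint(data):
    """Hypothetical stand-in for the real fingerprint function."""
    return "pdftract-v1:" + hashlib.sha256(data).hexdigest()


sample = stand_in_fingerprint(b"%PDF-1.7 example bytes")
format_ok = is_valid_fingerprint(sample)
repro_ok = is_reproducible(stand_in_fingerprint, b"%PDF-1.7 example bytes")
```

Collecting the results in a set makes the reproducibility check independent of the run count, so the same helper works for 100 invocations or any other number.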