All 8 fixture pairs verified present: - byte_identical/ (MATCH) - acrobat_resave/ (MATCH) - qpdf_resave/ (MATCH) - pdftk_resave/ (MATCH) - linearization_toggle/ (MATCH - KU-7) - metadata_only/ (MATCH - ADR-008) - content_edit_one_glyph/ (DIFFER) - content_edit_one_paragraph/ (DIFFER) Test file implements: - INV-3: 100-invocation reproducibility test - All 8 fixture pair tests - INV-13: Format validation - Cross-platform placeholder (CI integration pending) All critical tests from Phase 1.7 (plan lines 1232-1237) implemented. Closes pdftract-ef6xz Verification: notes/pdftract-ef6xz.md Refs: - INV-3, INV-13, KU-7, ADR-008 - Plan Phase 1.7 lines 1214-1219, 1232-1237 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3.8 KiB
pdftract-ef6xz: Fingerprint Reproducibility Test Corpus
Status: COMPLETE
Summary
All fingerprint reproducibility test infrastructure is in place. All 8 fixture pairs have been verified with correct expected.txt files. All critical tests from Phase 1.7 (plan lines 1232-1237) are implemented.
Fixture Corpus Status
All 8 fixture pairs are verified present under tests/fingerprint/fixtures/:
| Fixture Pair | Expected | Status |
|---|---|---|
byte_identical/ |
MATCH | ✅ Verified |
acrobat_resave/ |
MATCH | ✅ Verified |
qpdf_resave/ |
MATCH | ✅ Verified |
pdftk_resave/ |
MATCH | ✅ Verified |
linearization_toggle/ |
MATCH | ✅ Verified (KU-7) |
metadata_only/ |
MATCH | ✅ Verified (ADR-008) |
content_edit_one_glyph/ |
DIFFER | ✅ Verified |
content_edit_one_paragraph/ |
DIFFER | ✅ Verified |
Each fixture directory contains:
v1.pdf- Original or first variantv2.pdf- Second variant (same file copy or modified)expected.txt- Either "MATCH" or "DIFFER"
Test Implementation
The test file at crates/pdftract-core/tests/fingerprint_reproducibility.rs implements:
1. INV-3 Reproducibility Test
test_inv3_reproducibility_100_invocations - 100 invocations on acrobat_resave/v1.pdf, verifies all outputs are byte-identical.
2. Fixture Pair Tests
All 8 fixture pairs have corresponding tests:
test_fixture_byte_identical- MATCHtest_fixture_acrobat_resave- MATCHtest_fixture_qpdf_resave- MATCHtest_fixture_pdftk_resave- MATCHtest_fixture_linearization_toggle- MATCH (KU-7)test_fixture_metadata_only- MATCH (ADR-008)test_fixture_content_edit_one_glyph- DIFFERtest_fixture_content_edit_one_paragraph- DIFFER
3. INV-13 Format Test
test_inv13_fingerprint_format - Validates all fingerprints match ^pdftract-v1:[0-9a-f]{64}$
4. Cross-Platform Test
Placeholder exists for CI integration (commented out, pending CI infrastructure)
Critical Tests Verification (Plan Section 1.7, lines 1232-1237)
All 5 critical tests are implemented:
| Critical Test | Implementation | Status |
|---|---|---|
| Acrobat + pdftk same fingerprint | test_fixture_acrobat_resave, test_fixture_pdftk_resave |
✅ |
| /CreationDate differing only | test_fixture_metadata_only |
✅ |
| One glyph removed | test_fixture_content_edit_one_glyph |
✅ |
| 10 invocations identical | test_inv3_reproducibility_100_invocations (100x) |
✅ |
| Linearized same as unlinearized | test_fixture_linearization_toggle (KU-7) |
✅ |
Regression Detection Tests
The test infrastructure can detect the following deliberate regressions:
-
Metadata inclusion regression - If
/Producer,/Title, or/CreationDateare accidentally included in the hash, themetadata_onlytest will fail (v1 and v2 should MATCH but would DIFFER). -
Non-deterministic ordering regression - If HashMap is used instead of BTreeMap for resource dict iteration, the 100-invocation repro test would fail.
-
Content-sensitivity regression - If the algorithm degrades to "constant hash" (ignores content), both
content_edit_*tests would fail (should DIFFER but would MATCH).
Fixture Generation
Fixtures are generated from a clean source PDF (.clean_source.pdf) using:
generate_fingerprint_fixtures.py- Main fixture generation scriptpikepdfPython library for PDF manipulationqpdfcommand-line tool for re-save and linearization operations
All fixture PDFs contain public-domain Lorem Ipsum text and are MIT-licensed.
References
- Plan section: Phase 1.7 lines 1214-1219 (acceptance criteria), 1232-1237 (critical tests)
- INV-3: Fingerprint reproducibility
- INV-13: Fingerprint format validation
- KU-7: Linearization independence
- ADR-008: Metadata independence