All 8 fixture pairs verified present:
- byte_identical/ (MATCH)
- acrobat_resave/ (MATCH)
- qpdf_resave/ (MATCH)
- pdftk_resave/ (MATCH)
- linearization_toggle/ (MATCH - KU-7)
- metadata_only/ (MATCH - ADR-008)
- content_edit_one_glyph/ (DIFFER)
- content_edit_one_paragraph/ (DIFFER)

Test file implements:
- INV-3: 100-invocation reproducibility test
- All 8 fixture pair tests
- INV-13: Format validation
- Cross-platform placeholder (CI integration pending)

All critical tests from Phase 1.7 (plan lines 1232-1237) implemented.

Closes pdftract-ef6xz
Verification: notes/pdftract-ef6xz.md

Refs:
- INV-3, INV-13, KU-7, ADR-008
- Plan Phase 1.7 lines 1214-1219, 1232-1237

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

# pdftract-ef6xz: Fingerprint Reproducibility Test Corpus
## Status: COMPLETE
## Summary
All fingerprint reproducibility test infrastructure is in place. All 8 fixture pairs have been verified with correct expected.txt files. All critical tests from Phase 1.7 (plan lines 1232-1237) are implemented.
## Fixture Corpus Status
All 8 fixture pairs are verified present under `tests/fingerprint/fixtures/`:

| Fixture Pair | Expected | Status |
|--------------|----------|--------|
| `byte_identical/` | MATCH | ✅ Verified |
| `acrobat_resave/` | MATCH | ✅ Verified |
| `qpdf_resave/` | MATCH | ✅ Verified |
| `pdftk_resave/` | MATCH | ✅ Verified |
| `linearization_toggle/` | MATCH | ✅ Verified (KU-7) |
| `metadata_only/` | MATCH | ✅ Verified (ADR-008) |
| `content_edit_one_glyph/` | DIFFER | ✅ Verified |
| `content_edit_one_paragraph/` | DIFFER | ✅ Verified |

Each fixture directory contains:
- `v1.pdf` - Original or first variant
- `v2.pdf` - Second variant (an identical copy or a modified version of `v1.pdf`)
- `expected.txt` - Contains either `MATCH` or `DIFFER`

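
The directory layout above lends itself to a single generic checker. The following is a minimal sketch, not the code in `fingerprint_reproducibility.rs`: the names `Expectation`, `parse_expectation`, and `check_fixture_pair` are hypothetical, and the `fingerprint` closure stands in for whatever function the crate actually exposes.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome recorded in a fixture's `expected.txt` (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Expectation {
    Match,
    Differ,
}

/// Parse the contents of `expected.txt`, tolerating surrounding whitespace.
fn parse_expectation(text: &str) -> Option<Expectation> {
    match text.trim() {
        "MATCH" => Some(Expectation::Match),
        "DIFFER" => Some(Expectation::Differ),
        _ => None,
    }
}

/// Returns Ok(true) when the fingerprints of `v1.pdf` and `v2.pdf` relate
/// as `expected.txt` says they should.
/// ASSUMPTION: `fingerprint` is a stand-in for the real fingerprint function.
fn check_fixture_pair<F>(dir: &Path, fingerprint: F) -> io::Result<bool>
where
    F: Fn(&[u8]) -> String,
{
    let v1 = fs::read(dir.join("v1.pdf"))?;
    let v2 = fs::read(dir.join("v2.pdf"))?;
    let expected = fs::read_to_string(dir.join("expected.txt"))?;
    let expected = parse_expectation(&expected).ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::InvalidData,
            "expected.txt must contain MATCH or DIFFER",
        )
    })?;

    let same = fingerprint(&v1) == fingerprint(&v2);
    Ok(match expected {
        Expectation::Match => same,
        Expectation::Differ => !same,
    })
}
```

Passing the fingerprint function as a closure keeps the sketch independent of the crate's real API, so each per-fixture test can reduce to one call with a different directory.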
## Test Implementation
The test file at `crates/pdftract-core/tests/fingerprint_reproducibility.rs` implements:
### 1. INV-3 Reproducibility Test
`test_inv3_reproducibility_100_invocations` runs 100 invocations on `acrobat_resave/v1.pdf` and verifies that all outputs are byte-identical.

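The shape of such a check can be sketched as follows. This is an illustration under assumptions, not the actual test body: `first_divergence` is a hypothetical helper, and the `fingerprint` closure stands in for the real fingerprint call.

```rust
/// Run `fingerprint` on the same input `runs` times and return the index of
/// the first run whose output differs from run 0, or `None` if all agree.
/// ASSUMPTION: `fingerprint` is a stand-in for the real fingerprint function.
fn first_divergence<F>(input: &[u8], runs: usize, fingerprint: F) -> Option<usize>
where
    F: Fn(&[u8]) -> String,
{
    let first = fingerprint(input);
    // Compare every later run against the first, byte for byte.
    (1..runs).find(|_| fingerprint(input).as_bytes() != first.as_bytes())
}
```

Returning the index of the first mismatch, rather than a plain boolean, makes a failure message more useful when the non-determinism only shows up intermittently.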
### 2. Fixture Pair Tests
All 8 fixture pairs have corresponding tests:
- `test_fixture_byte_identical` - MATCH
- `test_fixture_acrobat_resave` - MATCH
- `test_fixture_qpdf_resave` - MATCH
- `test_fixture_pdftk_resave` - MATCH
- `test_fixture_linearization_toggle` - MATCH (KU-7)
- `test_fixture_metadata_only` - MATCH (ADR-008)
- `test_fixture_content_edit_one_glyph` - DIFFER
- `test_fixture_content_edit_one_paragraph` - DIFFER

### 3. INV-13 Format Test
`test_inv13_fingerprint_format` validates that every fingerprint matches `^pdftract-v1:[0-9a-f]{64}$`.

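For illustration, the same pattern can be checked with the standard library alone. This is a sketch, assuming a dependency-free check is acceptable; `is_valid_fingerprint` is a hypothetical name, and the real test may well use a regex instead.

```rust
/// Version prefix taken from the documented pattern.
const PREFIX: &str = "pdftract-v1:";

/// Accepts exactly the strings matched by `^pdftract-v1:[0-9a-f]{64}$`:
/// the prefix followed by 64 lowercase hexadecimal digits and nothing else.
fn is_valid_fingerprint(s: &str) -> bool {
    match s.strip_prefix(PREFIX) {
        Some(hex) => {
            hex.len() == 64
                && hex.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
        }
        None => false,
    }
}
```

Uppercase hex digits, a trailing newline, or a digest of any other length are all rejected, which is what the anchored pattern requires.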
### 4. Cross-Platform Test
A placeholder exists for CI integration; it is commented out pending CI infrastructure.

## Critical Tests Verification (Plan Phase 1.7, lines 1232-1237)
All 5 critical tests are implemented:

| Critical Test | Implementation | Status |
|---------------|----------------|--------|
| Acrobat + pdftk same fingerprint | `test_fixture_acrobat_resave`, `test_fixture_pdftk_resave` | ✅ |
| /CreationDate differing only | `test_fixture_metadata_only` | ✅ |
| One glyph removed | `test_fixture_content_edit_one_glyph` | ✅ |
| 10 invocations identical | `test_inv3_reproducibility_100_invocations` (100x) | ✅ |
| Linearized same as unlinearized | `test_fixture_linearization_toggle` (KU-7) | ✅ |

## Regression Detection Tests
The test infrastructure can detect the following deliberate regressions:

1. **Metadata inclusion regression** - If `/Producer`, `/Title`, or `/CreationDate` is accidentally included in the hash, the `metadata_only` test would fail (v1 and v2 should MATCH but would DIFFER).

2. **Non-deterministic ordering regression** - If `HashMap` is used instead of `BTreeMap` for resource dictionary iteration, the 100-invocation reproducibility test would fail.

3. **Content-sensitivity regression** - If the algorithm degrades to a constant hash (one that ignores content), both `content_edit_*` tests would fail (should DIFFER but would MATCH).

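
The ordering regression in item 2 comes down to iteration order: a `BTreeMap` always yields entries sorted by key, while the iteration order of Rust's default `HashMap` is unspecified and can differ between instances. The sketch below illustrates the deterministic variant only. It rests on assumptions: `hash_resources` is a hypothetical function, the map's key and value types are invented for the example, and FNV-1a is a small stand-in rather than the digest pdftract uses.

```rust
use std::collections::BTreeMap;

/// FNV-1a (64-bit), used here ONLY as a small deterministic stand-in
/// for the real digest.
fn fnv1a(bytes: &[u8], mut state: u64) -> u64 {
    for &b in bytes {
        state ^= u64::from(b);
        state = state.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    state
}

/// Hash a (hypothetical) resource dictionary by visiting entries in sorted
/// key order, which `BTreeMap` guarantees regardless of insertion order.
fn hash_resources(resources: &BTreeMap<String, Vec<u8>>) -> u64 {
    let mut state = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for (name, value) in resources {
        // Length prefixes keep adjacent fields from running together.
        state = fnv1a(&(name.len() as u64).to_le_bytes(), state);
        state = fnv1a(name.as_bytes(), state);
        state = fnv1a(&(value.len() as u64).to_le_bytes(), state);
        state = fnv1a(value, state);
    }
    state
}
```

Two dictionaries built in different insertion orders hash identically here, which is the property the reproducibility test depends on.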
## Fixture Generation
Fixtures are generated from a clean source PDF (`.clean_source.pdf`) using:
- `generate_fingerprint_fixtures.py` - Main fixture generation script
- `pikepdf` Python library for PDF manipulation
- `qpdf` command-line tool for re-save and linearization operations

All fixture PDFs contain public-domain Lorem Ipsum text and are MIT-licensed.
## References

- Plan section: Phase 1.7 lines 1214-1219 (acceptance criteria), 1232-1237 (critical tests)
- INV-3: Fingerprint reproducibility
- INV-13: Fingerprint format validation
- KU-7: Linearization independence
- ADR-008: Metadata independence