Fingerprint Reproducibility Test Fixtures

This directory contains fixture pairs that verify the fingerprint algorithm's reproducibility and content-sensitivity properties.

Fixture Provenance

All fixtures derive from a single clean source PDF (.clean_source.pdf): a 3-page document of Lorem Ipsum text with minimal metadata, created with pikepdf, a Python library for PDF manipulation.

Generation

Fixtures are generated using generate_fingerprint_fixtures.py, which requires:

  • Python 3.11+
  • pikepdf library (install via nix-shell or pip)

To regenerate the fixtures with nix-shell, run from the repository root:

    nix-shell --pure --packages python3 python3Packages.pikepdf --run \
      'python3 tests/fingerprint/fixtures/generate_fingerprint_fixtures.py'

Fixture Pairs

Each fixture pair contains:

  • v1.pdf - Original or first variant
  • v2.pdf - Second variant (modified copy or re-saved version)
  • expected.txt - Either "MATCH" (fingerprints should be identical) or "DIFFER" (fingerprints should differ)
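
The pair layout above lends itself to a small data-driven check. The following is a minimal sketch, not the project's actual test harness: `check_pairs` is a hypothetical helper, and `fingerprint` stands in for whatever callable computes a pdftract fingerprint for a file. It assumes only the first whitespace-separated token of expected.txt is significant.

```python
from pathlib import Path
from typing import Callable, Iterator


def check_pairs(
    fixtures_dir: Path, fingerprint: Callable[[Path], str]
) -> Iterator[tuple[str, bool]]:
    """Yield (pair_name, passed) for each directory that has an expected.txt."""
    for pair in sorted(p for p in fixtures_dir.iterdir() if p.is_dir()):
        expected_file = pair / "expected.txt"
        if not expected_file.is_file():
            continue  # not a fixture pair (for example __pycache__)
        # Assumption: the first token is "MATCH" or "DIFFER".
        tokens = expected_file.read_text().split()
        expected = tokens[0] if tokens else ""
        if expected not in ("MATCH", "DIFFER"):
            raise ValueError(f"{pair.name}: unexpected expectation {expected!r}")
        same = fingerprint(pair / "v1.pdf") == fingerprint(pair / "v2.pdf")
        # A pair passes when equality of fingerprints agrees with the expectation.
        yield pair.name, same == (expected == "MATCH")
```

Passing the fingerprint function in as a parameter keeps the sketch independent of how pdftract is invoked (CLI, Python binding, or otherwise).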

1. byte_identical

Expected: MATCH

  • Same PDF copied twice (verifies fingerprint determinism)

2. acrobat_resave

Expected: MATCH

  • Simulates Acrobat re-save using qpdf
  • Changes /CreationDate, /ID, and xref byte layout
  • Preserves content (metadata-only changes should not affect fingerprint per ADR-008)

3. pdftk_resave

Expected: MATCH

  • Simulates pdftk re-save using qpdf
  • Changes object stream layout and compression
  • Content is unchanged, so the fingerprint must be identical

4. qpdf_resave

Expected: MATCH

  • Same source through qpdf with --object-streams=preserve --normalize-content=y
  • Verifies qpdf re-save produces same fingerprint

5. linearization_toggle

Expected: MATCH (KU-7)

  • Unlinearized PDF vs qpdf --linearize output
  • Different byte layouts but same content
  • Verifies linearization independence (KU-7 requirement)

6. metadata_only

Expected: MATCH (ADR-008)

  • Original vs copy with changed /Title, /Author, /Producer, /CreationDate
  • Verifies metadata independence per ADR-008

7. content_edit_one_glyph

Expected: DIFFER

  • "Hello World" vs "Hello Worl" (one character removed)
  • Verifies content-sensitivity: removing a single glyph changes fingerprint

8. content_edit_one_paragraph

Expected: DIFFER

  • Original paragraph vs variant with one word changed
  • Verifies content-sensitivity: paragraph edit changes fingerprint
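
To make the MATCH and DIFFER expectations concrete, here is a toy model of a content-only fingerprint. It is purely illustrative and is not pdftract's algorithm, which this README does not describe: `ToyPdf` and `toy_fingerprint` are hypothetical, and hashing page text with SHA-256 is an assumption chosen for simplicity.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyPdf:
    """Stand-in for a parsed PDF: page texts plus document metadata."""
    pages: tuple[str, ...]
    metadata: dict[str, str] = field(default_factory=dict)


def toy_fingerprint(doc: ToyPdf) -> str:
    """Hash page text only; metadata is deliberately ignored."""
    h = hashlib.sha256()
    for text in doc.pages:
        data = text.encode("utf-8")
        # Length prefix keeps page boundaries unambiguous.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return "toy:" + h.hexdigest()


original = ToyPdf(("Hello World",), {"Title": "Draft"})
retitled = ToyPdf(("Hello World",), {"Title": "Final"})  # metadata_only
one_glyph = ToyPdf(("Hello Worl",))                      # content_edit_one_glyph

print(toy_fingerprint(original) == toy_fingerprint(retitled))   # True  (MATCH)
print(toy_fingerprint(original) == toy_fingerprint(one_glyph))  # False (DIFFER)
```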

License

The fixture PDFs are generated with open-source tools (pikepdf, licensed under MPL-2.0; qpdf, licensed under Apache-2.0) and contain public-domain text (Lorem Ipsum). The fixtures themselves are MIT-licensed.

References

  • ADR-008: Metadata independence
  • KU-7: Linearization independence
  • INV-3: Fingerprint reproducibility (100 invocations produce identical results)
  • INV-13: Fingerprint format (^pdftract-v1:[0-9a-f]{64}$)
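
INV-3 and INV-13 can be checked together in a few lines. The sketch below uses the regular expression quoted above; `check_invariants` is a hypothetical helper, and `fingerprint` is a placeholder for a zero-argument callable that fingerprints one fixed input.

```python
import re
from typing import Callable

# INV-13 pattern, copied from the reference above.
FINGERPRINT_RE = re.compile(r"^pdftract-v1:[0-9a-f]{64}$")


def check_invariants(fingerprint: Callable[[], str], runs: int = 100) -> None:
    """Raise AssertionError if the format (INV-13) or repeatability (INV-3) fails."""
    first = fingerprint()
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    if not FINGERPRINT_RE.fullmatch(first):
        raise AssertionError(f"bad fingerprint format: {first!r}")
    for i in range(1, runs):
        current = fingerprint()
        if current != first:
            raise AssertionError(f"run {i} produced {current!r}, expected {first!r}")
```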