pdftract/tests/fingerprint/fixtures
__pycache__ fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
acrobat_resave [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
byte_identical [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
content_edit_one_glyph fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
content_edit_one_paragraph [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
linearization_toggle fix(pdftract-39gey): fix indent trigger to not split drop-cap paragraphs 2026-06-07 13:43:19 -04:00
metadata_only [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
pdftk_resave [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
qpdf_resave [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
.clean_source.pdf [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
check_compression.py fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
check_trailer.py fix(bf-1avnz): remove .code field access on String diagnostics in serve.rs 2026-06-01 04:14:05 -04:00
create_fixtures.py wip: intermediate state from previous work 2026-05-29 08:25:23 -04:00
debug_content_streams.py [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
generate_fingerprint_fixtures.py fix(pdftract-25igv): fix emit! macro usage in codespace parser 2026-05-28 07:29:33 -04:00
generate_fingerprint_fixtures_pikepdf.py [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus 2026-06-02 13:32:26 -04:00
inspect_fixtures.py feat(pdftract-91e1i): HTTP fetch sequence implementation 2026-05-28 13:17:00 -04:00
README.md fix(pyo3): correct extract_text_fn call in extract_markdown stub 2026-05-28 20:28:25 -04:00

Fingerprint Reproducibility Test Fixtures

This directory contains fixture pairs that verify the fingerprint algorithm's reproducibility and content-sensitivity properties.

Fixture Provenance

All fixtures are generated from a clean source PDF (.clean_source.pdf) created using pikepdf, a Python library for PDF manipulation. The source is a 3-page PDF with Lorem Ipsum text, created with minimal metadata.

Generation

Fixtures are generated using generate_fingerprint_fixtures.py, which requires:

  • Python 3.11+
  • pikepdf library (install via nix-shell or pip)

nix-shell --pure --packages python3 python3Packages.pikepdf --run \
  'python3 tests/fingerprint/fixtures/generate_fingerprint_fixtures.py'

Fixture Pairs

Each fixture pair contains:

  • v1.pdf - Original or first variant
  • v2.pdf - Second variant (modified copy or re-saved version)
  • expected.txt - Either "MATCH" (fingerprints should be identical) or "DIFFER" (fingerprints should differ)
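
The layout above lends itself to a small directory-driven check. The following is a minimal sketch, not the project's actual test harness: `iter_fixture_pairs` and `check_pair` are hypothetical helper names, the `fingerprint` callable stands in for whatever binding pdftract exposes, and expected.txt is assumed to hold the bare word MATCH or DIFFER (surrounding whitespace is stripped).

```python
from pathlib import Path
from typing import Callable, Iterator


def iter_fixture_pairs(root: Path) -> Iterator[tuple[str, Path, Path, str]]:
    """Yield (name, v1, v2, expected) for each complete fixture directory."""
    for d in sorted(p for p in root.iterdir() if p.is_dir()):
        v1, v2, expected = d / "v1.pdf", d / "v2.pdf", d / "expected.txt"
        if not (v1.is_file() and v2.is_file() and expected.is_file()):
            continue  # skips __pycache__ and any incomplete directory
        yield d.name, v1, v2, expected.read_text().strip()


def check_pair(
    fingerprint: Callable[[Path], str],  # hypothetical: path -> fingerprint string
    v1: Path,
    v2: Path,
    expected: str,
) -> bool:
    """Return True when the observed relation agrees with expected.txt."""
    if expected not in ("MATCH", "DIFFER"):
        raise ValueError(f"unexpected verdict: {expected!r}")
    same = fingerprint(v1) == fingerprint(v2)
    return same if expected == "MATCH" else not same
```

A test runner could loop over `iter_fixture_pairs(Path("tests/fingerprint/fixtures"))` and assert `check_pair(...)` for each entry, so adding a new fixture directory needs no change to the test code.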

1. byte_identical

Expected: MATCH

  • Same PDF copied twice (verifies fingerprint determinism)

2. acrobat_resave

Expected: MATCH

  • Simulates Acrobat re-save using qpdf
  • Changes /CreationDate, /ID, and xref byte layout
  • Preserves content (metadata-only changes should not affect fingerprint per ADR-008)

3. pdftk_resave

Expected: MATCH

  • Simulates pdftk re-save using qpdf
  • Changes object stream layout and compression
  • Content is unchanged, so the fingerprint should be identical

4. qpdf_resave

Expected: MATCH

  • Same source through qpdf with --object-streams=preserve --normalize-content=y
  • Verifies qpdf re-save produces same fingerprint

5. linearization_toggle

Expected: MATCH (KU-7)

  • Unlinearized PDF vs qpdf --linearize output
  • Different byte layouts but same content
  • Verifies linearization independence (KU-7 requirement)

6. metadata_only

Expected: MATCH (ADR-008)

  • Original vs copy with changed /Title, /Author, /Producer, /CreationDate
  • Verifies metadata independence per ADR-008

7. content_edit_one_glyph

Expected: DIFFER

  • "Hello World" vs "Hello Worl" (one character removed)
  • Verifies content-sensitivity: removing a single glyph changes fingerprint

8. content_edit_one_paragraph

Expected: DIFFER

  • Original paragraph vs variant with one word changed
  • Verifies content-sensitivity: paragraph edit changes fingerprint
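
The two properties exercised by the pairs above (metadata independence and content-sensitivity) can be illustrated with a toy model. This is a sketch under loud assumptions: `ToyDoc` and `toy_fingerprint` are invented for illustration and are not pdftract's algorithm. SHA-256 is used only because it yields 64 hex digits; this README does not say which hash pdftract uses.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Stand-in for a PDF: page text plus a metadata dictionary."""
    pages: list[str]
    metadata: dict[str, str] = field(default_factory=dict)


def toy_fingerprint(doc: ToyDoc) -> str:
    """Hash page text only; metadata is deliberately left out."""
    h = hashlib.sha256()
    for text in doc.pages:
        data = text.encode("utf-8")
        # Length-prefix each page so page boundaries affect the digest.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return "pdftract-v1:" + h.hexdigest()


original = ToyDoc(["Hello World"], {"Title": "A"})
retitled = ToyDoc(["Hello World"], {"Title": "B"})   # metadata_only analogue
one_glyph = ToyDoc(["Hello Worl"], {"Title": "A"})   # content_edit_one_glyph analogue

print(toy_fingerprint(original) == toy_fingerprint(retitled))   # True
print(toy_fingerprint(original) == toy_fingerprint(one_glyph))  # False
```

The real fixtures test the same relations against actual PDFs, where byte layout, xref tables, and compression also vary between v1.pdf and v2.pdf.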

License

The fixture PDFs are generated using open-source tools (pikepdf, licensed under MPL-2.0, and qpdf, licensed under Apache-2.0) and contain public-domain text (Lorem Ipsum). The fixtures themselves are MIT-licensed.

References

  • ADR-008: Metadata independence
  • KU-7: Linearization independence
  • INV-3: Fingerprint reproducibility (100 invocations produce identical results)
  • INV-13: Fingerprint format (^pdftract-v1:[0-9a-f]{64}$)
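
INV-13 and INV-3 can both be checked mechanically. The sketch below is illustrative: `is_valid_fingerprint` and `is_reproducible` are hypothetical helper names, and the zero-argument `fingerprint` callable stands in for a call into pdftract on a fixed input file. Only the regular expression and the count of 100 invocations come from the references above.

```python
import re
from typing import Callable

# Pattern taken verbatim from INV-13.
FINGERPRINT_RE = re.compile(r"^pdftract-v1:[0-9a-f]{64}$")


def is_valid_fingerprint(value: str) -> bool:
    """True if value has the INV-13 shape.

    fullmatch is used so that a trailing newline is rejected;
    with re.match, `$` would also match just before a final "\n".
    """
    return FINGERPRINT_RE.fullmatch(value) is not None


def is_reproducible(fingerprint: Callable[[], str], runs: int = 100) -> bool:
    """Call fingerprint() `runs` times; require one distinct, well-formed result."""
    results = {fingerprint() for _ in range(runs)}
    return len(results) == 1 and all(is_valid_fingerprint(r) for r in results)
```

In a test, `fingerprint` might be a lambda that closes over one of the fixture paths, for example v1.pdf from byte_identical.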