Document Model Test Fixtures

This directory contains curated PDF fixtures for testing the document model integration.

Fixture Passwords

IMPORTANT: The passwords for the encrypted fixtures are NOT secret. They exist only for testing:

  • encrypted_rc4_test.pdf: RC4-40, password "test"
  • encrypted_aes128_test.pdf: AES-128, password "test"
  • encrypted_aes256_test.pdf: AES-256 (PDF 2.0), password "test"
  • encrypted_empty_password.pdf: RC4-40, empty password
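A test helper can keep these passwords in one place. The following is a minimal sketch, not part of the test suite: the names `FIXTURE_PASSWORDS` and `password_for` are hypothetical, and only the file names and passwords come from the list above.

```python
from pathlib import Path

# Passwords copied from the list above; "" stands for the empty password.
FIXTURE_PASSWORDS = {
    "encrypted_rc4_test.pdf": "test",
    "encrypted_aes128_test.pdf": "test",
    "encrypted_aes256_test.pdf": "test",
    "encrypted_empty_password.pdf": "",
}


def password_for(fixture):
    """Return the documented password, or None if the fixture has none listed."""
    return FIXTURE_PASSWORDS.get(Path(fixture).name)
```

Returning `None` for unlisted fixtures keeps "no password documented" distinct from "empty password".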

Fixture List

Encrypted Files (EC-04, EC-05, EC-06)

  • encrypted_rc4_test.pdf — RC4-encrypted, user password "test" (EC-04)
  • encrypted_aes128_test.pdf — AES-128, password "test" (EC-05)
  • encrypted_aes256_test.pdf — AES-256 (PDF 2.0), password "test" (EC-06)
  • encrypted_empty_password.pdf — RC4-encrypted, empty owner password
  • encrypted_unknown_handler.pdf — Custom handler (Adobe Public Key, /Filter /Adobe.PubSec)

Tagged PDFs

  • tagged_3_level_outline.pdf — 3 levels of bookmarks with mixed UTF-16BE/PDFDocEncoded titles
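The mixed encodings matter because a reader has to pick a decoder per title. A rough sketch of that decision, assuming the raw string bytes are already extracted (the function name is hypothetical, and Latin-1 is used only as an approximation of PDFDocEncoding):

```python
def decode_outline_title(raw: bytes) -> str:
    """Decode a PDF text string that is either UTF-16BE or PDFDocEncoded."""
    # Text strings that start with the byte order mark FE FF are UTF-16BE.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Otherwise the string is PDFDocEncoded. Latin-1 agrees with it for
    # printable ASCII and most of 0xA1-0xFF, but NOT for 0x18-0x1F or
    # 0x80-0xA0, so a real decoder needs a proper lookup table.
    return raw.decode("latin-1")
```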

Optional Content (EC-16)

  • ocg_default_off.pdf — Single OCG with /D /BaseState /OFF (EC-16)

Multi-Revision

  • multi_revision_3.pdf — 3 incremental revisions, page count differs across revisions

Page Tree Inheritance (EC-09)

  • inheritance_grandparent_mediabox.pdf — page 0 has no MediaBox; inherits from grandparent /Pages node
  • missing_mediabox.pdf — page with no MediaBox anywhere (EC-09)
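The two fixtures exercise the two outcomes of a /Parent walk: a MediaBox found on an ancestor, and none found at all. A minimal sketch using plain dicts in place of PDF objects; the US Letter fallback is an assumption for illustration, and the behavior pdftract actually expects is recorded in the fixtures' `.expected.json` files.

```python
DEFAULT_MEDIABOX = [0, 0, 612, 792]  # assumed fallback (US Letter), not confirmed


def resolve_mediabox(node, default=DEFAULT_MEDIABOX):
    """Walk /Parent links until a MediaBox is found, else return the default."""
    seen = set()
    while node is not None and id(node) not in seen:
        seen.add(id(node))  # guard against cyclic /Parent chains
        if "MediaBox" in node:
            return node["MediaBox"]
        node = node.get("Parent")
    return default


# Grandparent carries the MediaBox; the page and its parent do not.
root = {"Type": "Pages", "MediaBox": [0, 0, 595, 842]}
kids = {"Type": "Pages", "Parent": root}
page = {"Type": "Page", "Parent": kids}
```

Here `resolve_mediabox(page)` returns the grandparent's box, while a page with no MediaBox anywhere in its chain gets the default.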

Resource Merging

  • partial_resource_override.pdf — page overrides /Resources /Font partially; merged result expected
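One possible reading of "merged result" is a per-category merge in which the page's entries win on name collisions. The sketch below assumes that reading and uses plain dicts; `merge_resources` is a hypothetical name, and the authoritative expectation is whatever `partial_resource_override.expected.json` records.

```python
def merge_resources(inherited, own):
    """Merge a page's /Resources over inherited ones, category by category."""
    # Copy dict-valued categories so the inherited dictionary is not mutated.
    merged = {k: (dict(v) if isinstance(v, dict) else v) for k, v in inherited.items()}
    for category, value in own.items():
        if isinstance(value, dict) and isinstance(merged.get(category), dict):
            merged[category].update(value)  # page entries override by name
        else:
            merged[category] = dict(value) if isinstance(value, dict) else value
    return merged


parent = {"Font": {"F1": "Helvetica", "F2": "Times"}, "XObject": {"Im0": "image"}}
page = {"Font": {"F2": "Courier"}}
result = merge_resources(parent, page)
```

In this example `result["Font"]` keeps the inherited `F1`, takes the page's `F2`, and the inherited `XObject` category survives untouched.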

JavaScript Detection

  • js_in_openaction.pdf — /OpenAction /S /JavaScript

XFA Forms

  • xfa_form.pdf — /AcroForm /XFA present

Conformance Detection

  • pdfa_1b_conformance.pdf — XMP metadata declaring PDF/A-1B conformance
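PDF/A conformance is declared in XMP through the `pdfaid:part` and `pdfaid:conformance` properties. A small sketch of reading them with the standard library, assuming the XMP packet has already been extracted as a string; the function name and sample packet are illustrative, not taken from the fixture.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def pdfa_conformance(xmp):
    """Return e.g. 'PDF/A-1B' if the XMP declares part and conformance, else None."""
    values = {}
    for elem in ET.fromstring(xmp).iter():
        for name in ("part", "conformance"):
            key = f"{{{PDFAID_NS}}}{name}"
            if elem.tag == key and elem.text:
                values[name] = elem.text.strip()  # property written as an element
            elif key in elem.attrib:
                values[name] = elem.attrib[key]  # property written as an attribute
    if "part" in values and "conformance" in values:
        return f"PDF/A-{values['part']}{values['conformance']}"
    return None


SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""
```

The lookup handles both the element and the attribute form, since XMP writers use either.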

Page Labels

  • page_labels_roman_arabic.pdf — pages 0..3 roman, pages 4..end arabic
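To illustrate what such a label layout produces, here is a sketch that maps zero-based page indices to labels. It assumes lowercase roman numerals and arabic numbering that restarts at 1 in the second range; the fixture may use a different style or start value, so treat the output as illustrative only.

```python
def to_roman(n):
    """Lowercase roman numeral for a positive integer."""
    numerals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
                (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
                (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def page_labels(page_count, ranges):
    """ranges maps a range's first page index to a style: 'r' (roman) or 'D' (arabic).

    A range starting at index 0 must be present.
    """
    labels = []
    for index in range(page_count):
        start = max(s for s in ranges if s <= index)
        number = index - start + 1  # assumed: numbering restarts at 1 per range
        labels.append(to_roman(number) if ranges[start] == "r" else str(number))
    return labels


labels = page_labels(6, {0: "r", 4: "D"})
# labels == ["i", "ii", "iii", "iv", "1", "2"]
```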

Fixture Generation

Fixtures are generated with qpdf and by hand-crafting PDF structures.

See scripts/generate_document_model_fixtures.sh for the generation script.
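For orientation, the encrypted fixtures could be produced with qpdf's `--encrypt user-password owner-password key-length` syntax. The sketch below only builds the argument list and is not the content of the generation script; the helper name and file paths are hypothetical, and using the same value for the user and owner password is an assumption.

```python
def qpdf_encrypt_args(src, dst, password, key_bits, use_aes=False):
    """Build a qpdf command line that encrypts src into dst."""
    args = ["qpdf"]
    # RC4 (40-bit, or 128-bit without AES) counts as weak crypto; recent qpdf
    # versions expect --allow-weak-crypto before writing such files.
    if key_bits == 40 or (key_bits == 128 and not use_aes):
        args.append("--allow-weak-crypto")
    args += ["--encrypt", password, password, str(key_bits)]
    if key_bits == 128 and use_aes:
        args.append("--use-aes=y")  # 256-bit encryption always uses AES
    args += ["--", src, dst]
    return args
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a machine with qpdf installed.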