The native PyO3 module returns raw dicts via pythonize, but the Python SDK API expects typed dataclass objects (Document, Page, Metadata, etc.) to be consistent with the subprocess fallback and test expectations.

Updated wrapper functions in __init__.py to convert native results:
- extract(): wraps dict in Document.from_dict()
- extract_stream(): wraps yielded page dicts in Page.from_dict()
- get_metadata(): wraps dict in Metadata()
- hash(): wraps string in Fingerprint.from_string()
- classify(): wraps dict in Classification()
- search(): wraps yielded match dicts in Match

The native PyO3 entry points (extract, extract_text, extract_stream) were already implemented with:
- extract: uses extract_pdf + pythonize for PyDict conversion
- extract_text: uses extract_text for plain String return
- extract_stream: uses extract_pdf_streaming with custom StreamIterator

All kwargs parsing with strict validation (unknown kwargs raise TypeError) was already in place.

Acceptance criteria:
- pdftract.extract() returns Document object with pages/metadata
- pdftract.extract_text() returns plain text string
- pdftract.extract_stream() yields Page objects
- Unknown kwarg raises TypeError

| | | |
|---|---|---|
| expected_backup | ||
| src | ||
| _temp_enc_rc4.pdf | ||
| base_hello.pdf | ||
| encrypted_aes128_test.expected.json | ||
| encrypted_aes128_test.pdf | ||
| encrypted_aes256_test.expected.json | ||
| encrypted_aes256_test.pdf | ||
| encrypted_empty_password.expected.json | ||
| encrypted_empty_password.pdf | ||
| encrypted_rc4_test.expected.json | ||
| encrypted_rc4_test.pdf | ||
| encrypted_unknown_handler.expected.json | ||
| encrypted_unknown_handler.pdf | ||
| generate_fixtures | ||
| generate_fixtures.rs | ||
| generate_fixtures.rs.disabled | ||
| generate_fixtures_new | ||
| inheritance_grandparent_mediabox.expected.json | ||
| inheritance_grandparent_mediabox.pdf | ||
| js_in_openaction.expected.json | ||
| js_in_openaction.pdf | ||
| missing_mediabox.expected.json | ||
| missing_mediabox.pdf | ||
| multi_revision_3.expected.json | ||
| multi_revision_3.pdf | ||
| ocg_default_off.expected.json | ||
| ocg_default_off.pdf | ||
| page_labels_roman_arabic.expected.json | ||
| page_labels_roman_arabic.pdf | ||
| partial_resource_override.expected.json | ||
| partial_resource_override.pdf | ||
| pdfa_1b_conformance.expected.json | ||
| pdfa_1b_conformance.pdf | ||
| README.md | ||
| tagged_3_level_outline.expected.json | ||
| tagged_3_level_outline.pdf | ||
| xfa_form.expected.json | ||
| xfa_form.pdf | ||
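
The wrapping step described in the commit message at the top can be pictured with the minimal sketch below. Everything in it is an assumption made for illustration: the `Page` and `Document` dataclasses are simplified stand-ins, `_native_extract` is a hypothetical placeholder for the PyO3 entry point, and the `password` option name is invented. The real SDK's fields and accepted kwargs may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class Page:
    # Simplified stand-in; the real Page dataclass likely has more fields.
    index: int
    text: str

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Page":
        return cls(index=data["index"], text=data.get("text", ""))


@dataclass
class Document:
    pages: list[Page] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Document":
        pages = [Page.from_dict(p) for p in data.get("pages", [])]
        return cls(pages=pages, metadata=dict(data.get("metadata", {})))


_ALLOWED_KWARGS = {"password"}  # hypothetical option name


def _native_extract(path: str, **kwargs: Any) -> dict[str, Any]:
    """Hypothetical stand-in for the native module: returns a raw dict."""
    unknown = set(kwargs) - _ALLOWED_KWARGS
    if unknown:
        # Strict validation: unknown kwargs raise TypeError.
        raise TypeError(f"unexpected keyword argument(s): {sorted(unknown)}")
    return {"pages": [{"index": 0, "text": "hello"}], "metadata": {"source": path}}


def extract(path: str, **kwargs: Any) -> Document:
    """Wrap the raw dict from the native layer in a typed Document."""
    return Document.from_dict(_native_extract(path, **kwargs))
```

With this shape, callers get attribute access (`doc.pages[0].text`) regardless of which backend produced the raw dict.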
Document Model Test Fixtures
This directory contains curated PDF fixtures for testing the document model integration.
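
Most fixtures in this directory come as a `NAME.pdf` / `NAME.expected.json` pair. As a rough sketch of how a test harness might discover those pairs, the hypothetical helper below (`fixture_pairs` is not part of the repository) globs for PDFs and keeps only those with a matching expectation file.

```python
from pathlib import Path


def fixture_pairs(fixture_dir):
    """Return sorted (pdf, expected_json) path pairs found in fixture_dir.

    PDFs without a sibling NAME.expected.json are skipped.
    """
    pairs = []
    for pdf in sorted(Path(fixture_dir).glob("*.pdf")):
        expected = pdf.with_name(pdf.stem + ".expected.json")
        if expected.exists():
            pairs.append((pdf, expected))
    return pairs
```

Each pair could then be fed to a parametrized test that extracts the PDF and compares the result against the JSON expectation.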
Fixture Passwords
IMPORTANT: The passwords for encrypted fixtures are NOT secret. They are test fixtures:
- encrypted_rc4_test.pdf: RC4-40, password "test"
- encrypted_aes128_test.pdf: AES-128, password "test"
- encrypted_aes256_test.pdf: AES-256 (PDF 2.0), password "test"
- encrypted_empty_password.pdf: RC4-40, empty password
Fixture List
Encrypted Files (EC-04, EC-05, EC-06)
- encrypted_rc4_test.pdf: RC4-encrypted, user password "test" (EC-04)
- encrypted_aes128_test.pdf: AES-128, password "test" (EC-05)
- encrypted_aes256_test.pdf: AES-256 (PDF 2.0), password "test" (EC-06)
- encrypted_empty_password.pdf: RC4-encrypted, empty owner password
- encrypted_unknown_handler.pdf: Custom handler (Adobe Public Key, /Filter /Adobe.PubSec)
Tagged PDFs
- tagged_3_level_outline.pdf: 3 levels of bookmarks with mixed UTF-16BE/PDFDocEncoded titles
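
Bookmark titles are PDF text strings, which are UTF-16BE when they start with the byte order mark `FE FF` and PDFDocEncoding otherwise. The sketch below shows that dispatch. Python has no built-in PDFDocEncoding codec, so the fallback here is approximated with Latin-1, which agrees for ASCII and most accented Latin letters but not for every code point. The function name is hypothetical.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string, choosing the encoding from the BOM."""
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE with a leading byte order mark; strip the BOM first.
        return raw[2:].decode("utf-16-be")
    # Approximation: Latin-1 stands in for PDFDocEncoding in this sketch.
    return raw.decode("latin-1")
```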
Optional Content (EC-16)
- ocg_default_off.pdf: Single OCG with /D /BaseState /OFF (EC-16)
Multi-Revision
- multi_revision_3.pdf: 3 incremental revisions, page count differs across revisions
Page Tree Inheritance (EC-09)
- inheritance_grandparent_mediabox.pdf: page 0 has no MediaBox; inherits from grandparent /Pages node
- missing_mediabox.pdf: page with no MediaBox anywhere (EC-09)
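
The two fixtures above exercise the lookup of an inheritable attribute along the /Parent chain. A minimal sketch of that walk, assuming page tree nodes are modelled as plain dicts with an optional "Parent" entry, might look like this. Returning `None` when no ancestor supplies a MediaBox is a choice made for the sketch; how the library handles the EC-09 case is not stated here.

```python
def resolve_media_box(page):
    """Walk up the /Parent chain and return the nearest MediaBox, or None."""
    seen = set()
    node = page
    while node is not None and id(node) not in seen:
        seen.add(id(node))  # guard against cyclic /Parent links in malformed files
        if "MediaBox" in node:
            return node["MediaBox"]
        node = node.get("Parent")
    return None
```

A page whose grandparent carries `[0, 0, 612, 792]` resolves to that box, while a page with no MediaBox anywhere in its ancestry resolves to `None`.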
Resource Merging
- partial_resource_override.pdf: page overrides /Resources /Font partially; merged result expected
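
One way to read "merged result expected" is a per-category merge in which the page's own entries win over inherited ones. The sketch below illustrates that reading only; it assumes every resource category is a name-to-object dict (so array-valued entries such as /ProcSet are out of scope), and `merge_resources` is a hypothetical name.

```python
def merge_resources(inherited, own):
    """Merge resource dictionaries per category; the page's entries take precedence."""
    # Copy each inherited category so the ancestor's dictionary is not mutated.
    merged = {category: dict(entries) for category, entries in inherited.items()}
    for category, entries in own.items():
        merged.setdefault(category, {}).update(entries)
    return merged
```

Under this reading, a page that redefines only font F2 still sees the inherited F1.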
JavaScript Detection
- js_in_openaction.pdf: /OpenAction /S /JavaScript
XFA Forms
- xfa_form.pdf: /AcroForm /XFA present
Conformance Detection
- pdfa_1b_conformance.pdf: XMP metadata declaring PDF/A-1B conformance
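
A PDF/A declaration lives in the XMP packet as `pdfaid:part` and `pdfaid:conformance`, written either as attributes or as child elements of an `rdf:Description`. The following sketch reads both forms with the standard library. The function name is hypothetical, and it only reports what the metadata claims; it does not validate actual conformance.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def pdfa_conformance(xmp: str):
    """Return a string such as '1B' if the XMP declares PDF/A, else None."""
    root = ET.fromstring(xmp)
    part_tag = f"{{{PDFAID_NS}}}part"
    level_tag = f"{{{PDFAID_NS}}}conformance"
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # Accept both the attribute form and the child-element form.
        part = desc.get(part_tag) or desc.findtext(part_tag)
        level = desc.get(level_tag) or desc.findtext(level_tag)
        if part and level:
            return f"{part.strip()}{level.strip().upper()}"
    return None
```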
Page Labels
- page_labels_roman_arabic.pdf: pages 0..3 roman, pages 4..end arabic
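
To make the expected labels concrete, here is a small sketch that turns label ranges into per-page labels. It assumes lowercase roman numerals and that numbering restarts at 1 in each range (the default when no /St entry is given); the fixture itself may use uppercase numerals or a different start value. The `ranges` mapping must contain an entry for page index 0, and all names are hypothetical.

```python
def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    out = []
    for value, symbol in numerals:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def page_labels(page_count, ranges):
    """Build labels from {start_index: style}, where style is 'r' (roman) or 'D' (decimal)."""
    starts = sorted(ranges)
    labels = []
    for index in range(page_count):
        # The governing range is the one with the largest start <= index.
        start = max(s for s in starts if s <= index)
        number = index - start + 1
        labels.append(to_roman(number) if ranges[start] == "r" else str(number))
    return labels


print(page_labels(6, {0: "r", 4: "D"}))  # ['i', 'ii', 'iii', 'iv', '1', '2']
```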
Fixture Generation
Fixtures are generated using qpdf and hand-crafted PDF construction.
See scripts/generate_document_model_fixtures.sh for the generation script.