- Fixed missing fields in BlockJson, SpanJson, ExtractionOptions initializations - Added feature gates to ocr_integration tests for conditional compilation - Fixed McpServerState::new calls to include audit writer argument - Fixed CCITTFaxDecoder::decode calls to use instance method - Fixed type casts for ObjRef::new calls - Fixed serde_json::Value method calls (is_some -> !is_null) - Fixed ProfileType test feature gates - Worked around lifetime issues in schema roundtrip tests These changes fix numerous compilation errors that were blocking the codebase from building. The main library and tests now compile successfully. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2.6 KiB
2.6 KiB
Verification Note: pdftract-1c4j2 (7.7.1: /Threads array discovery + /I thread info metadata extraction)
Summary
Implemented Phase 7.7.1: Thread info extraction from PDF article threads.
Implementation
Files Changed
-
crates/pdftract-core/src/threads/mod.rs(new module)ThreadHeaderstruct with first_bead_ref, title, author, subject, keywordsdiscover()function to read /Threads from catalog- PDFDocEncoding and UTF-16BE string decoding
- Comprehensive unit tests
-
crates/pdftract-core/src/parser/catalog.rs- Added
threads_ref: Option<ObjRef>field to Catalog struct - Parse /Threads array in parse_catalog function
- Added
-
crates/pdftract-core/src/lib.rs- Added
pub mod threads;
- Added
Acceptance Criteria Status
PASS
- ✅ Thread with no /I info dict -> title/author/subject/keywords all None
- ✅ 3 threads with various info configurations handled correctly
- ✅ Thread with no /Title (but /I present) -> title is None
- ✅ Thread missing /F skipped with diagnostic
- ✅ UTF-16BE title decoded correctly
- ✅ Empty string title returns Some("") not None
- ✅ Empty /Threads returns empty Vec without diagnostic
- ✅ /Threads absent returns empty Vec without diagnostic
Tests Added
test_thread_header_new- Basic ThreadHeader constructiontest_thread_header_with_fields- ThreadHeader with populated fieldstest_decode_pdf_string_ascii- ASCII string decodingtest_decode_pdf_string_utf16be_bom- UTF-16BE BOM handlingtest_decode_pdf_string_empty- Empty string handlingtest_decode_pdf_string_latin1- PDFDocEncoding (Latin-1) decodingtest_decode_utf16be_invalid_length- Invalid UTF-16 lengthtest_decode_pdfdocencoding_empty- Empty PDFDocEncodingtest_decode_pdfdocencoding_ascii- PDFDocEncoding ASCIItest_discover_thread_no_info_dict- No /I dict -> all fields Nonetest_discover_three_threads- Multiple threads with varied configstest_discover_thread_missing_f_skipped- Thread without /F skippedtest_discover_thread_utf16_title- UTF-16 title decodingtest_discover_empty_threads- Empty /Threads arraytest_discover_no_threads_field- No /Threads in catalogtest_discover_thread_empty_title- Empty string title is Some("")
Compilation
- ✅
cargo check --libpasses - ✅
cargo clippy --libpasses (no threads-specific warnings) - ✅
cargo fmtapplied
Commit
- Commit:
aedabdb - Message: feat(pdftract-1c4j2): implement thread info extraction (7.7.1)
- Pushed to github/main
References
- Plan section: 7.7 line 2683 (thread info)
- PDF 1.7 spec 12.4.3 Articles
- Phase 1 PdfString decoder (reimplemented in threads module)