jedarden 6000c654ce fix: resolve compilation errors across codebase

- Fixed missing fields in BlockJson, SpanJson, ExtractionOptions initializations
- Added feature gates to ocr_integration tests for conditional compilation
- Fixed McpServerState::new calls to include audit writer argument
- Fixed CCITTFaxDecoder::decode calls to use instance method
- Fixed type casts for ObjRef::new calls
- Fixed serde_json::Value method calls (is_some -> !is_null)
- Fixed ProfileType test feature gates
- Worked around lifetime issues in schema roundtrip tests

These changes fix numerous compilation errors that were blocking the
codebase from building. The main library and tests now compile successfully.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 08:38:04 -04:00


Verification Note: pdftract-1c4j2 (7.7.1: /Threads array discovery + /I thread info metadata extraction)

Summary

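Implemented Phase 7.7.1: discovery of the /Threads array in the document catalog and extraction of the /I thread info metadata (title, author, subject, keywords) for each PDF article thread.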

Implementation

Files Changed

crates/pdftract-core/src/threads/mod.rs (new module)
- ThreadHeader struct with first_bead_ref, title, author, subject, keywords
- discover() function to read /Threads from catalog
- PDFDocEncoding and UTF-16BE string decoding
- Comprehensive unit tests
crates/pdftract-core/src/parser/catalog.rs
- Added threads_ref: Option<ObjRef> field to Catalog struct
- Parse /Threads array in parse_catalog function
crates/pdftract-core/src/lib.rs
- Added pub mod threads;
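
The string handling listed for the threads module can be illustrated with a minimal sketch. This is not the code in threads/mod.rs: the function names decode_pdf_string, decode_utf16be and decode_latin1_approx, the Option return type, and the choice to return None for an odd-length UTF-16BE payload are all assumptions made for illustration. The fallback maps each byte directly to the Unicode code point of the same value, which agrees with PDFDocEncoding only for printable ASCII and most of the 0xA1-0xFF range; a complete decoder would need the lookup table from the PDF specification for the remaining bytes.

```rust
/// Decode a PDF text string (illustrative sketch, hypothetical names).
///
/// A leading FE FF byte order mark selects UTF-16BE; anything else goes
/// through a simplified byte-per-character fallback.
pub fn decode_pdf_string(bytes: &[u8]) -> Option<String> {
    match bytes {
        [0xFE, 0xFF, rest @ ..] => decode_utf16be(rest),
        _ => Some(decode_latin1_approx(bytes)),
    }
}

/// Decode big-endian UTF-16 code units with the BOM already removed.
/// Assumption: an odd byte count or an unpaired surrogate yields None.
fn decode_utf16be(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

/// Simplified stand-in for PDFDocEncoding: each byte becomes the Unicode
/// code point with the same value (that is, Latin-1). This is NOT the full
/// PDFDocEncoding table.
fn decode_latin1_approx(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

In this sketch an empty input falls through to the fallback branch and produces Some(""), which is one way to keep an empty title distinct from a missing one.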

Acceptance Criteria Status

PASS

✅ Thread with no /I info dict -> title/author/subject/keywords all None
✅ 3 threads with various info configurations handled correctly
✅ Thread with no /Title (but /I present) -> title is None
✅ Thread missing /F skipped with diagnostic
✅ UTF-16BE title decoded correctly
✅ Empty string title returns Some("") not None
✅ Empty /Threads returns empty Vec without diagnostic
✅ /Threads absent returns empty Vec without diagnostic
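
The criteria above can be read as a small decision procedure. The sketch below models it over a toy in-memory representation rather than real PDF objects. Only the ThreadHeader field names and the name discover() come from this note; ObjRef's fields, the ThreadDict and ThreadInfo types, the discover() signature, and the diagnostic wording are hypothetical.

```rust
/// Hypothetical stand-in for the crate's object reference type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ObjRef {
    pub num: u32,
    pub generation: u16,
}

/// Toy model of a thread's /I info dictionary (assumed shape).
#[derive(Debug, Clone, Default)]
pub struct ThreadInfo {
    pub title: Option<String>,
    pub author: Option<String>,
    pub subject: Option<String>,
    pub keywords: Option<String>,
}

/// Toy model of one entry in the /Threads array (assumed shape).
#[derive(Debug, Clone, Default)]
pub struct ThreadDict {
    pub first_bead: Option<ObjRef>, // /F
    pub info: Option<ThreadInfo>,   // /I
}

/// Field names follow the note; the derives are an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ThreadHeader {
    pub first_bead_ref: ObjRef,
    pub title: Option<String>,
    pub author: Option<String>,
    pub subject: Option<String>,
    pub keywords: Option<String>,
}

/// Build one header per usable thread.
/// `threads` is None when the catalog has no /Threads entry.
pub fn discover(
    threads: Option<&[ThreadDict]>,
    diagnostics: &mut Vec<String>,
) -> Vec<ThreadHeader> {
    // Absent /Threads: empty result, no diagnostic.
    let Some(threads) = threads else {
        return Vec::new();
    };
    let mut headers = Vec::with_capacity(threads.len());
    for (index, thread) in threads.iter().enumerate() {
        // A thread without /F has no first bead, so it is skipped and reported.
        let Some(first_bead_ref) = thread.first_bead else {
            diagnostics.push(format!("thread {index}: missing /F, skipped"));
            continue;
        };
        // A missing /I dictionary behaves like one with every field unset.
        let info = thread.info.clone().unwrap_or_default();
        headers.push(ThreadHeader {
            first_bead_ref,
            title: info.title,
            author: info.author,
            subject: info.subject,
            keywords: info.keywords,
        });
    }
    headers
}
```

An empty array takes the same path as a populated one and simply produces no headers, so neither the empty nor the absent case emits a diagnostic.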

Tests Added

test_thread_header_new - Basic ThreadHeader construction
test_thread_header_with_fields - ThreadHeader with populated fields
test_decode_pdf_string_ascii - ASCII string decoding
test_decode_pdf_string_utf16be_bom - UTF-16BE BOM handling
test_decode_pdf_string_empty - Empty string handling
test_decode_pdf_string_latin1 - PDFDocEncoding decoding of Latin-1-compatible bytes
test_decode_utf16be_invalid_length - Invalid UTF-16 length
test_decode_pdfdocencoding_empty - Empty PDFDocEncoding
test_decode_pdfdocencoding_ascii - PDFDocEncoding ASCII
test_discover_thread_no_info_dict - No /I dict -> all fields None
test_discover_three_threads - Multiple threads with varied configs
test_discover_thread_missing_f_skipped - Thread without /F skipped
test_discover_thread_utf16_title - UTF-16 title decoding
test_discover_empty_threads - Empty /Threads array
test_discover_no_threads_field - No /Threads in catalog
test_discover_thread_empty_title - Empty string title is Some("")

Compilation

✅ cargo check --lib passes
✅ cargo clippy --lib passes (no threads-specific warnings)
✅ cargo fmt applied

Commit

Commit: aedabdb
Message: feat(pdftract-1c4j2): implement thread info extraction (7.7.1)
Pushed to github/main

References

Plan section: 7.7 line 2683 (thread info)
PDF 1.7 spec 12.4.3 Articles
Phase 1 PdfString decoder (reimplemented in threads module)
