Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration.

Changes:
- Added ThreadJson and BeadJson structs to schema/mod.rs
- Added thread_to_json() function to threads/mod.rs
- Added build_page_ref_to_index() helper to parser/pages.rs
- Added threads field to ExtractionResult in extract.rs
- Implemented Phase 7.7 extraction logic with discover_threads/walk_beads
- Added threads_to_markdown() and collapse_page_ranges() to markdown.rs
- Updated JSON schema with ThreadJson and BeadJson definitions
- Added thread_to_py() and bead_to_py() conversions in pdftract-py
- Exported ThreadJson, BeadJson from lib.rs

All 32 threads module tests pass. All 35 markdown tests pass.

Verification: notes/pdftract-3h9xo.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-3h9xo: threads JSON output + schema integration
Bead Description
Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration.
Implementation Summary
1. Schema (crates/pdftract-core/src/schema/mod.rs)
- Added ThreadJson struct with fields: title, author, subject, keywords, beads
- Added BeadJson struct with fields: page_index, rect
- Both structs derive Serialize, Deserialize, JsonSchema
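
For orientation, the two structs might look roughly like the sketch below. Only the struct and field names come from this note; the field types (optional strings, a four-element `rect` array, a zero-based `page_index`) and the `example_thread` helper are assumptions for illustration. The serde and schemars derives listed above are left out so the sketch compiles with the standard library alone.

```rust
// Sketch only: names are from the note, types are assumptions.
// The real structs also derive Serialize, Deserialize, and JsonSchema.

/// One bead: a rectangular region on a single page.
#[derive(Debug, Clone, PartialEq)]
pub struct BeadJson {
    /// Index of the page holding the bead (assumed zero-based).
    pub page_index: usize,
    /// Assumed layout: [x0, y0, x1, y1] in PDF user-space units.
    pub rect: [f64; 4],
}

/// One article thread: optional metadata plus its chain of beads.
#[derive(Debug, Clone, PartialEq)]
pub struct ThreadJson {
    pub title: Option<String>,
    pub author: Option<String>,
    pub subject: Option<String>,
    pub keywords: Option<String>,
    pub beads: Vec<BeadJson>,
}

/// Hypothetical helper that builds a small two-bead thread.
pub fn example_thread() -> ThreadJson {
    ThreadJson {
        title: Some("Feature Story".to_string()),
        author: Some("A. Writer".to_string()),
        subject: None,
        keywords: None,
        beads: vec![
            BeadJson { page_index: 0, rect: [72.0, 72.0, 300.0, 720.0] },
            BeadJson { page_index: 1, rect: [72.0, 72.0, 300.0, 400.0] },
        ],
    }
}
```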
2. Threads Module (crates/pdftract-core/src/threads/mod.rs)
- Added thread_to_json() function to convert ThreadHeader + Bead slice to ThreadJson
- Function properly handles UTF-16 decoded strings from PDF
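
The UTF-16 handling mentioned above can be illustrated with a small decoder. PDF text strings that begin with the byte-order mark `FE FF` are UTF-16BE. The function name `decode_pdf_text` is hypothetical, and falling back to Latin-1 for strings without a BOM is a simplifying assumption (the PDF specification defines PDFDocEncoding for that case). This is a sketch, not the crate's decoder.

```rust
/// Hypothetical helper: decode a PDF text string into a Rust `String`.
pub fn decode_pdf_text(bytes: &[u8]) -> String {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        // UTF-16BE after the BOM: pair up bytes into big-endian code units.
        // `chunks_exact` silently drops a trailing odd byte.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        // Unpaired surrogates become U+FFFD instead of causing an error.
        String::from_utf16_lossy(&units)
    } else {
        // Assumption: treat the bytes as Latin-1, where each byte maps to
        // the Unicode code point of the same value.
        bytes.iter().map(|&b| b as char).collect()
    }
}
```

A lossy conversion is used here so that a malformed title never aborts extraction; whether the real code prefers strict decoding is not stated in this note.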
3. Extraction Pipeline (crates/pdftract-core/src/extract.rs)
- Added threads: Vec<ThreadJson> field to ExtractionResult
- Implemented Phase 7.7 extraction logic:
  - Build page_ref_to_index map for O(1) page lookups
  - Call discover_threads to find thread headers
  - Call walk_beads for each thread to collect bead chains
  - Convert to ThreadJson via thread_to_json
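
The bead-walking step can be sketched as follows. In a PDF, the beads of a thread form a circular linked list through their /N (next) entries, so the walk stops once it returns to a bead it has already seen. Everything named here (`ObjRef` as a plain `u32`, `BeadObj`, `walk_beads_sketch`, and the tuple output) is a hypothetical stand-in; the signature of the real `walk_beads` is not shown in this note.

```rust
use std::collections::{HashMap, HashSet};

/// Assumption: a bare object number; the real ObjRef type may differ.
type ObjRef = u32;

/// Hypothetical, already-parsed bead dictionary.
pub struct BeadObj {
    pub next: ObjRef,   // /N: next bead in the chain
    pub page: ObjRef,   // /P: page the bead lies on
    pub rect: [f64; 4], // /R: bead rectangle
}

/// Follow /N links from `first`, resolving each page ref to a page index.
pub fn walk_beads_sketch(
    first: ObjRef,
    beads: &HashMap<ObjRef, BeadObj>,
    page_ref_to_index: &HashMap<ObjRef, usize>,
) -> Vec<(usize, [f64; 4])> {
    let mut out = Vec::new();
    let mut seen = HashSet::new();
    let mut current = first;
    // `insert` returns false on a revisit, which ends both well-formed
    // circular chains and malformed chains that loop early.
    while seen.insert(current) {
        let bead = match beads.get(&current) {
            Some(b) => b,
            None => break, // dangling reference: stop with what we have
        };
        // Beads whose page cannot be resolved are skipped in this sketch.
        if let Some(&index) = page_ref_to_index.get(&bead.page) {
            out.push((index, bead.rect));
        }
        current = bead.next;
    }
    out
}
```

The visited set bounds the walk at one pass per bead, so a corrupt /N pointer cannot cause an infinite loop.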
4. Parser Helper (crates/pdftract-core/src/parser/pages.rs)
- Added build_page_ref_to_index() helper function
- Creates HashMap<ObjRef, usize> mapping page object refs to indices
- Handles pages tree traversal
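
A minimal sketch of the idea, assuming a simplified in-memory page tree: leaf pages receive indices in document order as the tree is traversed depth-first. `PageTreeNode`, the `u32` alias for `ObjRef`, and `build_page_ref_to_index_sketch` are illustrative names, not the helper's actual signature.

```rust
use std::collections::{HashMap, HashSet};

/// Assumption: a bare object number; the real ObjRef type may differ.
type ObjRef = u32;

/// Hypothetical simplified page tree node.
pub enum PageTreeNode {
    /// Intermediate /Pages node with its /Kids in order.
    Pages { kids: Vec<ObjRef> },
    /// Leaf /Page node.
    Page,
}

/// Map each leaf page's object ref to its index in document order.
pub fn build_page_ref_to_index_sketch(
    root: ObjRef,
    nodes: &HashMap<ObjRef, PageTreeNode>,
) -> HashMap<ObjRef, usize> {
    let mut map = HashMap::new();
    let mut visited = HashSet::new();
    let mut stack = vec![root];
    while let Some(node_ref) = stack.pop() {
        // Skip nodes already seen so a cyclic /Kids entry cannot loop forever.
        if !visited.insert(node_ref) {
            continue;
        }
        match nodes.get(&node_ref) {
            Some(PageTreeNode::Page) => {
                let index = map.len();
                map.insert(node_ref, index);
            }
            Some(PageTreeNode::Pages { kids }) => {
                // Push kids reversed so the first kid is popped first.
                stack.extend(kids.iter().rev().copied());
            }
            None => {} // missing object: ignored in this sketch
        }
    }
    map
}
```

Building the map once up front is what makes each bead's page lookup O(1) during the walk.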
5. Markdown Sink (crates/pdftract-core/src/markdown.rs)
- Added threads_to_markdown() function
- Added collapse_page_ranges() helper for compact page display
- Handles duplicate page indices correctly
- Format: "## Article Threads\n\n1. Title (Author) - pages X-Y (N beads)"
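
One way the range collapsing and the per-thread line could work is sketched below. The function names carry a `_sketch` suffix because the real signatures are not shown in this note. Sorting before collapsing, the ", " separator between ranges, and the "Untitled" fallback are all assumptions; only the overall line format comes from the bullet above.

```rust
/// Collapse page numbers into compact ranges, e.g. [1, 2, 3, 5] -> "1-3, 5".
/// Assumption: duplicates are dropped and pages are sorted first.
pub fn collapse_page_ranges_sketch(page_numbers: &[usize]) -> String {
    let mut pages = page_numbers.to_vec();
    pages.sort_unstable();
    pages.dedup();

    let mut parts: Vec<String> = Vec::new();
    let mut i = 0;
    while i < pages.len() {
        let start = pages[i];
        let mut end = start;
        // Extend the current run while the next page is consecutive.
        while i + 1 < pages.len() && pages[i + 1] == end + 1 {
            i += 1;
            end = pages[i];
        }
        if start == end {
            parts.push(start.to_string());
        } else {
            parts.push(format!("{}-{}", start, end));
        }
        i += 1;
    }
    parts.join(", ")
}

/// Format one numbered list entry; `pages` holds one page number per bead.
pub fn thread_line_sketch(
    number: usize,
    title: Option<&str>,
    author: Option<&str>,
    pages: &[usize],
) -> String {
    let title = title.unwrap_or("Untitled"); // assumed fallback text
    let byline = match author {
        Some(a) => format!(" ({})", a),
        None => String::new(),
    };
    format!(
        "{}. {}{} - pages {} ({} beads)",
        number,
        title,
        byline,
        collapse_page_ranges_sketch(pages),
        pages.len()
    )
}
```

Deduplicating matters because several beads often sit on the same page; the bead count still reflects every bead, while the page list mentions each page once.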
6. JSON Schema (docs/schema/v1.0/pdftract.schema.json)
- Added ThreadJson definition to $defs
- Added BeadJson definition to $defs
- Integrated threads into extraction result schema
7. PyO3 Bindings (crates/pdftract-py/src/lib.rs)
- Added thread_to_py() and bead_to_py() conversion functions
- Integrated threads into extract() function's Python dict output
- Threads returned as list of dicts with title, author, subject, keywords, beads fields
- Beads returned as 2-key dicts (page_index, rect) per spec
8. Core Exports (crates/pdftract-core/src/lib.rs)
- Added ThreadJson, BeadJson to pub use schema exports
Testing Results
PASS: Threads module tests
- All 32 threads tests pass
- test_bead_new, test_decode_* (string decoding tests)
- test_discover_* (thread discovery tests)
- test_thread_header_* (header parsing tests)
- test_walk_beads_* (bead chain walking tests)
PASS: Markdown tests
- All 35 markdown span_tests pass
- test_threads_to_markdown_empty
- test_threads_to_markdown_single_thread
- test_threads_to_markdown_multiple_threads
- test_threads_to_markdown_untitled_thread
- test_collapse_page_ranges_single_page
- test_collapse_page_ranges_contiguous
- test_collapse_page_ranges_gaps
- test_collapse_page_ranges_mixed
PASS: Build verification
- pdftract-core compiles successfully
- pdftract-cli compiles successfully
- pdftract-py compiles successfully
PASS: Schema generation
- JSON schema updated with ThreadJson and BeadJson definitions
- Proper $ref integration in extraction result
Acceptance Criteria
- ✅ ThreadJson struct added with title, author, subject, keywords, beads fields
- ✅ BeadJson struct added with page_index, rect fields
- ✅ thread_to_json conversion function implemented
- ✅ ExtractionResult includes threads field
- ✅ Phase 7.7 extraction logic implemented in extract.rs
- ✅ JSON schema updated with ThreadJson and BeadJson definitions
- ✅ threads_to_markdown function implemented for markdown sink
- ✅ PyO3 bindings expose threads in extract() output
- ✅ All threads module tests pass (32/32)
- ✅ All markdown tests pass (35/35)
Code Changes Summary
- crates/pdftract-core/src/lib.rs: Added ThreadJson, BeadJson exports
- crates/pdftract-core/src/schema/mod.rs: Added ThreadJson, BeadJson structs
- crates/pdftract-core/src/threads/mod.rs: Added thread_to_json function
- crates/pdftract-core/src/parser/pages.rs: Added build_page_ref_to_index helper
- crates/pdftract-core/src/extract.rs: Added threads field and Phase 7.7 extraction
- crates/pdftract-core/src/markdown.rs: Added threads_to_markdown and collapse_page_ranges
- docs/schema/v1.0/pdftract.schema.json: Added ThreadJson, BeadJson schema definitions
- crates/pdftract-py/src/lib.rs: Added thread_to_py, bead_to_py, integrated into extract()
Files Modified
- crates/pdftract-core/src/lib.rs
- crates/pdftract-core/src/schema/mod.rs
- crates/pdftract-core/src/threads/mod.rs
- crates/pdftract-core/src/parser/pages.rs
- crates/pdftract-core/src/extract.rs
- crates/pdftract-core/src/markdown.rs
- docs/schema/v1.0/pdftract.schema.json
- crates/pdftract-py/src/lib.rs
Status
COMPLETE - All acceptance criteria met. Threads are now extracted from PDFs and available in JSON output, markdown sink, and Python bindings.