docs(pdftract-2kpm0): add verification note

jedarden 2026-05-25 11:24:53 -04:00
parent fa57ab3e90
commit 92b0643331

notes/pdftract-2kpm0.md Normal file

@@ -0,0 +1,67 @@
# Verification Note: pdftract-2kpm0
## Summary
Implemented NDJSON frame types with unified `NdjsonFrame` enum using serde internal tagging and `write_frame` helper function.
## Changes Made
### Core Implementation (`crates/pdftract-core/src/output/ndjson/frames.rs`)
- Added `NdjsonFrame` enum with serde internal tagging (`#[serde(tag = "frame", rename_all = "lowercase")]`)
  - `NdjsonFrame::Header(HeaderFrame)`
  - `NdjsonFrame::Page(PageFrame)`
  - `NdjsonFrame::Footer(FooterFrame)`
- Updated frame structs to remove `frame_type` field (now handled by enum tagging):
  - `HeaderFrame`: schema_version, metadata, outline, total_pages
  - `PageFrame`: page_index, page_type, spans, blocks, tables, annotations, errors
  - `FooterFrame`: extraction_quality, errors, threads, attachments, signatures, form_fields, links
- Added `write_frame<W: Write>()` helper function:
  - Serializes frame to JSON
  - Writes trailing newline
  - Flushes writer for immediate delivery to streaming consumers
- Added `#[serde(default)]` to optional fields for proper deserialization:
  - `PageFrame.annotations`, `PageFrame.errors`
  - `FooterFrame.threads`, `FooterFrame.attachments`, `FooterFrame.signatures`, `FooterFrame.form_fields`, `FooterFrame.links`
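
The write, newline, flush sequence of the helper can be sketched with the standard library alone. This is an illustration, not the code in `frames.rs`: `write_frame_line` and `CountingWriter` are hypothetical names, and the frame is passed in as an already-serialized JSON string, whereas the real `write_frame` takes an `NdjsonFrame` and serializes it through serde.

```rust
use std::io::{self, Write};

/// Writes one already-serialized JSON frame as a single NDJSON line.
/// Assumption: the real `write_frame` serializes an `NdjsonFrame` via serde;
/// the JSON arrives as a string here to keep the sketch dependency-free.
pub fn write_frame_line<W: Write>(writer: &mut W, json: &str) -> io::Result<()> {
    writer.write_all(json.as_bytes())?; // frame body
    writer.write_all(b"\n")?; // NDJSON line terminator
    writer.flush() // push the frame to the consumer before the next one is built
}

/// Hypothetical test double that records how often `flush` is called.
#[derive(Default)]
pub struct CountingWriter {
    pub bytes: Vec<u8>,
    pub flushes: usize,
}

impl Write for CountingWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.bytes.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        self.flushes += 1;
        Ok(())
    }
}
```

Writing two frames through a `CountingWriter` should leave two newline-terminated lines in `bytes` and a `flushes` count of 2, which is the kind of check `test_write_frame_includes_newline_and_flush` is described as making.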
### Module Exports (`crates/pdftract-core/src/output/ndjson/mod.rs`)
- Updated exports to include `NdjsonFrame` and `write_frame`
### Tests (`crates/pdftract-core/src/output/ndjson/frames.rs`)
- `test_ndjson_frame_header_discriminator`: Verifies "frame":"header" appears first
- `test_ndjson_frame_page_discriminator`: Verifies "frame":"page" appears first
- `test_ndjson_frame_footer_discriminator`: Verifies "frame":"footer" appears first
- `test_write_frame_includes_newline_and_flush`: Verifies write_frame behavior
- `test_roundtrip_header_frame`: Header serialization → deserialization → equality
- `test_roundtrip_page_frame`: Page serialization → deserialization → equality
- `test_roundtrip_footer_frame`: Footer serialization → deserialization → equality
- `test_page_frame_with_empty_collections`: Empty arrays are preserved in the output, while an empty `annotations` field is omitted
## Design Decisions
1. **Serde internal tagging**: Used `#[serde(tag = "frame")]` on the enum instead of per-struct fields. This ensures the "frame" key appears first in JSON output and is the standard serde pattern for discriminated unions.
2. **Retained `to_json_line()` methods**: These methods remain on the individual structs for backward compatibility, but the primary API is now `write_frame()` with `NdjsonFrame`.
3. **`#[serde(default)]` on optional fields**: Required for proper roundtrip deserialization since empty collections are skipped during serialization.
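
One practical consequence of the discriminator-first ordering in decision 1 is that a streaming consumer can classify a line from its prefix before parsing the whole payload. The sketch below is a hypothetical consumer-side helper, not part of this change. `frame_kind` is an invented name, and it assumes compact output with no whitespace between tokens (what `serde_json::to_string` produces by default).

```rust
/// Returns the frame discriminator (e.g. "header", "page", "footer") when the
/// line begins with the `"frame"` key, without parsing the rest of the JSON.
/// Assumption: compact JSON with no whitespace; pretty-printed output or a
/// discriminator that is not the first key yields `None`.
pub fn frame_kind(line: &str) -> Option<&str> {
    // Strip the fixed prefix `{"frame":"`; bail out if the key is not first.
    let rest = line.strip_prefix("{\"frame\":\"")?;
    // The discriminator runs up to the closing quote.
    let end = rest.find('"')?;
    Some(&rest[..end])
}
```

A consumer that needs the full payload would still hand the line to a JSON parser afterwards. The prefix check only makes cheap routing or filtering possible, and it returns `None` rather than guessing when the key is not first.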
## Acceptance Criteria
- [PASS] Roundtrip unit test: write HeaderFrame → parse → equal to original
- [PASS] Frame discriminator order: serialize Page frame → first key is "frame":"page"
- [PASS] Three frames emitted in expected sequence (existing tests verify)
- [PASS] Frame-by-frame writer respects flush after every frame (`write_frame` calls `flush()`)
## Files Modified
- `crates/pdftract-core/src/output/ndjson/frames.rs` - Added NdjsonFrame enum, write_frame helper, updated tests
- `crates/pdftract-core/src/output/ndjson/mod.rs` - Updated exports
## Commit
- `fa57ab3` - feat(pdftract-2kpm0): implement NdjsonFrame enum with internal-tag discriminator and write_frame helper