From 92b0643331a7f1f8a4e988edae2a499f7e0715b2 Mon Sep 17 00:00:00 2001
From: jedarden
Date: Mon, 25 May 2026 11:24:53 -0400
Subject: [PATCH] docs(pdftract-2kpm0): add verification note

---
 notes/pdftract-2kpm0.md | 67 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
 create mode 100644 notes/pdftract-2kpm0.md

diff --git a/notes/pdftract-2kpm0.md b/notes/pdftract-2kpm0.md
new file mode 100644
index 0000000..206189b
--- /dev/null
+++ b/notes/pdftract-2kpm0.md
@@ -0,0 +1,67 @@
+# Verification Note: pdftract-2kpm0
+
+## Summary
+
+Implemented the NDJSON frame types as a unified `NdjsonFrame` enum using serde internal tagging, plus a `write_frame` helper function.
+
+## Changes Made
+
+### Core Implementation (`crates/pdftract-core/src/output/ndjson/frames.rs`)
+
+- Added `NdjsonFrame` enum with serde internal tagging (`#[serde(tag = "frame", rename_all = "lowercase")]`)
+  - `NdjsonFrame::Header(HeaderFrame)`
+  - `NdjsonFrame::Page(PageFrame)`
+  - `NdjsonFrame::Footer(FooterFrame)`
+
+- Updated frame structs to remove the `frame_type` field (now handled by enum tagging):
+  - `HeaderFrame`: schema_version, metadata, outline, total_pages
+  - `PageFrame`: page_index, page_type, spans, blocks, tables, annotations, errors
+  - `FooterFrame`: extraction_quality, errors, threads, attachments, signatures, form_fields, links
+
+- Added `write_frame()` helper function:
+  - Serializes the frame to JSON
+  - Writes a trailing newline
+  - Flushes the writer for immediate delivery to streaming consumers
+
+- Added `#[serde(default)]` to optional fields so they deserialize correctly when absent:
+  - `PageFrame.annotations`, `PageFrame.errors`
+  - `FooterFrame.threads`, `FooterFrame.attachments`, `FooterFrame.signatures`, `FooterFrame.form_fields`, `FooterFrame.links`
+
+### Module Exports (`crates/pdftract-core/src/output/ndjson/mod.rs`)
+
+- Updated exports to include `NdjsonFrame` and `write_frame`
+
+### Tests (`crates/pdftract-core/src/output/ndjson/frames.rs`)
+
+- `test_ndjson_frame_header_discriminator`: Verifies "frame":"header" appears first
+- `test_ndjson_frame_page_discriminator`: Verifies "frame":"page" appears first
+- `test_ndjson_frame_footer_discriminator`: Verifies "frame":"footer" appears first
+- `test_write_frame_includes_newline_and_flush`: Verifies that `write_frame` appends a newline and flushes
+- `test_roundtrip_header_frame`: Header serialization → deserialization → equality
+- `test_roundtrip_page_frame`: Page serialization → deserialization → equality
+- `test_roundtrip_footer_frame`: Footer serialization → deserialization → equality
+- `test_page_frame_with_empty_collections`: Empty arrays preserved, empty annotations skipped
+
+## Design Decisions
+
+1. **Serde internal tagging**: Used `#[serde(tag = "frame")]` on the enum instead of per-struct fields. This ensures the "frame" key appears first in JSON output and is the standard serde pattern for discriminated unions.
+
+2. **Kept `to_json_line()` methods**: These methods remain on the individual structs for backward compatibility, but the primary API is now `write_frame()` with `NdjsonFrame`.
+
+3. **`#[serde(default)]` on optional fields**: Required for roundtrip deserialization, because empty collections are skipped during serialization and are therefore missing from the JSON being parsed.
+
+## Acceptance Criteria
+
+- [PASS] Roundtrip unit test: write HeaderFrame → parse → equal to original
+- [PASS] Frame discriminator order: serialize Page frame → first key is "frame":"page"
+- [PASS] Three frames emitted in expected sequence (existing tests verify)
+- [PASS] Frame-by-frame writer flushes after every frame (`write_frame` calls `flush()`)
+
+## Files Modified
+
+- `crates/pdftract-core/src/output/ndjson/frames.rs` - Added NdjsonFrame enum, write_frame helper, updated tests
+- `crates/pdftract-core/src/output/ndjson/mod.rs` - Updated exports
+
+## Commit
+
+- `fa57ab3` - feat(pdftract-2kpm0): implement NdjsonFrame enum with internal-tag discriminator and write_frame helper
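
Reviewer sketch (not part of the patch): the serialize, newline, flush contract that the note attributes to `write_frame()` can be illustrated with the standard library alone. This is a stand-in under stated assumptions, not the code in `frames.rs`. `Frame`, `to_json`, `write_frame_sketch`, and `demo_stream` are hypothetical names, each frame is cut down to at most one field, and the JSON is hand-formatted instead of being produced by serde, with the `frame` key placed first to mimic what internal tagging emits.

```rust
use std::io::{self, Write};

/// Hypothetical, heavily simplified stand-in for the note's `NdjsonFrame`.
/// The real frames carry many more fields (spans, blocks, tables, ...).
enum Frame {
    Header { total_pages: u32 },
    Page { page_index: u32 },
    Footer,
}

impl Frame {
    /// Hand-written JSON (no serde here): the "frame" discriminator is
    /// emitted as the first key, as internal tagging would do.
    fn to_json(&self) -> String {
        match self {
            Frame::Header { total_pages } => {
                format!(r#"{{"frame":"header","total_pages":{}}}"#, total_pages)
            }
            Frame::Page { page_index } => {
                format!(r#"{{"frame":"page","page_index":{}}}"#, page_index)
            }
            Frame::Footer => r#"{"frame":"footer"}"#.to_string(),
        }
    }
}

/// Write one frame as a single NDJSON line, then flush so a streaming
/// consumer sees the frame without waiting for later ones.
fn write_frame_sketch<W: Write>(writer: &mut W, frame: &Frame) -> io::Result<()> {
    writer.write_all(frame.to_json().as_bytes())?;
    writer.write_all(b"\n")?;
    writer.flush()
}

/// Emit header, page, footer in order into an in-memory buffer and
/// return the resulting NDJSON text.
fn demo_stream() -> io::Result<String> {
    let frames = vec![
        Frame::Header { total_pages: 1 },
        Frame::Page { page_index: 0 },
        Frame::Footer,
    ];
    let mut out: Vec<u8> = Vec::new();
    for frame in &frames {
        write_frame_sketch(&mut out, frame)?;
    }
    Ok(String::from_utf8_lossy(&out).into_owned())
}
```

Calling `demo_stream()` yields three lines, one JSON object per line, each starting with its `frame` key. Flushing inside the helper rather than leaving it to the caller trades some throughput for the immediate delivery that the note lists as a goal.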