The emit! macro expects diagnostic codes without the DiagCode:: prefix.
Changed three occurrences in codespace.rs:
- Line 281: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace
- Line 290: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace
- Line 412: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace
This fixes compilation errors that prevented the codebase from building.
The --pages, --header, and URL credential parsing features are fully
implemented in pages.rs, header.rs, and url.rs modules with comprehensive
tests and integration in main.rs, grep/mod.rs, and hash.rs.
References: pdftract-25igv, notes/pdftract-25igv.md
Add call-site diagnostic emission for DCTDecode SOI/EOI marker validation.
Previously, DCTDecoder.validate_markers() created diagnostics but they were
dropped because StreamDecoder trait doesn't support returning them. Now
diagnostics are emitted in decode_stream_impl() like JBIG2/JPX/CCITT.
Also include source module refactoring:
- Add PdfSource adapter trait for source::PdfSource compatibility
- Feature-gate http_range module with `remote` feature
- Update document.rs to use new source traits
Acceptance criteria:
- DCTDecode emits STREAM_INVALID_JPEG for missing SOI/EOI markers
- JBIG2Decode emits OCR_JBIG2_UNSUPPORTED when full-render disabled
- JPXDecode emits OCR_JPX_UNSUPPORTED and validates JP2 magic
- CCITTFaxDecode emits OCR_CCITT_UNSUPPORTED when no libtiff
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-4xmp6
Bead-Id: pdftract-57np8
Bead-Id: pdftract-3954u
The --header CLI flag implementation was already complete in the codebase.
This note documents the implementation and verifies all acceptance criteria.
Acceptance criteria verified:
- Single header with URL: PASS
- Multiple headers: PASS
- Managed header rejection: PASS
- CRLF injection protection: PASS
- No colon error: PASS
- Local file silent ignore: PASS
No new code was required - the feature was already fully implemented
in main.rs, header.rs, source/mod.rs, and http_range.rs.
Documents the implementation, acceptance criteria status, and design
decisions for the CMap codespace range parser.
Co-Authored-By: Claude Code <noreply@anthropic.com>
- Remove unused jpx::JpxDecoder import from stream.rs (code uses fully qualified paths)
- Add notes/pdftract-36glh.md with acceptance criteria verification
The JPXDecode passthrough implementation was already complete in commit 4ba4687.
This change is minor cleanup only.
References: pdftract-36glh
This commit fixes a compilation error in the javascript tests that were
using PageDict::default(). The JBIG2 decoder module was already fully
implemented; this change only enables the tests to compile and run.
Changes:
- Add Default impl for PageDict in parser/pages.rs
- Verify all 11 JBIG2-related tests pass
The JBIG2Decode passthrough filter implementation is complete:
- Passthrough of raw JBIG2 bytes
- /JBIG2Globals reference recording for downstream consumers
- OCR_JBIG2_UNSUPPORTED diagnostic emission when full-render disabled
Co-Authored-By: Claude Code <noreply@anthropic.com>
- Made map_error_to_exit_code() function public in hash.rs so it can be
called from main.rs
- Added test file test_hash_exit_codes.rs to verify exit code behavior
- Updated verification note with current implementation status
The hash subcommand was already implemented but map_error_to_exit_code
was private, causing a compilation error. This fix resolves the issue.
Related: pdftract-3954u
Implement O'Gorman 1993 Docstrum algorithm for reading order detection
on irregular layouts (magazines with sidebars) where XY-cut produces
fragmented regions.
Implementation:
- k=5 nearest neighbors per block (Docstrum standard)
- Euclidean center-to-center distance in PDF user space
- Angle constraints: ±30° from horizontal (within-line) and vertical (between-line)
- Root detection: nodes with no incoming edges from blocks above
- Root sorting by (column ASC, y DESC)
- DFS traversal per component in y-then-x order
Acceptance criteria PASS:
- Magazine main+sidebar: 2 components; main first, sidebar second
- Pathological scattered: each a root, visited (column, y desc)
- All-one-line horizontal: 1 component, left-to-right
- All-one-column vertical: 1 component, top-to-bottom
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The detect_xfa function was already implemented in the codebase at the
time of bead assignment. This note documents the verification of the
existing implementation against the bead's acceptance criteria.
All 6 tests pass, covering all acceptance criteria:
- XFA stream presence → true
- XFA array packet form → true
- No XFA key → false
- XFA null → false
- No AcroForm → false
- XFA as indirect reference → true
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Verification confirms the CLI parsing and validation for multi-format
output flags is already fully implemented in crates/pdftract-cli/src/output.rs.
All acceptance criteria verified:
- Duplicate format rejection ✓
- NDJSON exclusivity ✓
- At most one stdout ✓
- Auto-naming with --format + -o ✓
No code changes required.
Update the verification note for pdftract-2qw5j to clarify that the
bead's "Critical considerations" enum values differ from the actual
implementation:
- confidence_source: bead lists ["vector", "ocr", ...] but plan/Rust
code uses ["native", "heuristic", "ocr"] (per plan line 363)
- severity: bead omits "fatal" but Rust code includes it for
extraction-aborting conditions
The schema generation system is complete and correct per the plan
specification. The bead requirements appear to be from an earlier
spec version and are superseded by the plan.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add std::sync::Arc import for thread sharing
- Fix lifetime issue in test_sync_multiple_threads using Arc
- Add mut to source in test_empty_file for Read trait
All FileSource tests pass (12/12).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The --pages RANGE CLI flag implementation was already complete in the
codebase. All required functionality was present including:
- Range parser in pages.rs with comprehensive tests
- CLI integration in main.rs
- HTTP serve support in serve.rs
- MCP tools integration
- PyO3 bindings in pdftract-py
All acceptance criteria verified PASS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement FileSource as a PdfSource fallback for when memory-mapping
is not available or desired. Uses parking_lot::Mutex<File> for
thread-safe concurrent access across rayon workers.
Changes:
- Add parking_lot = "0.12" dependency to pdftract-core/Cargo.toml
- Rewrite FileSource to use Mutex<File> for Send + Sync support
- Implement PdfSource, Read, and Seek traits
- Add 12 comprehensive tests including concurrent read tests
All tests pass. Thread-safe concurrent access verified via
test_sync_multiple_threads and test_concurrent_read_range.
Co-Authored-By: Claude Code (claude-opus-4.7) <noreply@anthropic.com>
Bead-Id: pdftract-5ik66
- Implemented aes_128_decrypt with CBC mode + PKCS#7 padding
- Implemented derive_aes_128_object_key with 'sAlT' suffix
- Implemented is_identity_filter for crypt filter handling
- All 11 unit tests passing
- Integration work deferred to coordinator bead pdftract-1z0qt
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Define the PdfSource trait abstraction over PDF byte sources. This trait
provides a uniform API for reading PDF data from different sources:
local files (MmapSource, FileSource), and eventually remote HTTPS PDFs.
Trait features:
- Read + Seek + Send + Sync supertrait bounds for rayon page-parallelism
- len() returns total source length
- read_range() returns Bytes for zero-copy slicing
- prefetch() with no-op default (MmapSource overrides for MADV_SEQUENTIAL)
MmapSource:
- Memory-mapped file access via memmap2
- Applies MADV_SEQUENTIAL advice via prefetch()
- Zero-copy read_range() using Bytes::copy_from_slice()
- Fallback for platforms/filesystems where mmap fails
FileSource:
- Standard I/O implementation using std::fs::File
- Read+Seek delegation to underlying File
- read_range() uses try_clone() for thread-safe concurrent access
Re-exports from pdftract-core::source::PdfSource.
Verification note: notes/pdftract-1mmq9.md documents completion status.
Parser module migration to use new PdfSource is deferred to follow-up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Blocker identified:
- Extraction pipeline (extract.rs) doesn't use Phase 4 layout pipeline
- Column detection functions never called in production
- SpanJson.column hardcoded to None (lines 1059, 1916)
- No end-to-end tests for acceptance criteria
Span struct HAS column field (line 179) but extraction doesn't use it.
Coordinator CANNOT CLOSE - sub-phase not end-to-end functional.
The encrypt_padded_mut API requires the buffer to be large enough to
hold the padded ciphertext. The tests were using plaintext.to_vec() which
only allocated plaintext.len() bytes, insufficient for padding.
Changed pattern:
- Before: plaintext.to_vec() (insufficient space)
- After: vec![0u8; plaintext.len() + 16] with copy_from_slice
Also fixed incorrect usage: encrypt_padded_mut returns Result<(), Error>,
not a length. Use data_copy.len() directly for ciphertext length.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This bead asked for implementation of BMC/BDC/EMC marked-content
operators and MarkedContentStack, but these were already fully
implemented in the codebase with comprehensive test coverage.
Verification note documents:
- MarkedContentStack in marked_content_stack.rs
- BMC/BDC/EMC parsers in marked_content_operators.rs
- Integration into execute_with_do in content_stream.rs
- All 6 acceptance criteria covered by passing tests
- 57 marked-content tests all passing
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixed compilation errors in Span constructors by adding missing `column: None` field.
Verified that the existing multi-output CLI parsing implementation meets all
acceptance criteria for bead pdftract-37qim.
Changes:
- crates/pdftract-core/src/span/mod.rs: Add column field to new() and empty() constructors
Verification:
- All 23 output::tests pass
- CLI parsing validated for duplicate format detection, ndjson exclusivity, stdout uniqueness
- Format auto-naming (--format with -o) works correctly
- Default behavior (no flags -> JSON to stdout) confirmed
See notes/pdftract-37qim.md for detailed verification results.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All 4 children beads closed with verification:
- Line struct + baseline computation (pdftract-sdx9z)
- Baseline clustering algorithm (pdftract-6bwq4)
- Within-line span sorting (pdftract-1jkme)
- RTL direction detection (pdftract-1ofnz)
Acceptance criteria:
- ✅ All 4 children closed
- ✅ Two-column layout: columns NOT merged into one line (test_two_column_separate_blocks)
- ✅ Superscript span at higher y: clustered with baseline text
- ✅ Arabic text: bidi R characters detected, spans sorted right-to-left
- ✅ Mixed Latin+Arabic line: detected as "mixed" direction
44/44 tests pass in layout::line module.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add map_confidence_source to confidence module re-exports in lib.rs
- Remove duplicate map_confidence_source function from span/mod.rs
- Add Ocr case to map_unicode_source_to_confidence helper
- Add comprehensive tests for map_confidence_source in span module
The ConfidenceSource enum and map_confidence_source function were already
implemented in the confidence module from bead pdftract-2etcd. This change
completes the public API exposure and removes the duplicate implementation.
Acceptance criteria (all PASS):
- Single-glyph to_unicode span: confidence_source == Native
- Single-glyph shape_match span: confidence_source == Heuristic
- Mixed-glyph span (agl + shape_match): confidence_source == Heuristic
- 4.7 correction applied: Native -> Heuristic override
- OCR span: confidence_source == Ocr
- JSON serialization: lowercase strings
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OutputOptions struct with block-kind filtering and CLI flags
was already implemented in the codebase. All 8 acceptance criteria
tests pass.
- Struct defined in pdftract-core/src/options.rs
- CLI flags wired in pdftract-cli/src/main.rs
- Tests: default values, block kind filtering, span filtering
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The map_confidence_source function was already implemented in
crates/pdftract-core/src/confidence.rs with comprehensive tests.
All acceptance criteria PASS:
- Unit tests for all 12 (UnicodeSource, corrected) combinations
- ToUnicode + corrected=true correctly downgrades to Heuristic
- Ocr is unaffected by correction flag
- Exhaustive match enforces compiler completeness
- INV-9 mapping table documented
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The invisible text filter in serialize_page_text() was always recomputing
block text from spans, but when block.spans is empty (no span data available),
this produced empty text for all blocks. Added fallback to use pre-computed
block.text when span data is missing, maintaining backward compatibility.
Also added special case for figure blocks to always emit empty text regardless
of span data.
All 111 text module tests pass, including all invisible text filtering tests
for Tr=0-7 and include_invisible=true/false combinations.
Acceptance criteria PASS:
- rendering_mode 3 excluded by default: ✓
- rendering_mode 3 included when flagged: ✓
- Mixed block emits visible: ✓
- All-invisible block produces empty (no spurious \n\n): ✓
- Tr=4 treated same as Tr=3: ✓
Closes pdftract-38p8h
The MarkedContentStack implementation was already complete.
All 45 tests pass (20 stack tests + 25 operator parser tests).
Acceptance criteria:
- push_bmc 64 times → all push; 65th emits MARKED_CONTENT_DEPTH_EXCEEDED ✅
- push_bmc N then pop_emc N → empty stack ✅
- pop_emc on empty stack → EmcUnderflow diagnostic ✅
- top_mcid returns Some(mcid) when top has MCID; None when empty ✅
- Unit tests cover push/pop balance, overflow, underflow ✅
- INV-8 (no panic) verified on all stack operations ✅
See notes/pdftract-1qoeb.md for details.
The ConfidenceSource enum was already fully implemented with:
- Three variants (Native, Heuristic, Ocr) with lowercase serde
- Hash derive for HashMap usage
- Module docstring citing INV-9 stable taxonomy
- Public re-export in lib.rs
- All 4 tests passing
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The repair_split_ligatures function was previously implemented in
commit 8cfbe70 as part of pdftract-1jkme. This verification note
documents the implementation and confirms all acceptance criteria
are met.
Acceptance criteria:
- U+FFFD adjacent to 'i', gap 0.05pt: repaired to "fi"/"ffi" by shape
- U+FFFD with no nearby f/l/i: not repaired
- U+FFFD adjacent to 'f': shape match disambiguates ffi/ffl/fi
- Multiple U+FFFD in span: each evaluated
- Returns true on any repair
All criteria PASS.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>