Commit graph

544 commits

Author SHA1 Message Date
jedarden
db92403bd5 chore(pdftract-36glh): remove unused JpxDecoder import and add verification note
- Remove unused jpx::JpxDecoder import from stream.rs (code uses fully qualified paths)
- Add notes/pdftract-36glh.md with acceptance criteria verification

The JPXDecode passthrough implementation was already complete in commit 4ba4687.
This change is minor cleanup only.

References: pdftract-36glh
2026-05-28 05:23:13 -04:00
jedarden
4ba4687a36 feat(pdftract-36glh): implement JPXDecode passthrough with JP2 validation
Implements JPEG2000 (JPX) passthrough filter per Phase 1.5:

- JP2 box magic validation (12-byte signature check)
- STREAM_INVALID_JPX diagnostic for raw J2K/corrupt data
- OCR_JPX_UNSUPPORTED diagnostic when full-render+libopenjp2 unavailable
- Runtime libopenjp2 detection (pkg-config + ldconfig fallback)
- Passthrough behavior (raw bytes unchanged)

Module: crates/pdftract-core/src/decoder/jpx.rs
Stream integration: JpxStreamDecoder in parser/stream.rs

Acceptance criteria:
- JP2-wrapped JPX with full-render → passthrough, no diagnostic
- JP2-wrapped JPX without full-render → OCR_JPX_UNSUPPORTED
- Raw J2K codestream → STREAM_INVALID_JPX + passthrough
- Round-trip test coverage (unit tests validate JP2 signature)

Per plan EC-12: emits diagnostic when neither full-render nor
libopenjp2 is available, alerting Phase 5.2 OCR pipeline.
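The signature check and diagnostic selection described above can be sketched as follows. This is a minimal illustration, not the code in jpx.rs: the names `JpxKind`, `classify_jpx`, and `diagnostic_for` are hypothetical, and the single `decoder_available` flag is an assumption that collapses the full-render and libopenjp2 checks into one boolean. The byte constants are the standard JP2 signature box and the SOC+SIZ markers that open a raw codestream.

```rust
/// The 12-byte JP2 signature box: length 0x0000000C, type "jP  ", payload 0D 0A 87 0A.
const JP2_SIGNATURE: [u8; 12] = [
    0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20, 0x0D, 0x0A, 0x87, 0x0A,
];

/// A raw JPEG 2000 codestream opens with the SOC marker followed by SIZ.
const J2K_CODESTREAM_MAGIC: [u8; 4] = [0xFF, 0x4F, 0xFF, 0x51];

/// Hypothetical classification of a JPXDecode stream payload.
#[derive(Debug, PartialEq, Eq)]
pub enum JpxKind {
    Jp2Wrapped,
    RawCodestream,
    Invalid,
}

/// Classify the payload by its leading bytes only; the bytes are never modified.
pub fn classify_jpx(data: &[u8]) -> JpxKind {
    if data.starts_with(&JP2_SIGNATURE) {
        JpxKind::Jp2Wrapped
    } else if data.starts_with(&J2K_CODESTREAM_MAGIC) {
        JpxKind::RawCodestream
    } else {
        JpxKind::Invalid
    }
}

/// Pick the diagnostic code named in the commit message, if any.
/// `decoder_available` is an assumed flag meaning "a downstream JPX decoder can be used".
pub fn diagnostic_for(kind: &JpxKind, decoder_available: bool) -> Option<&'static str> {
    match kind {
        JpxKind::Jp2Wrapped if decoder_available => None,
        JpxKind::Jp2Wrapped => Some("OCR_JPX_UNSUPPORTED"),
        JpxKind::RawCodestream | JpxKind::Invalid => Some("STREAM_INVALID_JPX"),
    }
}
```

In every case the stream bytes pass through unchanged; only the diagnostic differs.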

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-28 05:11:19 -04:00
jedarden
b8a1b8f193 fix(pdftract-2sswr): add Default impl for PageDict to fix JBIG2 compilation
This commit fixes a compilation error in the javascript tests that were
using PageDict::default(). The JBIG2 decoder module was already fully
implemented; this change only enables the tests to compile and run.

Changes:
- Add Default impl for PageDict in parser/pages.rs
- Verify all 11 JBIG2-related tests pass

The JBIG2Decode passthrough filter implementation is complete:
- Passthrough of raw JBIG2 bytes
- /JBIG2Globals reference recording for downstream consumers
- OCR_JBIG2_UNSUPPORTED diagnostic emission when full-render disabled

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-28 04:44:45 -04:00
jedarden
2af3b0aeea fix(pdftract-3954u): make map_error_to_exit_code public in hash module
- Made map_error_to_exit_code() function public in hash.rs so it can be
  called from main.rs
- Added test file test_hash_exit_codes.rs to verify exit code behavior
- Updated verification note with current implementation status

The hash subcommand was already implemented but map_error_to_exit_code
was private, causing a compilation error. This fix resolves the issue.

Related: pdftract-3954u
2026-05-28 04:44:45 -04:00
jedarden
06079a16b2 feat(pdftract-4bylb): implement Docstrum fallback for reading order
Implement O'Gorman 1993 Docstrum algorithm for reading order detection
on irregular layouts (magazines with sidebars) where XY-cut produces
fragmented regions.

Implementation:
- k=5 nearest neighbors per block (Docstrum standard)
- Euclidean center-to-center distance in PDF user space
- Angle constraints: ±30° from horizontal (within-line) and vertical (between-line)
- Root detection: nodes with no incoming edges from blocks above
- Root sorting by (column ASC, y DESC)
- DFS traversal per component in y-then-x order
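The neighbor-search step listed above can be sketched roughly as below. It covers only the k-nearest-neighbor search and the angle filter, not root detection or the DFS. The names `Link` and `docstrum_links` are hypothetical, and folding the angle into the range 0 to 90 degrees is an assumption about how "±30° from horizontal/vertical" is measured.

```rust
/// Hypothetical edge label: neighbor lies roughly along the line, or roughly above/below.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Link {
    WithinLine,
    BetweenLine,
}

/// For each block center, take the `k` nearest other centers (Euclidean distance)
/// and keep those whose direction is within `tol_deg` of horizontal or vertical.
pub fn docstrum_links(centers: &[(f64, f64)], k: usize, tol_deg: f64) -> Vec<Vec<(usize, Link)>> {
    let mut out = Vec::with_capacity(centers.len());
    for (i, &(xi, yi)) in centers.iter().enumerate() {
        // Distance from block i to every other block.
        let mut candidates: Vec<(f64, usize)> = centers
            .iter()
            .enumerate()
            .filter(|&(j, _)| j != i)
            .map(|(j, &(xj, yj))| ((xj - xi).hypot(yj - yi), j))
            .collect();
        // Stable sort, so equidistant neighbors keep their index order.
        candidates.sort_by(|a, b| a.0.total_cmp(&b.0));

        let links = candidates
            .into_iter()
            .take(k)
            .filter_map(|(_, j)| {
                let (xj, yj) = centers[j];
                // Angle from horizontal, folded into [0, 90] degrees.
                let angle = (yj - yi).abs().atan2((xj - xi).abs()).to_degrees();
                if angle <= tol_deg {
                    Some((j, Link::WithinLine))
                } else if angle >= 90.0 - tol_deg {
                    Some((j, Link::BetweenLine))
                } else {
                    None // diagonal neighbor: rejected by the angle constraint
                }
            })
            .collect();
        out.push(links);
    }
    out
}
```

With `k = 5` and `tol_deg = 30.0` this matches the parameters in the commit message; a neighbor at 45 degrees is dropped, which is what keeps sidebar blocks from being linked diagonally into the main column.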

Acceptance criteria PASS:
- Magazine main+sidebar: 2 components; main first, sidebar second
- Pathological scattered: each a root, visited (column, y desc)
- All-one-line horizontal: 1 component, left-to-right
- All-one-column vertical: 1 component, top-to-bottom

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 04:16:24 -04:00
jedarden
35f5ac9594 docs(pdftract-2cnmr): add verification note for PdfSource trait implementation
2026-05-28 03:50:05 -04:00
jedarden
a65cae14a8 feat(pdftract-2bs4j): implement PDF/A conformance detection via XMP parsing
- Add detect_conformance() to parse pdfaid:part and pdfaid:conformance from XMP /Metadata stream
- Support all PDF/A levels: 1a/b, 2a/b/u/f, 3a/b/u/f, 4e/f
- Namespace-agnostic matching handles any prefix (pdfaid, x, foo, etc.)
- Graceful failure: malformed XML returns None (INV-8 compliant)
- quick-xml already in default dependencies (line 46 of Cargo.toml)
- 15 comprehensive tests covering all acceptance criteria

Acceptance criteria status:
- PDF/A-1b, 2u, 3a, 4e, 4f detection: PASS
- Part-only detection: PASS
- No metadata/malformed XML: PASS
- Different namespace prefixes: PASS

Verification note: notes/pdftract-2bs4j.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:36:59 -04:00
jedarden
a0bdefb010 docs(pdftract-342k4): add verification note for XFA detection
The detect_xfa function was already implemented in the codebase at the
time of bead assignment. This note documents the verification of the
existing implementation against the bead's acceptance criteria.

All 6 tests pass, covering all acceptance criteria:
- XFA stream presence → true
- XFA array packet form → true
- No XFA key → false
- XFA null → false
- No AcroForm → false
- XFA as indirect reference → true

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:36:57 -04:00
jedarden
17bfa273b0 docs(pdftract-37qim): add verification note for CLI multi-output parsing
Verification confirms the CLI parsing and validation for multi-format
output flags is already fully implemented in crates/pdftract-cli/src/output.rs.

All acceptance criteria verified:
- Duplicate format rejection ✓
- NDJSON exclusivity ✓
- At most one stdout ✓
- Auto-naming with --format + -o ✓

No code changes required.
2026-05-28 03:22:47 -04:00
jedarden
f9b3cbee76 docs(pdftract-2vd1y): verify JavaScript detection implementation
The JavaScript presence detection module was already complete in
crates/pdftract-core/src/javascript.rs. Verified all acceptance criteria:

- Catalog /OpenAction /S /JavaScript → detected
- Page /AA /O /S /JS → detected
- AcroForm field /AA /K /S /JavaScript → detected
- Annotation /A /S /JavaScript → detected
- /Next-chained actions → detected
- Cyclic /Next → bounded by visited set
- No JS present → returns false
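The "bounded by visited set" criterion can be illustrated with a small sketch. The action graph here is a hypothetical stand-in (a map from object number to an `Action` with a subtype and its /Next targets); the real module walks PDF dictionaries, which this example does not model.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified action node: its /S subtype and the object ids in /Next.
pub struct Action {
    pub subtype: String,
    pub next: Vec<u32>,
}

/// Follow /Next chains from `start`, returning true if any reachable action is JavaScript.
/// The visited set guarantees termination even when /Next forms a cycle.
pub fn chain_has_javascript(actions: &HashMap<u32, Action>, start: u32) -> bool {
    let mut visited = HashSet::new();
    let mut stack = vec![start];
    while let Some(id) = stack.pop() {
        if !visited.insert(id) {
            continue; // already seen: skip instead of looping forever
        }
        if let Some(action) = actions.get(&id) {
            if action.subtype == "JavaScript" {
                return true;
            }
            stack.extend(action.next.iter().copied());
        }
    }
    false
}
```

Each object id is expanded at most once, so the walk is linear in the number of actions regardless of how the chain is wired.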

All 16 JavaScript tests pass. Created verification note documenting
the implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:22:36 -04:00
jedarden
851439c6b1 docs(pdftract-4cpo8): add verification note for block-kind markdown dispatch
The block-kind to Markdown emission dispatch is already fully implemented
in crates/pdftract-core/src/markdown.rs. All acceptance criteria are met:
- Heading H1: "# Title\n\n"
- Paragraph soft breaks: "  \n" markers
- Nested lists: 2-space indentation
- Numbered lists: preserves source numbering
- Code fences: language detection
- Inline/display formulas: $/$$ delimiters
- Table: GFM pipe tables with HTML fallback
- Include/exclude: header/footer/watermark filtering
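A reduced sketch of this kind of dispatch, covering only three of the block kinds listed above. The `Block` enum and `emit` function are hypothetical and much smaller than whatever markdown.rs defines; the output strings follow the acceptance criteria (H1 as "# Title\n\n", soft breaks as two spaces plus newline, 2-space list nesting).

```rust
/// Hypothetical, reduced block model for illustration.
pub enum Block {
    Heading { level: usize, text: String },
    Paragraph { lines: Vec<String> },
    ListItem { depth: usize, text: String },
}

/// Emit one block as Markdown.
pub fn emit(block: &Block) -> String {
    match block {
        // "#" repeated `level` times, then a blank line after the heading.
        Block::Heading { level, text } => format!("{} {}\n\n", "#".repeat(*level), text),
        // Lines inside a paragraph are joined with a Markdown soft break ("  \n").
        Block::Paragraph { lines } => format!("{}\n\n", lines.join("  \n")),
        // Each nesting level adds two spaces of indentation.
        Block::ListItem { depth, text } => format!("{}- {}\n", "  ".repeat(*depth), text),
    }
}
```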

100+ test cases cover all block kinds and edge cases.
2026-05-28 03:22:36 -04:00
jedarden
a62913f25d feat(pdftract-1z0qt): implement encryption detection + RC4/AES-128/AES-256 decryption
Implement decrypt feature with RC4, AES-128, and AES-256 decryption support
for encrypted PDFs per PDF 1.7/2.0 spec.

Core components:
- detection.rs: Parse /Encrypt dictionary, validate encryption metadata
- rc4.rs: V=1 R=2 (40-bit) and V=2 R=3 (40-128 bit) key derivation
- aes_128.rs: V=4 R=4 AES-128 CBC with PKCS#7 padding
- aes_256.rs: V=5 R=5/6 AES-256 with SHA-256/384/512 key derivation
- decryptor.rs: Unified API for password validation and stream/string decryption
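For orientation, the RC4 stream cipher itself is short enough to sketch in full. This is the textbook algorithm (key scheduling followed by keystream generation), not the contents of rc4.rs, and it omits the PDF-specific key derivation entirely. The function name and the `Option` return for an empty key are choices made for this example.

```rust
/// Textbook RC4: returns `data` XORed with the keystream derived from `key`.
/// Encryption and decryption are the same operation. Returns None for an empty key.
pub fn rc4(key: &[u8], data: &[u8]) -> Option<Vec<u8>> {
    if key.is_empty() {
        return None;
    }
    // Key-scheduling algorithm (KSA): permute the identity table using the key.
    let mut s = [0u8; 256];
    for (i, slot) in s.iter_mut().enumerate() {
        *slot = i as u8;
    }
    let mut j: u8 = 0;
    for i in 0..256 {
        j = j.wrapping_add(s[i]).wrapping_add(key[i % key.len()]);
        s.swap(i, j as usize);
    }
    // Pseudo-random generation (PRGA): one keystream byte per input byte.
    let (mut i, mut j) = (0u8, 0u8);
    let out = data
        .iter()
        .map(|&byte| {
            i = i.wrapping_add(1);
            j = j.wrapping_add(s[i as usize]);
            s.swap(i as usize, j as usize);
            let k = s[s[i as usize].wrapping_add(s[j as usize]) as usize];
            byte ^ k
        })
        .collect();
    Some(out)
}
```

Because the cipher is symmetric, applying `rc4` twice with the same key returns the original bytes, which is a convenient property for round-trip tests.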

Integration:
- extract_pdf: Detect encryption and validate passwords after xref loading
- CLI: Exit code 3 for encryption errors (wrong password, unsupported)
- Password sources: --password-stdin, PDFTRACT_PASSWORD, --password VALUE (opt-in)

Password validation: the empty string is tried first, then the
user-provided password. A wrong password emits the
ENCRYPTION_UNSUPPORTED diagnostic and exits with code 3.

Tests: Unit tests for RC4, AES-128, AES-256 key derivation and
validation. All pass with `cargo test --features decrypt`.

Refs: Plan Phase 1.4 line 1114, EC-04/EC-05/EC-06, PDF spec 7.6

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-28 03:22:36 -04:00
jedarden
5a9648f404 docs(pdftract-2qw5j): clarify enum value discrepancy in verification note
Update the verification note for pdftract-2qw5j to clarify that the
bead's "Critical considerations" enum values differ from the actual
implementation:

- confidence_source: bead lists ["vector", "ocr", ...] but plan/Rust
  code uses ["native", "heuristic", "ocr"] (per plan line 363)
- severity: bead omits "fatal" but Rust code includes it for
  extraction-aborting conditions

The schema generation system is complete and correct per the plan
specification. The bead requirements appear to be from an earlier
spec version and are superseded by the plan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:52:12 -04:00
jedarden
23322f79d1 feat(pdftract-2qw5j): add explicit enum constraints to JSON Schema
Add explicit enum constraints to page_type, severity, and confidence_source
fields in the generated JSON Schema for better validation.

Changes:
- Modified xtask/src/bin/gen_schema.rs to add explicit enum constraints
  during schema generation via add_enum_constraints() function
- page_type enum: ["text", "scanned", "mixed", "broken_vector", "blank", "figure_only"]
- severity enum: ["info", "warning", "error", "fatal"]
- confidence_source enum: ["native", "heuristic", "ocr"]
- Regenerated docs/schema/v1.0/pdftract.schema.json with enum constraints
- Added .github/workflows/schema-gen.yml CI workflow for schema validation

The CI workflow validates:
1. Generated schema matches committed file (fails on diff)
2. JSON syntax is valid
3. Schema structure is correct ($id, $schema, title, $defs)
4. Enum constraints are present and have correct values

This ensures schema changes are reviewable in PRs and forces
developers to commit the updated schema when type definitions change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:47:54 -04:00
jedarden
ede9bebb8d docs(pdftract-2qw5j): add verification note for schema generation
Verified that the JSON schema generation system is fully implemented:
- xtask gen-schema produces valid JSON Schema Draft 2020-12
- Committed schema matches generated output (no diffs)
- CI gate enforces schema sync (quality-matrix/schema-gen template)
- All required enum values present (page_type with broken_vector, confidence_source, severity)
- Schema metadata correct ($id, $schema, title, description)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:31:33 -04:00
jedarden
ba5d101840 test(pdftract-1uhee): fix MmapSource test assertions
- test_open_valid_file: byte string is 22 bytes, not 20
- test_seek_from_end: seeking -2 from end of "Hello" gives "lo", not "el"
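The corrected seek assertion can be reproduced with the standard library alone. In this sketch `read_tail` is a hypothetical helper and any `Read + Seek` source stands in for MmapSource: seeking 2 bytes back from the end of the 5-byte string "Hello" lands on index 3, so the remaining bytes are "lo".

```rust
use std::io::{Read, Seek, SeekFrom};

/// Seek `n` bytes back from the end of `src` and read everything from there.
pub fn read_tail<R: Read + Seek>(src: &mut R, n: i64) -> std::io::Result<Vec<u8>> {
    // SeekFrom::End takes a signed offset; negative values move backwards from the end.
    src.seek(SeekFrom::End(-n))?;
    let mut buf = Vec::new();
    src.read_to_end(&mut buf)?;
    Ok(buf)
}
```

Called with `std::io::Cursor::new(b"Hello".to_vec())` and `n = 2`, this yields the bytes of "lo".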

The MmapSource implementation was already complete with all acceptance
criteria met:
- open() returns Ok/Err appropriately
- read_range() with bounds checking
- len() matches file size
- Read+Seek trait implementations
- Send + Sync for concurrent access
- MADV_SEQUENTIAL via advise_sequential()

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:29:42 -04:00
jedarden
ae9e478405 docs(pdftract-2qw5j): regenerate JSON schema from updated Rust types
The schema now reflects the latest doc comments from the Rust types,
including updated descriptions for annotations and other fields.

Changes:
- AnnotationJson description updates (phase 7.6.4 reference)
- Format consistency updates (float vs double)
- Subtype-specific field documentation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:25:00 -04:00
jedarden
502fc153e4 docs(pdftract-16h0a): update verification note
Update verification note to reflect completed implementation.
All acceptance criteria PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:21:23 -04:00
jedarden
7b288ce234 ci(pdftract-16h0a): add schema-gen CI gate
Add schema-gen step to quality-matrix that regenerates
docs/schema/v1.0/pdftract.schema.json and compares to committed file.
Fails build on any diff with actionable error message.

Bead: pdftract-16h0a (Phase 6.1.3)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:20:46 -04:00
jedarden
823712d65c fix(pdftract-1psmn): fix mmap test compilation errors
- Add std::sync::Arc import for thread sharing
- Fix lifetime issue in test_sync_multiple_threads using Arc
- Add mut to source in test_empty_file for Read trait

All FileSource tests pass (12/12).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:19:44 -04:00
jedarden
a2da014936 docs(pdftract-2wdjp): add verification note for pages range flag
The --pages RANGE CLI flag implementation was already complete in the
codebase. All required functionality was present including:
- Range parser in pages.rs with comprehensive tests
- CLI integration in main.rs
- HTTP serve support in serve.rs
- MCP tools integration
- PyO3 bindings in pdftract-py

All acceptance criteria verified PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:13:01 -04:00
jedarden
4702ecc66f feat(pdftract-1psmn): implement FileSource with parking_lot::Mutex
Implement FileSource as a PdfSource fallback for when memory-mapping
is not available or desired. Uses parking_lot::Mutex<File> for
thread-safe concurrent access across rayon workers.

Changes:
- Add parking_lot = "0.12" dependency to pdftract-core/Cargo.toml
- Rewrite FileSource to use Mutex<File> for Send + Sync support
- Implement PdfSource, Read, and Seek traits
- Add 12 comprehensive tests including concurrent read tests

All tests pass. Thread-safe concurrent access verified via
test_sync_multiple_threads and test_concurrent_read_range.
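The locking pattern can be sketched as below. Two substitutions are made so the example needs only the standard library: `std::sync::Mutex` replaces `parking_lot::Mutex`, and the wrapper is generic over any `Read + Seek` value instead of holding a `File`. The type name `LockedSource` is hypothetical.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::sync::Mutex;

/// A seekable reader behind a mutex, so `read_range` can be called through `&self`
/// from several threads at once.
pub struct LockedSource<R> {
    inner: Mutex<R>,
}

impl<R: Read + Seek> LockedSource<R> {
    pub fn new(inner: R) -> Self {
        Self { inner: Mutex::new(inner) }
    }

    /// Read exactly `len` bytes starting at `offset`.
    /// Seek and read happen under one lock, so concurrent callers cannot interleave them.
    pub fn read_range(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        let mut guard = self
            .inner
            .lock()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "source lock poisoned"))?;
        guard.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; len];
        guard.read_exact(&mut buf)?; // fails if the range runs past the end of the source
        Ok(buf)
    }
}
```

Holding the lock across both the seek and the read is the point of the design: the file cursor is shared state, so releasing the lock between the two calls would let another thread move it.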

Co-Authored-By: Claude Code (claude-opus-4.7) <noreply@anthropic.com>
Bead-Id: pdftract-5ik66
2026-05-28 02:13:01 -04:00
jedarden
6f55c8e188 docs(pdftract-495uv): add verification note for AES-128 decryption implementation
- Implemented aes_128_decrypt with CBC mode + PKCS#7 padding
- Implemented derive_aes_128_object_key with 'sAlT' suffix
- Implemented is_identity_filter for crypt filter handling
- All 11 unit tests passing
- Integration work deferred to coordinator bead pdftract-1z0qt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:04:56 -04:00
jedarden
5f9666f9b0 docs(pdftract-37qim): verify CLI parsing + validation for multi-output
Verification of bead pdftract-37qim. All acceptance criteria PASS:

- --json a.json --md b.md -> 2 OutputSpecs built
- --json a.json --json b.json -> duplicate format error
- --ndjson --md b.md -> cannot be combined error (critical test)
- --md - --json out.json -> 2 specs, MD=Stdout, JSON=File
- --md - --json - -> at most one stdout error
- --format json,md -o out -> 2 specs, out.json + out.md
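The validation rules exercised above can be sketched as follows. `OutputSpec` is named in the commit message, but its fields here, the `Format` and `Target` enums, and the `validate` and `auto_name` functions are hypothetical; the file extension used for NDJSON is likewise an assumption.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Format {
    Json,
    Md,
    Ndjson,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Target {
    Stdout,
    File(String),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OutputSpec {
    pub format: Format,
    pub target: Target,
}

/// Apply the three rules: no duplicate formats, NDJSON stands alone, at most one stdout.
pub fn validate(specs: &[OutputSpec]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for spec in specs {
        if !seen.insert(spec.format) {
            return Err(format!("duplicate format: {:?}", spec.format));
        }
    }
    if seen.contains(&Format::Ndjson) && specs.len() > 1 {
        return Err("ndjson cannot be combined with other outputs".to_string());
    }
    let stdout_count = specs.iter().filter(|s| s.target == Target::Stdout).count();
    if stdout_count > 1 {
        return Err("at most one output may write to stdout".to_string());
    }
    Ok(())
}

/// Build specs for `--format a,b -o STEM`: one file per format, named STEM.<ext>.
pub fn auto_name(stem: &str, formats: &[Format]) -> Vec<OutputSpec> {
    formats
        .iter()
        .map(|&format| {
            let ext = match format {
                Format::Json => "json",
                Format::Md => "md",
                Format::Ndjson => "ndjson", // assumed extension
            };
            OutputSpec { format, target: Target::File(format!("{stem}.{ext}")) }
        })
        .collect()
}
```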

Implementation was already complete in crates/pdftract-cli/src/output.rs.
Verified with both unit tests (23/23 pass) and manual CLI testing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:04:50 -04:00
jedarden
f106b5df02 feat(pdftract-1mmq9): add PdfSource trait with MmapSource and FileSource implementations
Define the PdfSource trait abstraction over PDF byte sources. This trait
provides a uniform API for reading PDF data from different sources:
local files (MmapSource, FileSource), and eventually remote HTTPS PDFs.

Trait features:
- Read + Seek + Send + Sync supertrait bounds for rayon page-parallelism
- len() returns total source length
- read_range() returns Bytes for zero-copy slicing
- prefetch() with no-op default (MmapSource overrides for MADV_SEQUENTIAL)

MmapSource:
- Memory-mapped file access via memmap2
- Applies MADV_SEQUENTIAL advice via prefetch()
- read_range() copies the requested range into Bytes via Bytes::copy_from_slice()
- Fallback for platforms/filesystems where mmap fails

FileSource:
- Standard I/O implementation using std::fs::File
- Read+Seek delegation to underlying File
- read_range() uses try_clone() for thread-safe concurrent access

Re-exports from pdftract-core::source::PdfSource.

Verification note: notes/pdftract-1mmq9.md documents completion status.
Parser module migration to use new PdfSource is deferred to follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:57:25 -04:00
jedarden
c5440d115a fix(pdftract-495uv): AES-128 test buffer allocation for PKCS#7 padding
Fixed test_aes_128_decrypt_roundtrip_with_valid_padding and two similar
tests to use the ciphertext slice returned by encrypt_padded_mut instead of
the entire buffer. The buffer is over-allocated to accommodate padding, but
only the returned slice contains valid ciphertext. Using the entire buffer
included trailing zeros that caused decryption to fail with invalid padding.
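Why a stray extra block breaks decryption is easiest to see in the padding check itself. The sketch below is a standalone PKCS#7 validator operating on already-decrypted bytes; `pkcs7_unpad` is a hypothetical helper, not the API of the cipher crate used in the tests.

```rust
/// Strip PKCS#7 padding from decrypted data, or return None if the padding is invalid.
/// `block` is the cipher block size in bytes (16 for AES).
pub fn pkcs7_unpad(data: &[u8], block: usize) -> Option<&[u8]> {
    if data.is_empty() || block == 0 || block > 255 || data.len() % block != 0 {
        return None;
    }
    // The last byte states how many padding bytes were appended (1..=block).
    let pad = *data.last()? as usize;
    if pad == 0 || pad > block {
        return None;
    }
    let (body, tail) = data.split_at(data.len() - pad);
    // Every padding byte must carry the same value as the count.
    if tail.iter().all(|&b| b as usize == pad) {
        Some(body)
    } else {
        None
    }
}
```

A 5-byte plaintext padded to 16 bytes ends in eleven 0x0B bytes and unpads cleanly. If a further block that is not valid padding follows (zero bytes are used in the test as a stand-in for whatever the surplus block decrypts to), the final byte no longer describes a valid pad length and the whole buffer is rejected.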

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:56:26 -04:00
jedarden
899ee1685b docs(pdftract-5ik66): add Phase 7.8 coordinator verification note
All 10 child beads closed, 74 module tests pass, CLI builds.
WARN: corpus-based performance tests not testable (empty corpus),
missing grep-progress.schema.json (child bead closed anyway).
2026-05-28 01:56:26 -04:00
jedarden
18af6bb01d docs(pdftract-63ka2): update verification note - extraction pipeline missing Phase 4 integration
Blocker identified:
- Extraction pipeline (extract.rs) doesn't use Phase 4 layout pipeline
- Column detection functions never called in production
- SpanJson.column hardcoded to None (lines 1059, 1916)
- No end-to-end tests for acceptance criteria

Span struct HAS column field (line 179) but extraction doesn't use it.
Coordinator CANNOT CLOSE - sub-phase not end-to-end functional.
2026-05-28 01:47:50 -04:00
jedarden
883d7d68b2 docs(pdftract-2k3ms): add verification note for Phase 3.4 Marked Content Tracking coordinator
- Verify all 3 children closed (pdftract-1l6wn, pdftract-64atr, pdftract-1q19p)
- Verify nested BDC: innermost MCID wins (MarkedContentStack::innermost_mcid)
- Verify EMC without BMC: ignored, no panic (pop_emc returns None with diagnostic)
- Verify MCID 0: valid (Option<u32> allows Some(0))
- Verify OCG default OFF: glyphs emitted with is_hidden flag
- Document 68 passing tests (18 stack + 30 operator + 20 OCG)

Closes: pdftract-2k3ms
2026-05-28 01:37:17 -04:00
jedarden
7ffb1a729f fix(pdftract-63ka2): AES-128 test buffer allocation for PKCS#7 padding
The encrypt_padded_mut API requires the buffer to be large enough to
hold the padded ciphertext. The tests were using plaintext.to_vec() which
only allocated plaintext.len() bytes, insufficient for padding.

Changed pattern:
- Before: plaintext.to_vec() (insufficient space)
- After: vec![0u8; plaintext.len() + 16] with copy_from_slice

Also fixed incorrect usage: encrypt_padded_mut returns Result<(), Error>,
not a length. Use data_copy.len() directly for ciphertext length.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:30:33 -04:00
jedarden
0371815f9b docs(pdftract-1l6wn): verify BMC/BDC/EMC operators already implemented
This bead asked for implementation of BMC/BDC/EMC marked-content
operators and MarkedContentStack, but these were already fully
implemented in the codebase with comprehensive test coverage.

Verification note documents:
- MarkedContentStack in marked_content_stack.rs
- BMC/BDC/EMC parsers in marked_content_operators.rs
- Integration into execute_with_do in content_stream.rs
- All 6 acceptance criteria covered by passing tests
- 57 marked-content tests all passing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:29:07 -04:00
jedarden
fa95e9649e fix(pdftract-37qim): fix span compilation errors, verify multi-output CLI parsing
Fixed compilation errors in Span constructors by adding missing `column: None` field.
Verified that the existing multi-output CLI parsing implementation meets all
acceptance criteria for bead pdftract-37qim.

Changes:
- crates/pdftract-core/src/span/mod.rs: Add column field to new() and empty() constructors

Verification:
- All 23 output::tests pass
- CLI parsing validated for duplicate format detection, ndjson exclusivity, stdout uniqueness
- Format auto-naming (--format with -o) works correctly
- Default behavior (no flags -> JSON to stdout) confirmed

See notes/pdftract-37qim.md for detailed verification results.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:29:07 -04:00
jedarden
9f377d1609 docs(pdftract-53liu): verify Phase 4.2 Line Formation coordinator
All 4 child beads closed with verification:
- Line struct + baseline computation (pdftract-sdx9z)
- Baseline clustering algorithm (pdftract-6bwq4)
- Within-line span sorting (pdftract-1jkme)
- RTL direction detection (pdftract-1ofnz)

Acceptance criteria:
- All 4 children closed
- Two-column layout: columns NOT merged into one line (test_two_column_separate_blocks)
- Superscript span at higher y: clustered with baseline text
- Arabic text: bidi R characters detected, spans sorted right-to-left
- Mixed Latin+Arabic line: detected as "mixed" direction

44/44 tests pass in layout::line module.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:15:31 -04:00
jedarden
96e3cc8a91 docs(pdftract-5g6s5): add verification note for Phase 4.1 coordinator
All 5 child beads verified closed:
- pdftract-31ag5: Span struct definition
- pdftract-3zz9n: 5-trigger break detector + glyph-to-span merger
- pdftract-cbrbg: Span flag detector
- pdftract-1f8we: ConfidenceSource enum + mapping
- pdftract-2c5sx: Span text assembly

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:12:08 -04:00
jedarden
49859e176f docs(pdftract-1f8we): verify ConfidenceSource enum and mapping implementation
Verified that ConfidenceSource enum and map_confidence_source function
are already fully implemented in crates/pdftract-core/src/confidence.rs.

All acceptance criteria PASS:
- Single-glyph to_unicode → Native
- Single-glyph shape_match → Heuristic
- Mixed-glyph (agl + shape_match) → Heuristic (worst)
- 4.7 correction on all-agl → Heuristic (override)
- OCR-produced span → Ocr
- JSON serialization lowercase

No code changes required - implementation was already complete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:10:16 -04:00
jedarden
5a7c25ead4 feat(pdftract-1f8we): add map_confidence_source to public API, remove duplicate from span module
- Add map_confidence_source to confidence module re-exports in lib.rs
- Remove duplicate map_confidence_source function from span/mod.rs
- Add Ocr case to map_unicode_source_to_confidence helper
- Add comprehensive tests for map_confidence_source in span module

The ConfidenceSource enum and map_confidence_source function were already
implemented in the confidence module from bead pdftract-2etcd. This change
completes the public API exposure and removes the duplicate implementation.

Acceptance criteria (all PASS):
- Single-glyph to_unicode span: confidence_source == Native
- Single-glyph shape_match span: confidence_source == Heuristic
- Mixed-glyph span (agl + shape_match): confidence_source == Heuristic
- 4.7 correction applied: Native -> Heuristic override
- OCR span: confidence_source == Ocr
- JSON serialization: lowercase strings

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:06:02 -04:00
jedarden
fe4dcdeaa8 docs(pdftract-2t1an): add verification note for encryption detection
Bead: pdftract-2t1an

Added verification note documenting the complete implementation of
encryption dictionary detection and EncryptionInfo struct.

All acceptance criteria PASS:
- V=1 R=2 RC4-40 detection (version=1, revision=2, key_length=40)
- V=5 R=6 AES-256 detection (version=5, revision=6, key_length=256)
- Non-Standard filter rejection with ENCRYPTION_UNSUPPORTED
- Invalid /O/U length handling with ENCRYPTION_INVALID_DICT
- Clean handling of missing /Encrypt key
- Unit tests covering all V/R combinations
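A compact way to picture the V/R classification is a single match over the pair. The sketch below uses the pairings that appear in this log (V=1 R=2 and V=2 R=3 for RC4, V=4 R=4 for AES-128, V=5 R=5/6 for AES-256); the `Algorithm` enum and `classify` function are hypothetical, and key-length handling and crypt-filter inspection are left out.

```rust
/// Hypothetical summary of the cipher implied by an /Encrypt dictionary.
#[derive(Debug, PartialEq, Eq)]
pub enum Algorithm {
    Rc4,
    Aes128,
    Aes256,
}

/// Map the /V (version) and /R (revision) entries to a cipher.
/// None means the combination is not one of the supported pairs.
pub fn classify(v: u32, r: u32) -> Option<Algorithm> {
    match (v, r) {
        (1, 2) | (2, 3) => Some(Algorithm::Rc4),
        (4, 4) => Some(Algorithm::Aes128),
        (5, 5) | (5, 6) => Some(Algorithm::Aes256),
        _ => None,
    }
}
```

Matching on the tuple keeps unsupported combinations in one explicit fallthrough arm, which is where a diagnostic such as ENCRYPTION_UNSUPPORTED would naturally be raised.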

Test results: 10/10 tests pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:00:22 -04:00
jedarden
6f86258a7a docs(pdftract-2bpzs): add verification note for OutputOptions implementation
The OutputOptions struct with block-kind filtering and CLI flags
was already implemented in the codebase. All 8 acceptance criteria
tests pass.

- Struct defined in pdftract-core/src/options.rs
- CLI flags wired in pdftract-cli/src/main.rs
- Tests: default values, block kind filtering, span filtering

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:52:55 -04:00
jedarden
3d8dc58541 docs(pdftract-2etcd): add verification note for map_confidence_source implementation
The map_confidence_source function was already implemented in
crates/pdftract-core/src/confidence.rs with comprehensive tests.
All acceptance criteria PASS:
- Unit tests for all 12 (UnicodeSource, corrected) combinations
- ToUnicode + corrected=true correctly downgrades to Heuristic
- Ocr is unaffected by correction flag
- Exhaustive match enforces compiler completeness
- INV-9 mapping table documented

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:48:48 -04:00
jedarden
b9b4f50ff8 feat(pdftract-2etcd): implement map_confidence_source function
Implement the map_confidence_source(unicode_source: UnicodeSource,
corrected_in_4_7: bool) -> ConfidenceSource function that collapses the
6 internal UnicodeSource variants down to the 3 schema-exposed
ConfidenceSource variants.

- Mapping follows INV-9 stable taxonomy
- Phase 4.7 correction override: corrected Unicode downgrades
  Native -> Heuristic
- OCR is never affected by corrections (corrections apply to vector
  text, not raster OCR output)
- Exhaustive match on UnicodeSource ensures compiler-enforced
  completeness

Acceptance criteria:
- Unit tests for all (UnicodeSource, corrected) combinations PASS
- ToUnicode + corrected=true → Heuristic (override applies)
- Ocr + corrected=true → Ocr (override does NOT apply)
- INV-9 mapping table documented in code comments
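The mapping and its override can be sketched as below. The function signature follows the commit message, but the enum here lists only four `UnicodeSource` variants (the real one has six, and the other two are not named in this log), and mapping `Agl` to `Native` is an inference from the acceptance criteria, not something the log states directly.

```rust
/// Reduced stand-in for the internal enum; the real type has six variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnicodeSource {
    ToUnicode,
    Agl,
    ShapeMatch,
    Ocr,
}

/// The three schema-exposed values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConfidenceSource {
    Native,
    Heuristic,
    Ocr,
}

pub fn map_confidence_source(
    unicode_source: UnicodeSource,
    corrected_in_4_7: bool,
) -> ConfidenceSource {
    // Exhaustive match: adding a UnicodeSource variant is a compile error until handled.
    let base = match unicode_source {
        UnicodeSource::ToUnicode | UnicodeSource::Agl => ConfidenceSource::Native, // Agl: assumed
        UnicodeSource::ShapeMatch => ConfidenceSource::Heuristic,
        UnicodeSource::Ocr => ConfidenceSource::Ocr,
    };
    // Correction override: only Native is downgraded; Heuristic and Ocr are unchanged.
    match (base, corrected_in_4_7) {
        (ConfidenceSource::Native, true) => ConfidenceSource::Heuristic,
        (other, _) => other,
    }
}
```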

Also fixed pre-existing compilation errors in encryption module:
- detection.rs: syntax error in PdfObject::Array construction
- mod.rs: removed duplicate EncryptionInfo struct definition

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:46:19 -04:00
jedarden
dddf81075f fix(pdftract-38p8h): add fallback for empty block.spans in invisible text filter
The invisible text filter in serialize_page_text() was always recomputing
block text from spans, but when block.spans is empty (no span data available),
this produced empty text for all blocks. Added fallback to use pre-computed
block.text when span data is missing, maintaining backward compatibility.

Also added special case for figure blocks to always emit empty text regardless
of span data.

All 111 text module tests pass, including all invisible text filtering tests
for Tr=0-7 and include_invisible=true/false combinations.

Acceptance criteria PASS:
- rendering_mode 3 excluded by default: ✓
- rendering_mode 3 included when flagged: ✓
- Mixed block emits visible: ✓
- All-invisible block produces empty (no spurious \n\n): ✓
- Tr=4 treated same as Tr=3: ✓

Closes pdftract-38p8h
2026-05-28 00:39:37 -04:00
jedarden
43e2e5a399 docs(pdftract-2bfgc): add sample nginx and Traefik reverse-proxy configs
Add two example reverse-proxy configuration files to help operators
deploy pdftract serve with TLS and authentication in front of the
no-auth pdftract server.

- docs/operations/serve-nginx-example.conf: nginx config with Basic Auth,
  proxy_pass to localhost:8080, /extract and /health endpoints
- docs/operations/serve-traefik-example.yaml: Traefik dynamic config with
  BasicAuth middleware, buffering limits, separate health router

Both configs include top comments explaining the deployment model:
pdftract serve binds to 127.0.0.1:8080 with no auth; the reverse
proxy provides TLS termination and authentication.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:37:34 -04:00
jedarden
0959da819e docs(pdftract-1qoeb): add verification note for marked-content stack
The MarkedContentStack implementation was already complete.
All 45 tests pass (20 stack tests + 25 operator parser tests).

Acceptance criteria:
- push_bmc 64 times → all push; 65th emits MARKED_CONTENT_DEPTH_EXCEEDED 
- push_bmc N then pop_emc N → empty stack 
- pop_emc on empty stack → EmcUnderflow diagnostic 
- top_mcid returns Some(mcid) when top has MCID; None when empty 
- Unit tests cover push/pop balance, overflow, underflow 
- INV-8 (no panic) verified on all stack operations 
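The stack behavior in these criteria can be sketched with a bounded `Vec`. The method names `push_bmc`, `pop_emc`, and `top_mcid` come from the list above; the signatures, the `StackDiagnostic` enum, and representing each entry as an `Option<u32>` MCID are assumptions made for this example.

```rust
/// Nesting limit taken from the acceptance criteria (the 65th push is rejected).
pub const MAX_DEPTH: usize = 64;

#[derive(Debug, PartialEq, Eq)]
pub enum StackDiagnostic {
    DepthExceeded, // stands in for MARKED_CONTENT_DEPTH_EXCEEDED
    EmcUnderflow,
}

/// Each entry holds the MCID of one open marked-content sequence, if it has one.
#[derive(Default)]
pub struct MarkedContentStack {
    entries: Vec<Option<u32>>,
}

impl MarkedContentStack {
    /// Open a sequence; refuses to grow past MAX_DEPTH instead of panicking.
    pub fn push_bmc(&mut self, mcid: Option<u32>) -> Result<(), StackDiagnostic> {
        if self.entries.len() >= MAX_DEPTH {
            return Err(StackDiagnostic::DepthExceeded);
        }
        self.entries.push(mcid);
        Ok(())
    }

    /// Close the innermost sequence; an unmatched EMC yields a diagnostic, not a panic.
    pub fn pop_emc(&mut self) -> Result<Option<u32>, StackDiagnostic> {
        self.entries.pop().ok_or(StackDiagnostic::EmcUnderflow)
    }

    /// MCID of the top entry, or None if the stack is empty or the top has no MCID.
    pub fn top_mcid(&self) -> Option<u32> {
        self.entries.last().copied().flatten()
    }

    pub fn is_empty(&self) -> bool {
        self.entries.is_empty()
    }
}
```

Storing `Option<u32>` keeps MCID 0 distinguishable from "no MCID", since `Some(0)` and `None` are different values.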

See notes/pdftract-1qoeb.md for details.
2026-05-28 00:35:29 -04:00
jedarden
b8d9b98155 docs(pdftract-1ofnz): add verification note
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:34:04 -04:00
jedarden
38b7496c70 feat(pdftract-1ofnz): implement detect_line_direction with unicode-bidi
- Add detect_line_direction() function using unicode_bidi::bidi_class
- Count L (LTR) vs R/AL (RTL) characters, return dominant direction
- Default to Ltr for empty/neutral-only strings (per bead acceptance criteria)
- Return Mixed only when LTR and RTL counts are tied (both > 0)
- Add comprehensive tests for Latin, Arabic, Hebrew, Cyrillic, and edge cases
- Fix header_footer test: remove nonexistent reading_order_rank field
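The counting rule can be sketched without the unicode-bidi crate. To stay dependency-free, this example swaps the real bidi class lookup for a rough approximation: an alphabetic character in the Hebrew/Arabic-family code point ranges counts as RTL, and any other alphabetic character counts as LTR. The `Direction` enum and the range table are assumptions of the sketch, not the crate's classification.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Direction {
    Ltr,
    Rtl,
    Mixed,
}

/// Rough stand-in for bidi classes R/AL: Hebrew through Arabic Extended-A,
/// plus the Hebrew and Arabic presentation-form blocks.
fn is_rtl(c: char) -> bool {
    matches!(c as u32, 0x0590..=0x08FF | 0xFB1D..=0xFDFF | 0xFE70..=0xFEFF)
}

/// Count strong LTR vs RTL letters and return the dominant direction.
/// Ties with at least one letter on each side are Mixed; no letters at all defaults to Ltr.
pub fn detect_line_direction(text: &str) -> Direction {
    let (mut ltr, mut rtl) = (0usize, 0usize);
    // Digits, punctuation, and whitespace are skipped as neutral.
    for c in text.chars().filter(|c| c.is_alphabetic()) {
        if is_rtl(c) {
            rtl += 1;
        } else {
            ltr += 1;
        }
    }
    if rtl > ltr {
        Direction::Rtl
    } else if rtl == ltr && rtl > 0 {
        Direction::Mixed
    } else {
        Direction::Ltr
    }
}
```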

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:33:49 -04:00
jedarden
55a612381b docs(pdftract-1qal2): add verification note for ConfidenceSource enum
The ConfidenceSource enum was already fully implemented with:
- Three variants (Native, Heuristic, Ocr) with lowercase serde
- Hash derive for HashMap usage
- Module docstring citing INV-9 stable taxonomy
- Public re-export in lib.rs
- All 4 tests passing

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:32:37 -04:00
jedarden
97c77a7b3e docs(pdftract-1ax1v): add verification note for ligature repair implementation
The repair_split_ligatures function was previously implemented in
commit 8cfbe70 as part of pdftract-1jkme. This verification note
documents the implementation and confirms all acceptance criteria
are met.

Acceptance criteria:
- U+FFFD adjacent to 'i', gap 0.05pt: repaired to "fi"/"ffi" by shape
- U+FFFD with no nearby f/l/i: not repaired
- U+FFFD adjacent to 'f': shape match disambiguates ffi/ffl/fi
- Multiple U+FFFD in span: each evaluated
- Returns true on any repair

All criteria PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:29:35 -04:00
jedarden
a3b12409d0 docs(pdftract-1q4ku): add verification note for score_span_readability
The score_span_readability function was fully implemented in
pdftract-oh30a (commit 9970935). This verification note documents
the implementation status and confirms all acceptance criteria pass.

Acceptance criteria:
- AC1: All-printable English high coverage -> > 0.9 ✓
- AC2: All-U+FFFD -> < 0.1 ✓
- AC3: All-whitespace -> whitespace_score=0 ✓
- AC4: Low confidence -> scaled by confidence_floor ✓
- AC5: Non-English -> dict forced 1.0 ✓
- AC6: Ligature split -> integrity 0 lowers score ✓

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:29:26 -04:00
jedarden
a7c8d58881 docs(pdftract-1jkme): add verification note for sort_spans_in_line
All acceptance criteria PASS. Function was already implemented correctly.
Only fix needed was adding Arc import to correction.rs test module.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:22:07 -04:00
jedarden
8cfbe70ab7 fix(pdftract-1jkme): add missing Arc import to correction.rs test module
The test module was using Arc::from("Helvetica") but Arc was not in scope.
Added `use std::sync::Arc;` to fix compilation errors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:21:46 -04:00