- Fixed duplicate Luma import: `use image::{GrayImage, ImageBuffer, Luma, Luma}` → `use image::{GrayImage, ImageBuffer, Luma}`
- Added re-exports in lib.rs for all preprocessing functions
- Updated verification note
The border padding, pipeline orchestration, and fixtures were already
implemented from previous work. This commit cleans up a minor duplicate
import issue.
Related: pdftract-27n3
Implement step 5 (white-border padding: 10 px on all sides), wire all
preprocessing steps into the final preprocess(input, ImageSource) ->
GrayImage entry point, and curate fixtures for the three image-source
paths (PhysicalScan / DigitalOrigin / Jbig2).
Changes:
- Add add_border_padding() function: creates (width+20) x (height+20)
image with 10px white border on all sides
- Add preprocess() pipeline orchestrator: applies deskew, contrast
normalization, binarization, denoising, and padding in correct order
- Skip contrast, binarization, and denoising for JBIG2 images
- Generate test fixtures for skewed_2deg, uneven_lighting, clean_digital,
and jbig2_scan scenarios
- Add integration tests for all critical test scenarios
- Add A4-page benchmarks targeting < 500ms for physical/digital, < 200ms
for JBIG2
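The padding step and the JBIG2 skip rule can be sketched with the standard
library only. This is an illustration, not the committed code: `Gray` is a
hypothetical stand-in for `GrayImage`, and `Step` and `steps_for` are invented
names. It also assumes deskew still runs for JBIG2 input, which the list above
implies but does not state.

```rust
/// Hypothetical stand-in for `GrayImage`: row-major 8-bit grayscale pixels.
#[derive(Debug, Clone, PartialEq)]
pub struct Gray {
    pub width: usize,
    pub height: usize,
    pub pixels: Vec<u8>,
}

const BORDER: usize = 10;
const WHITE: u8 = 255;

/// Returns a (width+20) x (height+20) image with a 10 px white border.
pub fn add_border_padding(src: &Gray) -> Gray {
    let width = src.width + 2 * BORDER;
    let height = src.height + 2 * BORDER;
    // Start from an all-white canvas so the border needs no separate pass.
    let mut pixels = vec![WHITE; width * height];
    if src.width > 0 {
        // Copy each source row into the canvas, shifted by the border.
        for (y, row) in src.pixels.chunks_exact(src.width).take(src.height).enumerate() {
            let start = (y + BORDER) * width + BORDER;
            pixels[start..start + src.width].copy_from_slice(row);
        }
    }
    Gray { width, height, pixels }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageSource {
    PhysicalScan,
    DigitalOrigin,
    Jbig2,
}

/// Hypothetical step labels, used only to show the ordering.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    Deskew,
    Contrast,
    Binarize,
    Denoise,
    Pad,
}

/// JBIG2 input skips contrast, binarization, and denoising.
pub fn steps_for(source: ImageSource) -> Vec<Step> {
    match source {
        ImageSource::Jbig2 => vec![Step::Deskew, Step::Pad],
        ImageSource::PhysicalScan | ImageSource::DigitalOrigin => vec![
            Step::Deskew,
            Step::Contrast,
            Step::Binarize,
            Step::Denoise,
            Step::Pad,
        ],
    }
}
```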
Refs:
- Plan section: Phase 5.3 step 5 (line 1878) + critical tests (lines 1882-1885)
- Bead: pdftract-27n3
- Note: notes/pdftract-27n3.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement Level 3 of the encoding fallback chain. Hash the raw decoded
font program bytes (/FontFile, /FontFile2, /FontFile3) with SHA-256
and look up the 32-byte digest in a compile-time phf::Map.
- build.rs: generate_font_fingerprints() reads JSON, builds phf::Map
- src/font/fingerprint.rs: FontFingerprint, CachedFingerprint, lookup API
- build/font-fingerprints.json: empty database (placeholder)
Acceptance criteria:
- Empty JSON produces valid phf::Map
- Hash is stable across runs
- Lookup of unknown digest returns None
- Binary footprint < 500KB for 200-font DB (empty = negligible)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add [package.metadata.binstall] to crates/pdftract-cli/Cargo.toml to enable
cargo binstall to download pre-built binaries from GitHub Releases instead
of compiling from source. Also add comprehensive Installation section to
README.md documenting cargo binstall as the recommended install method.
Bead: pdftract-1u80
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Added three new tests to verify the deskew acceptance criteria:
- test_deskew_2_degree_skew: Verifies 2-degree skew is deskewed within 0.1 deg
- test_deskew_0_2_degree_skew_skipped: Verifies 0.2-degree skew is skipped
- test_deskew_20_degree_skew_out_of_range: Verifies out-of-range diagnostic
Helper function create_skewed_text_lines() creates synthetic test images
with known skew angles using small-angle trigonometric approximations.
Note: Tests compile but cannot run without the leptonica library (NixOS limitation).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the deskew preprocessing step using leptonica's
pixFindSkewAndDeskew (sweep-and-search skew detection). The function:
- Detects dominant text angle on grayscale input
- Rotates by the negated angle when the detected skew is >= 0.3 deg
- Returns input unchanged for negligible skews (< 0.3 deg)
- Emits IMG_DESKEW_OUT_OF_RANGE diagnostic for angles > 15 deg
- Returns detected angle for quality tracking
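The threshold rules above can be sketched as a pure decision function. This is
a sketch only: `DeskewDecision` and `decide_deskew` are hypothetical names, and
it assumes an out-of-range angle is reported without rotating, which the list
does not state.

```rust
/// Hypothetical outcome of the threshold check.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeskewDecision {
    /// Skew below 0.3 deg: return the input unchanged.
    Skip,
    /// Rotate by the negated detected angle.
    Rotate { by_deg: f32 },
    /// Skew above 15 deg: emit IMG_DESKEW_OUT_OF_RANGE (assumed: no rotation).
    OutOfRange,
}

const MIN_SKEW_DEG: f32 = 0.3;
const MAX_SKEW_DEG: f32 = 15.0;

pub fn decide_deskew(detected_deg: f32) -> DeskewDecision {
    // Thresholds apply to the magnitude; the sign only sets the direction.
    let magnitude = detected_deg.abs();
    if magnitude > MAX_SKEW_DEG {
        DeskewDecision::OutOfRange
    } else if magnitude >= MIN_SKEW_DEG {
        DeskewDecision::Rotate { by_deg: -detected_deg }
    } else {
        DeskewDecision::Skip
    }
}
```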
Changes:
- Add leptonica-plumbing dependency (ocr feature)
- Create preprocess.rs module with deskew() function
- Add ImgDeskewOutOfRange diagnostic code
- Expose preprocess module in lib.rs
The implementation uses pixFindSkewAndDeskew which both detects
the skew angle and performs deskewing in one call, returning
the detected angle for debugging purposes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add two comprehensive integration tests to validate the ParentTree resolver:
1. test_parent_tree_annotation_with_struct_parent:
- Creates a body paragraph StructElem
- Creates ParentTree with page array (MCID 0 -> body, MCID 1 -> orphan/null)
- Creates ParentTree with annotation entry (key 100 -> body)
- Verifies MCID resolution returns correct map and orphans
- Verifies annotation /StructParent resolution returns the body ref
- Verifies the referenced StructElem is in the tree
2. test_parent_tree_off_by_one_missing_entries:
- Creates ParentTree with sparse array (only 3 entries, although the page may reference more MCIDs)
- Verifies non-null entries are correctly mapped
- Verifies null entries are recorded as orphans
- Documents that MCIDs beyond array length would be detected in Phase 7.1.4
Also export ParentTreeResolver and ParentTreeEntry from parser module
for use by the block builder in Phase 7.1.4.
All 67 struct_tree tests pass (18 ParentTree-specific tests).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix 8 tests that incorrectly passed the ParentTree dict directly instead of
wrapping it in a StructTreeRoot-like dict with a /ParentTree key
- Fix process_nums_array() to preserve null entries as ObjRef { object: 0 }
instead of filtering them out, ensuring orphan MCIDs are correctly reported
- Add verification note for ParentTree-based MCID-to-StructElem resolver
References: pdftract-57o4, plan 7.1 line 2550 (MCID-to-StructElem mapping)
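Why preserving nulls matters can be sketched as follows. The array index is
the MCID, so dropping a null shifts every later MCID. `Entry`,
`preserve_nulls`, and `resolve_mcids` are hypothetical names, and the real
process_nums_array() signature is not shown in this log. Object number 0 is
never an in-use PDF object, so it can serve as the null sentinel.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ObjRef {
    pub object: u32,
}

/// Hypothetical parsed entry of a ParentTree page array.
#[derive(Debug, Clone, Copy)]
pub enum Entry {
    Null,
    Ref(u32),
}

/// Keeps one slot per array position; nulls become `ObjRef { object: 0 }`.
pub fn preserve_nulls(entries: &[Entry]) -> Vec<ObjRef> {
    entries
        .iter()
        .map(|entry| match entry {
            Entry::Null => ObjRef { object: 0 },
            Entry::Ref(number) => ObjRef { object: *number },
        })
        .collect()
}

/// Splits slots into resolved MCIDs and orphan MCIDs (index == MCID).
pub fn resolve_mcids(slots: &[ObjRef]) -> (BTreeMap<usize, ObjRef>, Vec<usize>) {
    let mut resolved = BTreeMap::new();
    let mut orphans = Vec::new();
    for (mcid, slot) in slots.iter().enumerate() {
        if slot.object == 0 {
            orphans.push(mcid);
        } else {
            resolved.insert(mcid, *slot);
        }
    }
    (resolved, orphans)
}
```

Filtering the null out instead would leave two slots, and the second StructElem
reference would be attributed to MCID 1 rather than MCID 2.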
Implements the 6 named-encoding character-code-to-glyph-name lookup
tables required by Level 2 of the encoding fallback chain:
- WinAnsiEncoding (based on Windows code page 1252)
- MacRomanEncoding (Mac OS Roman encoding)
- MacExpertEncoding (Mac OS Expert character set)
- StandardEncoding (Adobe Standard encoding)
- SymbolEncoding (Symbol font encoding)
- ZapfDingbatsEncoding (Zapf Dingbats font encoding)
These tables map character codes (0-255) to glyph names, which are then
mapped to Unicode via the Adobe Glyph List (AGL).
Acceptance criteria:
- All 6 tables compile into static arrays with binary footprint < 30 KB
- WIN_ANSI[0x92] == Some("quoteright") (canonical WinAnsi test)
- MAC_ROMAN[0xD2] == Some("quotedblleft") and MAC_ROMAN[0xD3] == Some("quotedblright")
- STANDARD[0x20] == Some("space")
- NamedEncoding::from_name("WinAnsiEncoding") == Some(NamedEncoding::WinAnsi)
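A minimal sketch of the API shape these criteria describe. Only
`NamedEncoding::WinAnsi` and `from_name` appear in this message; the other
variant names and `WIN_ANSI_SAMPLE` are assumptions, and the sample table
fills just two of the 256 slots (the real tables are generated by build.rs
from the JSON source).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NamedEncoding {
    WinAnsi,
    MacRoman,
    MacExpert,
    Standard,
    Symbol,
    ZapfDingbats,
}

impl NamedEncoding {
    /// Maps a PDF encoding name to its variant; unknown names yield None.
    pub fn from_name(name: &str) -> Option<Self> {
        match name {
            "WinAnsiEncoding" => Some(Self::WinAnsi),
            "MacRomanEncoding" => Some(Self::MacRoman),
            "MacExpertEncoding" => Some(Self::MacExpert),
            "StandardEncoding" => Some(Self::Standard),
            "SymbolEncoding" => Some(Self::Symbol),
            "ZapfDingbatsEncoding" => Some(Self::ZapfDingbats),
            _ => None,
        }
    }
}

/// Builds a deliberately sparse 256-slot table at compile time.
const fn win_ansi_sample() -> [Option<&'static str>; 256] {
    let mut table: [Option<&'static str>; 256] = [None; 256];
    table[0x20] = Some("space");
    table[0x92] = Some("quoteright");
    table
}

/// Code -> glyph name; the glyph name is later mapped to Unicode via the AGL.
pub static WIN_ANSI_SAMPLE: [Option<&'static str>; 256] = win_ansi_sample();
```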
Files:
- crates/pdftract-core/build/named-encodings.json - Source data from ISO 32000-1 Annex D
- crates/pdftract-core/src/font/encoding.rs - Public API with NamedEncoding enum
- crates/pdftract-core/build.rs - Build script updates for encoding generation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements Phase 7.1.2: StandardType -> BlockKind mapping for converting
walked StructElem nodes into the BlockKind taxonomy used by Phase 4 output.
Changes:
- Add BlockKind enum with all output block kinds (paragraph, heading with
level, table, list, list_item, figure, caption, code, block_quote, toc,
formula, reference, note, form_field_struct, inline, structural_container,
artifact, unknown)
- Add MappingResult struct bundling block_kind, is_emitted flag, and optional
diagnostic
- Add structure_type_to_block_kind() function for pure type mapping
- Add map_element_to_block() function as primary mapping API
- Add is_artifact() placeholder for Phase 3.4 marked-content integration
- Add 32 comprehensive unit tests covering all mapping paths
Key features:
- Complete type mapping for all 40+ PDF standard structure types
- Heading level extraction: H->level 1, H1..H6->level 1..6
- Inline elements (Span, Quote) map to Inline (not emitted as blocks)
- Structural containers (Document, Part, Sect, Div, etc.) map to
StructuralContainer (descend without emitting)
- Unknown types emit diagnostic and fall back to paragraph
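The mapping rules above can be sketched for a small subset of types. This is
an illustration under assumptions: it takes the type as a string rather than
a StandardType value, `map_type` is a hypothetical name, the field and variant
spellings are guesses, and only a handful of the 40+ types are covered.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BlockKind {
    Paragraph,
    Heading { level: u8 },
    Table,
    List,
    ListItem,
    Figure,
    Inline,
    StructuralContainer,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MappingResult {
    pub block_kind: BlockKind,
    pub is_emitted: bool,
    pub diagnostic: Option<String>,
}

fn result(block_kind: BlockKind, is_emitted: bool) -> MappingResult {
    MappingResult { block_kind, is_emitted, diagnostic: None }
}

pub fn map_type(structure_type: &str) -> MappingResult {
    match structure_type {
        // Bare H is treated as level 1; H1..H6 carry their own level.
        "H" | "H1" => result(BlockKind::Heading { level: 1 }, true),
        "H2" => result(BlockKind::Heading { level: 2 }, true),
        "H3" => result(BlockKind::Heading { level: 3 }, true),
        "H4" => result(BlockKind::Heading { level: 4 }, true),
        "H5" => result(BlockKind::Heading { level: 5 }, true),
        "H6" => result(BlockKind::Heading { level: 6 }, true),
        "P" => result(BlockKind::Paragraph, true),
        "Table" => result(BlockKind::Table, true),
        "L" => result(BlockKind::List, true),
        "LI" => result(BlockKind::ListItem, true),
        "Figure" => result(BlockKind::Figure, true),
        // Inline elements pass through without becoming blocks.
        "Span" | "Quote" => result(BlockKind::Inline, false),
        // Containers are descended into but not emitted.
        "Document" | "Part" | "Sect" | "Div" => {
            result(BlockKind::StructuralContainer, false)
        }
        // Unknown types fall back to paragraph and carry a diagnostic.
        other => MappingResult {
            block_kind: BlockKind::Paragraph,
            is_emitted: true,
            diagnostic: Some(format!("unknown structure type: {other}")),
        },
    }
}
```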
Acceptance criteria:
- Every Standard structure type has a mapping decision
- Critical test: H1/H2 -> heading level 1/2
- Unit tests: list nesting, table grouping, span passthrough
- Unknown-type fallback path emits a diagnostic line
Refs: Plan section 7.1 lines 2552-2553
Implements the StructTree parser (Phase 7.1.1) with:
- Depth-first walker over /StructTreeRoot via /K array
- Support for all four /K entry types: StructElem, MCID, MCR, OBJR
- /RoleMap resolution with chain handling and cycle detection
- /Lang inheritance through the structure tree
- /ActualText inheritance (applies to all descendant content)
- Public API: StructureType, StructElemNode, StructTreeRoot, RoleMap, Kid
Acceptance criteria:
- PASS: All four /K element kinds handled without crashing
- PASS: /RoleMap chains resolve to standard type or NonStruct
- PASS: /Lang and /ActualText inherit correctly down tree
- PASS: Unit tests for Word RoleMap (Heading1 -> H1)
- PASS: Unit tests for nested /Lang and /ActualText scope
- PASS: Public type StructElemNode documented in core crate
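/RoleMap chain resolution with cycle detection can be sketched as below.
`resolve_role` and the abbreviated `STANDARD_TYPES` list are hypothetical, and
the sketch assumes a name that is already a standard type is never remapped;
the actual parser may order these checks differently.

```rust
use std::collections::{HashMap, HashSet};

/// Abbreviated list of standard structure types, for illustration only.
const STANDARD_TYPES: &[&str] = &[
    "Document", "Part", "Sect", "Div", "P", "H", "H1", "H2", "H3", "H4", "H5",
    "H6", "L", "LI", "Table", "Figure", "Span", "NonStruct",
];

/// Follows the role map until a standard type is reached. Unmapped names
/// and cycles both resolve to "NonStruct".
pub fn resolve_role<'a>(role_map: &'a HashMap<String, String>, name: &'a str) -> &'a str {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut current = name;
    loop {
        if STANDARD_TYPES.contains(&current) {
            return current;
        }
        // `insert` returns false if the name was already visited: a cycle.
        if !seen.insert(current) {
            return "NonStruct";
        }
        match role_map.get(current) {
            Some(next) => current = next.as_str(),
            None => return "NonStruct",
        }
    }
}
```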
References:
- Plan section 7.1 StructTree Exploitation (lines 2547-2549, 2552-2553)
- PDF 1.7 spec 14.7.4 (Structure Content) and 14.8.4 (Standard Structure Types)
Co-Authored-By: Claude Code <noreply@anthropic.com>
The ToUnicode CMap parser (Level 1) implementation was already complete
in crates/pdftract-core/src/font/cmap.rs. This commit fixes test assertion
type mismatches where arrays were compared to slices.
Changes:
- Fixed array-to-slice conversions in test assertions (e.g., &['A'] -> &['A'][..])
- Fixed test_odd_length_utf16_emits_diagnostic to use correct hex string input
- All 18 CMap parser tests now pass
Acceptance criteria verified:
- beginbfchar with single-codepoint (U+FB01 fi ligature)
- beginbfchar with multi-codepoint expansion (<00660069> -> 'f' 'i')
- beginbfrange contiguous range (A..=Z mapping)
- beginbfrange explicit array form
- Comment stripping (%)
- Variable-width source codes
- Multi-codepoint destinations in contiguous ranges
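Two of the cases above, sketched with the standard library. The function names
are hypothetical and this is not the code in cmap.rs. The sketch returns None
for malformed hex where the real parser emits a diagnostic, and it increments
only the final UTF-16 code unit of a range destination.

```rust
/// Decodes a hex string of UTF-16BE code units, e.g. "00660069" -> "fi".
pub fn decode_utf16be_hex(hex: &str) -> Option<String> {
    // Each UTF-16 code unit is exactly four hex digits.
    if hex.len() % 4 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let units: Option<Vec<u16>> = hex
        .as_bytes()
        .chunks(4)
        .map(|chunk| {
            std::str::from_utf8(chunk)
                .ok()
                .and_then(|digits| u16::from_str_radix(digits, 16).ok())
        })
        .collect();
    String::from_utf16(&units?).ok()
}

/// Expands the contiguous form `<lo> <hi> <dst>`: code lo+i maps to dst with
/// its last code unit advanced by i.
pub fn expand_bfrange(lo: u32, hi: u32, dst_utf16: &[u16]) -> Vec<(u32, String)> {
    let mut out = Vec::new();
    let Some((&last, prefix)) = dst_utf16.split_last() else {
        return out;
    };
    for (offset, code) in (lo..=hi).enumerate() {
        let mut units = prefix.to_vec();
        units.push(last.wrapping_add(offset as u16));
        out.push((code, String::from_utf16_lossy(&units)));
    }
    out
}
```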
Closes: pdftract-udz
This commit completes Phase 5.2.2 by integrating the pdfium-render path
into serve mode with runtime validation and feature propagation.
Changes:
- Propagate ocr and full-render features from CLI to pdftract-core
- Add full_render parameter to serve mode ExtractParams
- Implement runtime validation in build_options():
* Returns BadRequest if full_render requested but PDFium unavailable
* Falls back to direct compositing if feature not compiled
- Update all three serve handlers to handle Result from build_options()
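The validation rules can be sketched as a small decision function. The types
and the `choose_render_path` name are hypothetical, and passing the feature
and availability checks as booleans is a simplification; the real
build_options() presumably uses cfg gates and a runtime probe.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RenderPath {
    DirectCompositing,
    Pdfium,
}

/// Hypothetical error type standing in for the serve-mode BadRequest.
#[derive(Debug, PartialEq, Eq)]
pub struct BadRequest(pub &'static str);

pub fn choose_render_path(
    full_render: bool,
    feature_compiled: bool,
    pdfium_available: bool,
) -> Result<RenderPath, BadRequest> {
    // Not requested, or not compiled in: use the direct compositing path.
    if !full_render || !feature_compiled {
        return Ok(RenderPath::DirectCompositing);
    }
    // Requested and compiled in, but PDFium cannot be used at runtime.
    if !pdfium_available {
        return Err(BadRequest("full_render requested but PDFium is unavailable"));
    }
    Ok(RenderPath::Pdfium)
}
```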
Acceptance Criteria:
✅ cargo build --features ocr,serve,full-render succeeds
✅ cargo build --features ocr,serve (no full-render) succeeds
✅ Runtime fallback: full_render=true with feature absent uses direct path
Notes:
- Binary size CI gate (140 MB) requires separate CI infrastructure
- Soft-mask regression tests require separate fixture work
Refs: pdftract-4my
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Complete verification of the direct image compositing path implementation.
All 23 unit tests pass, covering CTM tracking, image placement, rotation,
and soft mask handling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All 5 child beads completed:
- pdftract-3uq: Font subtype classifier and BaseFont prefix stripper
- pdftract-juc: Standard 14 font registry with hardcoded metrics
- pdftract-6ah: Embedded font program loader (ttf-parser/owned_ttf_parser)
- pdftract-cv4: Type 0 composite font + descendant CIDFont loader
- pdftract-5sh: CIDToGIDMap resolver (Identity and stream forms)
77 font module tests pass. Acceptance criteria:
- PASS: All children closed
- PASS: Classifier returns all 8 FontKind variants
- PASS: Subset prefix stripping works correctly
- PASS: CIDToGIDMap Identity and stream forms verified
- PASS: No unwrap/expect on resource dict access
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements `load_type0(font_dict)` following /DescendantFonts to the
CIDFont dictionary, classifying the descendant as CIDFontType0 or
CIDFontType2, reading /DW (default width), parsing /W array (two
formats: per-CID [c [w1 w2...]] and range [cfirst clast w]), and
producing Type0Font containing both parent and descendant.
Acceptance criteria met:
- Type0 font with CIDFontType2 descendant loads
- Widths from [10 [500 600]] resolve: CID 10 -> 500, CID 11 -> 600
- Range form [100 200 800] resolves: CIDs 100..=200 all -> 800
- Missing CID falls back to DW (default 1000)
- CIDFontType0 (CFF) descendant uses ttf-parser CFF entrypoint
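The /W lookup in the criteria above can be sketched as follows. `WEntry` and
`width_for` are hypothetical names for an already-parsed /W array; they are
not the types used by load_type0.

```rust
/// One parsed /W element (hypothetical representation).
#[derive(Debug, Clone, PartialEq)]
pub enum WEntry {
    /// `c [w1 w2 ...]`: consecutive widths starting at CID `first`.
    PerCid { first: u32, widths: Vec<f32> },
    /// `cfirst clast w`: one width for every CID in the inclusive range.
    Range { first: u32, last: u32, width: f32 },
}

/// Returns the width for `cid`, falling back to /DW when no entry matches.
pub fn width_for(cid: u32, entries: &[WEntry], default_width: f32) -> f32 {
    for entry in entries {
        match entry {
            WEntry::PerCid { first, widths } => {
                // checked_sub rejects CIDs below the start of this run.
                if let Some(index) = cid.checked_sub(*first) {
                    if let Some(width) = widths.get(index as usize) {
                        return *width;
                    }
                }
            }
            WEntry::Range { first, last, width } => {
                if (*first..=*last).contains(&cid) {
                    return *width;
                }
            }
        }
    }
    default_width
}
```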
Co-Authored-By: Claude Code <noreply@anthropic.com>
Updated notes/pdftract-2zw.md to reflect that the page classification
fixture integration test suite now has 5 tests (added
test_reproducibility_gate_with_perturbation).
Co-Authored-By: Claude Code <noreply@anthropic.com>
Adds test_reproducibility_gate_with_perturbation which verifies that the
reproducibility check correctly detects when classification results differ.
This test intentionally perturbs a confidence value and asserts that the
reproducibility gate fails with a clear diff message.
Acceptance criteria for pdftract-2zw:
- Reproducibility gate fails on intentional perturbation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All acceptance criteria PASS:
- TrueType font from fixture: glyph_id_for('A') matches Face cmap
- OpenType CFF support: handled via OpenTypeMetrics
- Type1 limited capability: graceful without CharStrings parser
- Corrupt font handling: FONT_PARSE_FAILED diagnostic emitted
15/15 embedded font tests passing.
Updates the needle tracking file to the latest commit
for the PageClassifier engine implementation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update notes/pdftract-33g.md to reflect:
- Micro-benchmark test now passes (p99 < 5 ms)
- Test count updated from 53 to 54
- Future work section updated (benchmark item removed)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add build.rs that generates compile-time std14 metrics from JSON
- Add std14.rs module with Std14Metrics struct and get_std14_metrics()
- Add build/std14-metrics.json with AFM-derived widths for all 14 fonts
- Re-export Std14Metrics, NamedEncoding, get_std14_metrics in lib.rs
Acceptance criteria:
- All 14 Standard fonts (Courier, Helvetica, Times, Symbol, ZapfDingbats
and their variants) return valid metrics from the registry
- Subset-prefixed names (ABCDEF+Helvetica) resolve via strip_subset_prefix()
- Width tables match Adobe AFM data within rounding tolerance
- Binary footprint < 60 KB (generated source: 20 KB, actual data ~8 KB)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement FontKind enum and classify_font() function for Phase 2.1
font type detection. Includes strip_subset_prefix() for handling
font subset names (e.g., ABCDEF+Times-Roman).
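A minimal sketch of subset-prefix stripping, assuming the usual PDF convention
of a six-uppercase-letter tag followed by `+`. The committed
strip_subset_prefix() may differ in signature and edge cases.

```rust
/// Returns the font name without a leading subset tag such as `ABCDEF+`.
/// Names that do not match the tag pattern are returned unchanged.
pub fn strip_subset_prefix(base_font: &str) -> &str {
    let bytes = base_font.as_bytes();
    let tagged = bytes.len() > 7
        && bytes[6] == b'+'
        && bytes[..6].iter().all(u8::is_ascii_uppercase);
    if tagged {
        // The first seven bytes are ASCII, so index 7 is a char boundary.
        &base_font[7..]
    } else {
        base_font
    }
}
```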
FontKind variants:
- Type1, Type1Std14 (Standard 14)
- TrueType, OpenTypeCFF
- Type0, CIDFontType0, CIDFontType2
- Type3
Classifier reads /Subtype, /BaseFont, and for Type0 fonts, descendant
CIDFont subtype. OpenTypeCFF detected via /FontDescriptor /FontFile3
with /Subtype /OpenType.
All 27 font tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add verification note documenting memory ceiling implementation
for fuzz and proptest harnesses.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add 4 new tests to verify PNG and TIFF predictor functions use row-by-row
processing with bounded peak memory (2x stride), never pre-allocating full
output buffers inside tests.
- test_png_predictor_budget_enforcement_small_fixture: 200-byte fixture,
100-byte budget, verifies truncation at row boundary
- test_tiff_predictor_2_budget_enforcement_small_fixture: 160-byte fixture,
80-byte budget, verifies row-by-row processing for grayscale
- test_png_predictor_multiple_selectors_budget_per_row: 25-byte fixture
with all PNG selector types, verifies per-row budget checking
- test_tiff_predictor_2_rgb_budget_enforcement: 45-byte RGB fixture,
verifies multi-byte pixel handling with budget enforcement
All fixtures are under 250 bytes and avoid full-buffer pre-allocation; the
tests mirror the row-by-row discipline from the bf-49wmw production fix.
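The row-by-row discipline can be sketched for TIFF predictor 2 on 8-bit
grayscale data. This is a simplified illustration with a hypothetical
function name and return shape, not the production decoder: it checks the
byte budget before each row and stops at a row boundary.

```rust
/// Undoes horizontal differencing one row at a time. Returns the decoded
/// bytes and a flag that is true when the budget cut decoding short.
pub fn tiff_predictor2_gray8(encoded: &[u8], stride: usize, budget: usize) -> (Vec<u8>, bool) {
    let mut out = Vec::new();
    if stride == 0 {
        return (out, false);
    }
    // The only working buffer is a single row of scratch space.
    let mut row = vec![0u8; stride];
    for chunk in encoded.chunks_exact(stride) {
        // Check before decoding so output never exceeds the budget.
        if out.len() + stride > budget {
            return (out, true);
        }
        row.copy_from_slice(chunk);
        // Each sample stores its difference from the sample to its left.
        for i in 1..stride {
            row[i] = row[i].wrapping_add(row[i - 1]);
        }
        out.extend_from_slice(&row);
    }
    (out, false)
}
```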
Closes bf-21hw8
- Fix test_bomb_limit_flate to actually test early abort behavior
- Use 200-byte pattern (not large buffers) that compresses to ~50 bytes
- Set bomb_limit to 50 bytes to force truncation
- Assert output.len() < pattern.len() to verify truncation occurred
- Add documentation explaining the minimal input approach
Per bf-4xk2v: "Decompression-bomb and max_decompress_bytes tests must
trigger the STREAM_BOMB abort WITHOUT building the multi-GB decoded output
in memory. Use minimal crafted inputs and assert the byte-budget limit fires
early. Never pre-size a Vec to the claimed or decompressed length."
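The byte-budget idea can be sketched with std::io alone. The reader below is
generic and stands in for a Flate decoder; it is an assumption for
illustration, not the project's STREAM_BOMB code path. The output Vec starts
empty and grows only as bytes arrive, so a source that claims an enormous
size never causes a large allocation.

```rust
use std::io::{self, Read};

/// Reads at most `limit` bytes from `source`. The flag is true when the
/// source had more data than the budget allows.
pub fn read_with_budget<R: Read>(source: R, limit: u64) -> io::Result<(Vec<u8>, bool)> {
    let mut out = Vec::new();
    // Allow one extra byte so "exactly at the limit" differs from "over".
    let mut limited = source.take(limit.saturating_add(1));
    limited.read_to_end(&mut out)?;
    let exceeded = out.len() as u64 > limit;
    if exceeded {
        out.truncate(limit as usize);
    }
    Ok((out, exceeded))
}
```

An endless source such as `std::io::repeat` plays the role of a bomb in a
test: the call returns after reading 51 bytes for a 50-byte budget.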
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>