Implements the 6 named-encoding character-code-to-glyph-name lookup
tables required by Level 2 of the encoding fallback chain:
- WinAnsiEncoding (Windows-1252 superset of StandardEncoding)
- MacRomanEncoding (Mac OS Roman encoding)
- MacExpertEncoding (Mac OS Expert character set)
- StandardEncoding (Adobe Standard encoding)
- SymbolEncoding (Symbol font encoding)
- ZapfDingbatsEncoding (Zapf Dingbats font encoding)
These tables map character codes (0-255) to glyph names, which are then
mapped to Unicode via the Adobe Glyph List (AGL).
Acceptance criteria:
- All 6 tables compile into static arrays with binary footprint < 30 KB
- WIN_ANSI[0x92] == Some("quoteright") (canonical WinAnsi test)
- MAC_ROMAN[0xD2] == Some("quotedblleft") and MAC_ROMAN[0xD3] == Some("quotedblright")
- STANDARD[0x20] == Some("space")
- NamedEncoding::from_name("WinAnsiEncoding") == Some(NamedEncoding::WinAnsi)
Files:
- crates/pdftract-core/build/named-encodings.json - Source data from ISO 32000-1 Annex D
- crates/pdftract-core/src/font/encoding.rs - Public API with NamedEncoding enum
- crates/pdftract-core/build.rs - Build script updates for encoding generation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements Phase 7.1.2: StandardType -> BlockKind mapping for converting
walked StructElem nodes into the BlockKind taxonomy used by Phase 4 output.
Changes:
- Add BlockKind enum with all output block kinds (paragraph, heading with
level, table, list, list_item, figure, caption, code, block_quote, toc,
formula, reference, note, form_field_struct, inline, structural_container,
artifact, unknown)
- Add MappingResult struct bundling block_kind, is_emitted flag, and optional
diagnostic
- Add structure_type_to_block_kind() function for pure type mapping
- Add map_element_to_block() function as primary mapping API
- Add is_artifact() placeholder for Phase 3.4 marked-content integration
- Add 32 comprehensive unit tests covering all mapping paths
Key features:
- Complete type mapping for all 40+ PDF standard structure types
- Heading level extraction: H->level 1, H1..H6->level 1..6
- Inline elements (Span, Quote) map to Inline (not emitted as blocks)
- Structural containers (Document, Part, Sect, Div, etc.) map to
StructuralContainer (descend without emitting)
- Unknown types emit diagnostic and fall back to paragraph
Acceptance criteria:
- Every Standard structure type has a mapping decision
- Critical test: H1/H2 -> heading level 1/2
- Unit tests: list nesting, table grouping, span passthrough
- Unknown-type fallback path emits a diagnostic line
Refs: Plan section 7.1 lines 2552-2553
Implements the StructTree parser (Phase 7.1.1) with:
- Depth-first walker over /StructTreeRoot via /K array
- Support for all four /K entry types: StructElem, MCID, MCR, OBJR
- /RoleMap resolution with chain handling and cycle detection
- /Lang inheritance through the structure tree
- /ActualText inheritance (applies to all descendant content)
- Public API: StructureType, StructElemNode, StructTreeRoot, RoleMap, Kid
Acceptance criteria:
- PASS: All four /K element kinds handled without crashing
- PASS: /RoleMap chains resolve to standard type or NonStruct
- PASS: /Lang and /ActualText inherit correctly down tree
- PASS: Unit tests for Word RoleMap (Heading1 -> H1)
- PASS: Unit tests for nested /Lang and /ActualText scope
- PASS: Public type StructElemNode documented in core crate
References:
- Plan section 7.1 StructTree Exploitation (lines 2547-2549, 2552-2553)
- PDF 1.7 spec 14.7.4 (Structure Tree) and 14.8.4 (Standard Structure Types)
Co-Authored-By: Claude Code <noreply@anthropic.com>
The ToUnicode CMap parser (Level 1) implementation was already complete
in crates/pdftract-core/src/font/cmap.rs. This commit fixes test assertion
type mismatches where arrays were compared to slices.
Changes:
- Fixed array-to-slice conversions in test assertions (e.g., &['A'] -> &['A'][..])
- Fixed test_odd_length_utf16_emits_diagnostic to use correct hex string input
- All 18 CMap parser tests now pass
Acceptance criteria verified:
- beginbfchar with single-codepoint (U+FB01 fi ligature)
- beginbfchar with multi-codepoint expansion (<00660069> -> 'f' 'i')
- beginbfrange contiguous range (A..=Z mapping)
- beginbfrange explicit array form
- Comment stripping (%)
- Variable-width source codes
- Multi-codepoint destinations in contiguous ranges
Closes: pdftract-udz
This commit completes Phase 5.2.2 by integrating the pdfium-render path
into serve mode with runtime validation and feature propagation.
Changes:
- Propagate ocr and full-render features from CLI to pdftract-core
- Add full_render parameter to serve mode ExtractParams
- Implement runtime validation in build_options():
* Returns BadRequest if full_render requested but PDFium unavailable
* Falls back to direct compositing if feature not compiled
- Update all three serve handlers to handle Result from build_options()
Acceptance Criteria:
✅ cargo build --features ocr,serve,full-render succeeds
✅ cargo build --features ocr,serve (no full-render) succeeds
✅ Runtime fallback: full_render=true with feature absent uses direct path
Notes:
- Binary size CI gate (140 MB) requires separate CI infrastructure
- Soft-mask regression tests require separate fixture work
Refs: pdftract-4my
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Complete verification of direct image compositing path implementation.
All 23 unit tests pass covering CTM tracking, image placement, rotation,
and soft mask handling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All 5 child beads completed:
- pdftract-3uq: Font subtype classifier and BaseFont prefix stripper
- pdftract-juc: Standard 14 font registry with hardcoded metrics
- pdftract-6ah: Embedded font program loader (ttf-parser/owned_ttf_parser)
- pdftract-cv4: Type 0 composite font + descendant CIDFont loader
- pdftract-5sh: CIDToGIDMap resolver (Identity and stream forms)
77 font module tests pass. Acceptance criteria:
- PASS: All children closed
- PASS: Classifier returns all 8 FontKind variants
- PASS: Subset prefix stripping works correctly
- PASS: CIDToGIDMap Identity and stream forms verified
- PASS: No unwrap/expect on resource dict access
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements `load_type0(font_dict)` following /DescendantFonts to the
CIDFont dictionary, classifying the descendant as CIDFontType0 or
CIDFontType2, reading /DW (default width), parsing /W array (two
formats: per-CID [c [w1 w2...]] and range [cfirst clast w]), and
producing Type0Font containing both parent and descendant.
Acceptance criteria met:
- Type0 font with CIDFontType2 descendant loads
- Widths from [10 [500 600]] resolve: CID 10 -> 500, CID 11 -> 600
- Range form [100 200 800] resolves: CIDs 100..=200 all -> 800
- Missing CID falls back to DW (default 1000)
- CIDFontType0 (CFF) descendant uses ttf-parser CFF entrypoint
Co-Authored-By: Claude Code <noreply@anthropic.com>
Updated notes/pdftract-2zw.md to reflect that the page classification
fixture integration test suite now has 5 tests (added
test_reproducibility_gate_with_perturbation).
Co-Authored-By: Claude Code <noreply@anthropic.com>
Adds test_reproducibility_gate_with_perturbation which verifies that the
reproducibility check correctly detects when classification results differ.
This test intentionally perturbs a confidence value and asserts that the
reproducibility gate fails with a clear diff message.
Acceptance criteria for pdftract-2zw:
- Reproducibility gate fails on intentional perturbation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All acceptance criteria PASS:
- TrueType font from fixture: glyph_id_for('A') matches Face cmap
- OpenType CFF support: handled via OpenTypeMetrics
- Type1 limited capability: graceful without CharStrings parser
- Corrupt font handling: FONT_PARSE_FAILED diagnostic emitted
15/15 embedded font tests passing.
Updates the needle tracking file to the latest commit
for the PageClassifier engine implementation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Update notes/pdftract-33g.md to reflect:
- Micro-benchmark test now PASS (p99 < 5 ms)
- Test count updated from 53 to 54
- Future work section updated (benchmark item removed)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add build.rs that generates compile-time std14 metrics from JSON
- Add std14.rs module with Std14Metrics struct and get_std14_metrics()
- Add build/std14-metrics.json with AFM-derived widths for all 14 fonts
- Re-export Std14Metrics, NamedEncoding, get_std14_metrics in lib.rs
Acceptance criteria:
- All 14 Standard fonts (Courier, Helvetica, Times, Symbol, ZapfDingbats
and their variants) return valid metrics from the registry
- Subset-prefixed names (ABCDEF+Helvetica) resolve via strip_subset_prefix()
- Width tables match Adobe AFM data within rounding tolerance
- Binary footprint < 60 KB (generated source: 20 KB, actual data ~8 KB)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement FontKind enum and classify_font() function for Phase 2.1
font type detection. Includes strip_subset_prefix() for handling
font subset names (e.g., ABCDEF+Times-Roman).
FontKind variants:
- Type1, Type1Std14 (Standard 14)
- TrueType, OpenTypeCFF
- Type0, CIDFontType0, CIDFontType2
- Type3
Classifier reads /Subtype, /BaseFont, and for Type0 fonts, descendant
CIDFont subtype. OpenTypeCFF detected via /FontDescriptor /FontFile3
with /Subtype /OpenType.
All 27 font tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add verification note documenting memory ceiling implementation
for fuzz and proptest harnesses.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add 4 new tests to verify PNG and TIFF predictor functions use row-by-row
processing with bounded peak memory (2x stride), never pre-allocating full
output buffers inside tests.
- test_png_predictor_budget_enforcement_small_fixture: 200-byte fixture,
100-byte budget, verifies truncation at row boundary
- test_tiff_predictor_2_budget_enforcement_small_fixture: 160-byte fixture,
80-byte budget, verifies row-by-row processing for grayscale
- test_png_predictor_multiple_selectors_budget_per_row: 25-byte fixture
with all PNG selector types, verifies per-row budget checking
- test_tiff_predictor_2_rgb_budget_enforcement: 45-byte RGB fixture,
verifies multi-byte pixel handling with budget enforcement
All fixtures are under 250 bytes, no full-buffer pre-allocation, tests
mirror the row-by-row discipline from bf-49wmw production fix.
Closes bf-21hw8
- Fix test_bomb_limit_flate to actually test early abort behavior
- Use 200-byte pattern (not large buffers) that compresses to ~50 bytes
- Set bomb_limit to 50 bytes to force truncation
- Assert output.len() < pattern.len() to verify truncation occurred
- Add documentation explaining the minimal input approach
Per bf-4xk2v: "Decompression-bomb and max_decompress_bytes tests must
trigger the STREAM_BOMB abort WITHOUT building the multi-GB decoded output
in memory. Use minimal crafted inputs and assert the byte-budget limit fires
early. Never pre-size a Vec to the claimed or decompressed length."
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()
This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the bug fixes made to enable the semaphore-based parallel
page extraction implementation to compile and work correctly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix extract_page_inner typo: changed to extract_page (function was undefined)
- Add error_count field to ExtractionMetadata struct
- Add error field to PageResult struct (missing in constructor)
- Add semaphore module to lib.rs exports
The parallelism capping implementation was already in place but had bugs
preventing compilation. This fixes those bugs so the semaphore-based
bounding of in-flight pages works correctly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Verification note for the completion of Phase 0: CI Infrastructure epic.
All 10 sub-phase coordinator beads are closed:
- pdftract-1wqec: WorkflowTemplate scaffolding
- pdftract-1bn: Cross-compilation build matrix (5 targets)
- pdftract-30n: Test execution (musl + glibc)
- pdftract-2rf: Static analysis and quality gates
- pdftract-33v: Property tests and nightly fuzz
- pdftract-2t9: Regression corpus runner (500 PDFs)
- pdftract-60h: Competitive benchmarks (Tier 4)
- pdftract-23k1: pdftract-py-ci stub
- pdftract-4b0z: Release publishing
- pdftract-3i1o: CI observability
This epic adds the final missing piece: the CI sensor that triggers
pdftract-ci workflow on push and PR events.
See also: ci(pdftract-4nj7y) in declarative-config
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Document the implementation and verification of the test-matrix DAG
branch with musl and glibc test legs.
Summary:
- Created pdftract-test-image-build WorkflowTemplate
- Verified test-matrix DAG implementation (test-glibc, test-musl)
- Both legs emit JUnit XML for test reporting
- Acceptance criteria: PASS (with notes on setup step and Docker image)
Known dependencies:
- Setup step still a placeholder (handled by separate Phase 0 bead)
- Docker image needs to be built via pdftract-test-image-build workflow
Relates to pdftract-30n: Phase 0.3 Test execution — cargo test on musl + glibc
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-30n
Document acceptance criteria PASS status for:
- Custom Docker image with OCR support
- nextest configuration with ci/ci-proptest profiles
- Updated test-glibc template in CI
All criteria PASS. Ready to close bead.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add .config/nextest.toml with ci and ci-proptest profiles:
- ci: JUnit output, 60s slow test timeout, retry on flaky tests
- ci-proptest: Higher timeouts, no retries for proptest
Relates to pdftract-5rvp9: Phase 0.3b glibc test leg implementation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Convert test-matrix from single container to DAG with two parallel branches:
- test-glibc: Full test suite including OCR (tesseract available on Debian)
- test-musl: Production binary feature set (no OCR, unavailable on Alpine)
Musl leg configuration:
- Image: ghcr.io/cross-rs/x86_64-unknown-linux-musl:main
- Test: cross test --release --target x86_64-unknown-linux-musl --features default,serve,decrypt
- Output: JUnit XML artifact (test-results-musl.xml)
- Test threads: 4 (parallel execution)
Also updates:
- .nextest.toml: Add JUnit XML output settings to profile.ci
- Cross.toml: Add cross configuration for musl target
Bead: pdftract-5gtcj
Plan section: Phase 0.3
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>