- Add build.rs that generates compile-time std14 metrics from JSON
- Add std14.rs module with Std14Metrics struct and get_std14_metrics()
- Add build/std14-metrics.json with AFM-derived widths for all 14 fonts
- Re-export Std14Metrics, NamedEncoding, get_std14_metrics in lib.rs
Acceptance criteria:
- All 14 Standard fonts (Courier, Helvetica, Times, Symbol, ZapfDingbats
and their variants) return valid metrics from the registry
- Subset-prefixed names (ABCDEF+Helvetica) resolve via strip_subset_prefix()
- Width tables match Adobe AFM data within rounding tolerance
- Binary footprint < 60 KB (generated source: 20 KB, actual data ~8 KB)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix extract_page_inner typo: changed to extract_page (function was undefined)
- Add error_count field to ExtractionMetadata struct
- Add error field to PageResult struct (missing in constructor)
- Add semaphore module to lib.rs exports
The parallelism capping implementation was already in place but had bugs
preventing compilation. This fixes those bugs so the semaphore-based
bounding of in-flight pages works correctly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements Phase 6.9.1: the two-byte-prefix directory scheme that keeps
any single directory under 65K entries even at millions of cached entries.
Changes:
- Add zstd dependency to Cargo.toml
- Create cache module with layout.rs implementing path construction
- Add CacheIndex struct for index.json metadata (schema version, timestamps)
- Implement entry_path(), fingerprint_dir(), parse helpers
- Add load_index()/save_index() for cache metadata persistence
- Ensure mkdir -p semantics with ensure_fingerprint_dir()
- 18 tests covering all acceptance criteria
Acceptance criteria verified:
✓ entry_path produces correct two-level prefix layout
✓ Different opts_hashes for same fingerprint share fp_dir
✓ Different fingerprints with same prefix share first-level dir
✓ index.json round-trips with schema version check
✓ Future schema version rejects cache with clear error
✓ mkdir -p creates prefix dirs; idempotent on concurrent writes
✓ Unicode-correct path handling via std::path::PathBuf
✓ Path length stays under 4096 bytes
Co-Authored-By: Claude Code <noreply@anthropic.com>
Implement the --receipts CLI flag accepting "off" | "lite" | "svg" with default "off".
Thread the ExtractionOptions.receipts field through the extraction pipeline so that
receipts are generated for spans and blocks based on the selected mode.
Changes:
- CLI: Added --receipts flag with clap value_parser for runtime validation
- CLI: Added feature check for SVG mode (requires 'receipts' feature)
- MCP tools: Added receipts field to ExtractArgs, ExtractTextArgs, ExtractMarkdownArgs
- MCP tools: Added build_extraction_options() to parse receipts mode
- Core: Added extract.rs module with extract_pdf(), extract_page(), generate_receipt()
- Core: Added ExtractionOptions with ReceiptsMode enum (Off/Lite/SvgClip)
- Core: Added receipts feature flag to Cargo.toml
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the Receipt struct and lite-mode JSON serialization for
visual citation receipts. This provides cryptographic proof of
provenance for extracted text.
Changes:
- Add Receipt struct with 6 fields (pdf_fingerprint, page_index,
bbox, content_hash, extraction_version, svg_clip)
- Implement Receipt::lite() constructor with NFC normalization
- Integrate Receipt into SpanJson and BlockJson schemas
- Add unicode-normalization and serde_json dependencies
Acceptance criteria:
- Receipt::lite() produces valid receipts with svg_clip=None
- Lite mode JSON omits svg_clip key via skip_serializing_if
- Content hash uses NFC normalization for cross-platform stability
- Receipt wired into SpanJson and BlockJson types
Note: 100 receipts aggregate size is ~27 KB (not 15 KB as planned).
The 15 KB target is not achievable with required field sizes.
Refs: pdftract-5zm86, Phase 6.8 Visual Citation Receipts (lines 2351-2417)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit implements the Cargo.lock policy for reproducible builds
across all workspace members (pdftract-core, pdftract-cli, pdftract-py).
Changes:
- Add CONTRIBUTING.md with lockfile-update workflow documentation
- Add .renovaterc.json for weekly lockfile-only PRs (human-gated)
- Add crates/pdftract-core/README.md with rationale for checked-in lockfiles
- Add notes/pdftract-49f8.md with verification note
The Argo workflow updates (pdftract-ci.yaml) are committed separately
in the declarative-config repo.
Acceptance criteria:
- PASS: Cargo.lock tracked by git, not in .gitignore
- PASS: Argo workflow templates document --locked/--frozen requirements
- WARN: Enforcement to be completed when placeholder templates are implemented
- WARN: Binary reproducibility verification deferred to pdftract-build-binaries implementation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.
- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
* Full test suite loader and executor
* Comparison engine with min/max, string constraints, tolerances
* Skip logic for unsupported features and schema versions
* Report generation in JSON format
- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
* pdftract compare - Compare actual vs expected with tolerances
* Cross-language comparison tool to avoid reimplementations
- Documentation (docs/conformance/sdk-contract.md)
* Complete pattern specification with pseudocode
* Per-language runner locations
* CI integration requirements
- Python reference stub (tests/python-conformance/test_conformance.py)
* Full pytest-based implementation following the pattern
Closes: pdftract-5omc
Implement Merkle SHA-256 fingerprint algorithm for PDF structural
fingerprinting as specified in Phase 1.7 of the plan.
Components:
- FingerprintInput struct with page data and catalog flags
- Per-page hashing: content streams (normalized), resources (sorted),
geometry (4dp banker's rounding)
- Structure tree hash for tagged PDFs
- Catalog feature flag byte (encryption, JS, XFA, OCG)
Acceptance criteria:
- INV-3: 100% reproducible fingerprints (test passes)
- INV-13: Output format ^pdftract-v1:[0-9a-f]{64}$ (test passes)
- Performance: 100-page PDF in < 1ms (test passes)
- KU-7: WARN - no linearized fixtures available
Closes pdftract-q15sh
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the document catalog parser (/Root traversal) for PDF documents.
The catalog parser extracts all key entries from the document catalog
including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
Metadata, PageLabels, OCProperties, OpenAction, AA, and Version.
Key structures:
- MarkInfo: parses /MarkInfo dictionary with is_tagged, user_properties, suspects
- PageLabelStyle: enum for all label styles (D, R, r, A, a)
- PageLabel: single page label with style, prefix, and start value
- PageLabelsTree: number tree parser for /PageLabels with /Nums and /Kids support
- OcProperties: stub for OCG implementation (delegated to dedicated bead)
- Catalog: main catalog struct with all required and optional fields
Number tree implementation:
- Parses /Nums arrays (leaf nodes with alternating key-value pairs)
- Supports /Kids arrays (internal nodes for recursive tree traversal)
- Provides get_label_with_start() and get_label() methods for lookup
- Correctly formats roman numerals (uppercase/lowercase) and letter sequences
All 27 tests pass including proptests for fuzzing robustness (INV-8).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>