Commit graph

49 commits

Author SHA1 Message Date
jedarden
1791bb6d80 docs(pdftract-32y9): finalize SDK architecture note with workspace layout, cross-compile matrix, and KU-12 alignment
- Add workspace layout section documenting pdftract-core as the only direct dependency,
  with pdftract-cli, pdftract-py, and pdftract-inspector-ui as siblings
- Update binary distribution table with correct target triples (musl not gnu for Linux)
- Add KU-12 cross-platform test limitation section with verbatim wording from plan:
  "Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release"
- Add Argo CI templates section (pdftract-cargo-build, pdftract-maturin-build)
- Add feature flag composition section with tiers, dependencies, and binary size budgets
- Add cross-references to sdk-invocation.md, sdk-contract.md, ocr-language-packs.md
- Fix clippy warnings in build.rs files (expect_fun_call, get_first, manual_strip, unused imports)

Closes: pdftract-32y9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 06:38:23 -04:00
jedarden
67b3fde4d6 feat(pdftract-j6yd): implement signatures array output + validation_status enum + schema integration
Add document-level /signatures array output per Phase 7.3 of the plan.

Changes:
- Add SignatureJson struct to schema module with all signature metadata fields
- Update ExtractionResult to include signatures: Vec<SignatureJson>
- Integrate signature extraction into extract_pdf() pipeline
- Update result_to_json() to include signatures in JSON output
- Update JSON schema with signatures array and SignatureJson definition
- Add markdown sink signatures footer when signatures are present
- Add comprehensive tests for signature JSON serialization and validation

Acceptance criteria:
- Schema tests: 5/5 signature JSON tests pass
- Markdown sink emits Signatures footer when count > 0
- PyO3 binding automatically handles Vec<SignatureJson> via serde
- docs/schema/v1.0/pdftract.schema.json updated with signatures shape

Verification note: notes/pdftract-j6yd.md

Closes: pdftract-j6yd
2026-05-24 04:05:34 -04:00
jedarden
d174725241 docs(pdftract-5vhp): bring word-boundary-reconstruction.md to v1.0 final-pass
Complete documentation of the adaptive word-boundary algorithm including:
- Initial threshold = 0.25 * font_size
- 20-glyph median adjustment
- 1.5x median formula
- Full Tc/Tw/Tz (character-spacing, word-spacing, horizontal-scaling) corrections

Expanded from 202 lines to 899 lines with:
- Section 3.1: Tc/Tw/Tz formula with explicit parameter table
- Section 3.2: Text-space vs. device-space comparison per plan line 1550
- Section 4: Adaptive algorithm specification (20-glyph window, 1.5× median, outlier exclusion)
- Section 11: Complete pseudo-code (data structures, main loop, detection, threshold computation)
- Section 12: Edge cases (ZWJ, combining marks, CJK, justified text, monospaced, RTL, ligatures, soft hyphens, tabs)
- Section 13: Validation methodology (corpus at tests/fixtures/word-boundary-corpus/, 141 PDFs, 8 categories)
- Section 14: Implementation checklist and references

Closes: pdftract-5vhp
2026-05-24 03:55:43 -04:00
jedarden
28c31ba0a1 feat(pdftract-vk0gc): implement markdown anchors with parser regex
Add --md-anchors flag that emits HTML comment markers before each block
in Markdown output, allowing downstream tools to map excerpts back to
precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=[([\d.,]+)] kind=(\w+) -->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test

Closes: pdftract-vk0gc
2026-05-24 02:49:16 -04:00
jedarden
cf8f04e3ec docs(pdftract-26r8): finalize glyph recognition research note v1.0
- Reorganize around the four-level Unicode recovery cascade from plan
- Document all cascade levels with confidence scores:
  - Level 1: ToUnicode CMap (1.0)
  - Level 2: Encoding + AGL (0.9)
  - Level 3: Font fingerprint cache (0.85)
  - Level 4: Glyph shape recognition (0.7)
- Add shape database design (pHash algorithm, query, format)
- Document pHash collision tie-break rules (frequency-based)
- Add Type 3 font handling section
- Cross-reference Phase 2.2, 2.4, 2.5 and OQ-02

File grows from 112 to 210 lines. Covers all acceptance criteria.

Closes: pdftract-26r8
2026-05-24 02:10:06 -04:00
jedarden
92e90af0b0 feat(pdftract-zy2jx): generate JSON Schema from Rust output types
- Add schemars dependency to pdftract-core (v1.2)
- Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode)
- Create xtask/src/bin/gen_schema.rs for schema generation
- Add gen-schema command to xtask main.rs
- Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12

Schema includes:
- $schema: "https://json-schema.org/draft/2020-12/schema"
- $defs with all output type definitions
- Proper type annotations for all fields

Closes: pdftract-zy2jx
2026-05-24 01:29:14 -04:00
jedarden
bf37f0f05f docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields
This commit brings docs/research/extraction-output-schema.md to v1.0 final-pass
specification, aligning with Phase 6.1 deliverables and plan requirements.

**Key additions:**
- page_number field documented with page_index relationship (1-based vs 0-based)
- page_type enum expanded with all six values: text, scanned, mixed, broken_vector,
  blank, figure_only — with broken_vector cross-referenced to Phase 5.5
- Block kind enum fully documented: paragraph, heading, list, table, figure, caption,
  code, formula, watermark, header, footer
- Attachments schema with base64 contentEncoding and 50MB truncation rule
- Profile-based classification fields (document_type, document_type_confidence,
  document_type_reasons, profile_name, profile_version, profile_fields)
- Schema Version Compatibility section with additive-evolution rules
- JSON Schema cross-reference throughout

**Format changes:**
- Restructured with ATX headings (## for sections)
- Added explicit field tables for each major schema section
- Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json
- Grew from 81 lines to 304 lines per acceptance criteria

**Plan references:**
- Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659
- INV-9 page_type taxonomy stability

Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>
2026-05-24 00:59:23 -04:00
jedarden
d14ec92fcb feat(pdftract-3zhf): add unified TableDetector::detect entry point
Add unified detect() method to TableDetector that combines both
line-based and borderless table detection pipelines. This completes
the coordinator bead for Phase 7.2: Table Detection and Structure
Reconstruction.

All child beads (7.2.1-7.2.6) are closed:
- 7.2.1: Line-based detection (path segment clustering)
- 7.2.2: Borderless detection (x0 alignment heuristic)
- 7.2.3: Span-to-cell assignment (centroid containment)
- 7.2.4: Header row detection (bold + StructTree TH)
- 7.2.5: Merged cell detection (missing interior edges)
- 7.2.6: Table JSON output schema integration

Critical tests pass:
- 5x3 bordered table (15 cells extracted)
- Merged header cell colspan=3
- Borderless 3-column table detection
- Two-page table continuation detection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:51:59 -04:00
jedarden
33372c23ae fix(pdftract-3c4i): export detect_merged_cells from table module
The detect_merged_cells function was implemented but not exported from
the table module, making it inaccessible to library users. This commit
adds the function to the public API exports.

Also adds a verification note documenting the complete implementation
and the export fix.

Acceptance criteria status:
- All 6 merged cell detection tests pass
- Public Cell.rowspan/colspan fields exist with default 1
- Absorbed cells are excluded from output
- Bbox of merged cell covers absorbed cells
- Borderless tables NO-OP with diagnostic

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:23:14 -04:00
jedarden
26bdd255c8 feat(pdftract-ilen): implement header row detection with bold+TH support
Implement header row detection for tables using two signals:
1. Bold font detection (fully implemented)
2. StructTree TH detection (stub pending MCID tracking)

Bold detection:
- is_bold_font(): detects bold fonts from PostScript name patterns
- is_cell_bold(): checks if all non-whitespace content in a cell is bold
- is_bold_header_row(): validates rows with >=2 bold cells
- count_header_rows(): counts contiguous bold headers from top
- Cell::mark_header_rows(): sets is_header_row flag on cells

TH detection (stub):
- is_th_header_row(): placeholder for StructTree TH detection
  Requires MCID tracking on TableSpan (future work)
  Will use ParentTree to map MCIDs to StructElems
  Will verify TR > TH chain structure

Combined detection:
- is_header_row(): combines bold and TH signals
- Bold wins on conflict per body data design principle

Documentation:
- Updated table-structure-reconstruction.md with full header detection spec
- Documented implemented vs pending signals
- Added implementation notes for TH detection

Tests:
- 45 tests covering all bold detection scenarios
- Tests for multi-row headers (contiguous from top)
- Tests for single-cell row exclusion
- Tests for empty/whitespace cell handling
- Placeholder tests for TH detection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:32:54 -04:00
jedarden
9b5fbc9b5e feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction
- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()

This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 12:30:26 -04:00
jedarden
9fca24c77a docs(plan): SDKs are monorepo members, not separate repos
Add a Repository Layout subsection: SDK source lives at root-level pdftract-<lang>/
in this monorepo (single source of truth), generated via pdftract sdk codegen and
published to language registries from here. Retire the legacy standalone repos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:21:45 -04:00
jedarden
2251f8a9c0 docs(plan): make bounded peak-RSS a CI-gated target; default max_decompress_bytes 2GB->512MB
Add a Memory targets table as a first-class acceptance criterion alongside
Accuracy/Speed/Weight, with a hard per-document peak-RSS ceiling that must not
scale with input/payload. Promote OOM-safety to a Tier-1 hard gate. Reconcile
the contradictory 2 GB max_decompress_bytes default to the research-backed 512 MB
(root cause of an observed multi-GB OOM via the unbounded PNG-predictor pre-alloc
under rayon page parallelism).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:25:50 -04:00
jedarden
bb5346b305 docs(pdftract-58kz): add security policy documentation
Add comprehensive SECURITY.md covering:
- Supported versions policy
- Private vulnerability reporting (email + GitHub)
- 90-day disclosure window with timelines
- CVE assignment via GitHub Security Advisories
- In-scope and out-of-scope vulnerability classes
- Safe harbor policy for good-faith researchers

Add security issue template redirecting users to private reporting.
Add Security section to CONTRIBUTING.md and README.md with links to SECURITY.md.
Add docs/security/pgp-public-key.asc placeholder with generation instructions.

References: bead pdftract-58kz, plan line 3433

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 19:39:24 -04:00
jedarden
9456d8e231 feat(pdftract-5omc): implement per-language conformance test runner pattern
Implements the conformance test runner pattern for all 10 SDKs as specified
in the plan (line 3547). Each SDK now has a dedicated conformance test runner.

Created:
- tests/sdk-conformance/report-schema.json: JSON schema for conformance reports
- docs/notes/sdk-conformance-runner.md: Pattern documentation and reference
- crates/pdftract-cli/tests/conformance.rs: Rust cargo test target
- tests/conformance/test_conformance.py: Python pytest harness
- tests/conformance/conformance.test.ts: Node.js vitest runner
- tests/conformance/conformance_test.go: Go go test runner
- tests/conformance/ConformanceTest.java: Java JUnit 5 runner
- tests/conformance/ConformanceTests.cs: .NET xUnit runner
- tests/conformance/conformance.c: C standalone binary
- tests/conformance/conformance_test.rb: Ruby minitest runner
- tests/conformance/ConformanceTest.php: PHP PHPUnit runner
- tests/conformance/ConformanceTests.swift: Swift XCTest runner

All runners implement:
- Loading of tests/sdk-conformance/cases.json
- Execution of test cases with language-native method invocations
- Comparison of results against expected values with numeric tolerances
- Emission of machine-readable conformance-report.json
- Non-zero exit on failures/errors for CI gating

Acceptance criteria:
- PASS: All 10 SDKs have language-specific runners
- PASS: Runners consume shared cases.json
- PASS: Runners emit JSON reports matching schema
- PASS: Runners exit non-zero on failure
- WARN: README integration pending SDK repo creation
- WARN: Stub implementations return placeholder results

References:
- Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner"
- Plan line 3589: "Conformance suite results published as Argo artifact"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5omc
2026-05-18 01:32:24 -04:00
jedarden
857f928732 feat(pdftract-5omc): implement SDK conformance test runner pattern
Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.

- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
  * Full test suite loader and executor
  * Comparison engine with min/max, string constraints, tolerances
  * Skip logic for unsupported features and schema versions
  * Report generation in JSON format

- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
  * pdftract compare - Compare actual vs expected with tolerances
  * Cross-language comparison tool to avoid reimplementations

- Documentation (docs/conformance/sdk-contract.md)
  * Complete pattern specification with pseudocode
  * Per-language runner locations
  * CI integration requirements

- Python reference stub (tests/python-conformance/test_conformance.py)
  * Full pytest-based implementation following the pattern

Closes: pdftract-5omc
2026-05-18 01:22:23 -04:00
jedarden
a34f9c18d0 docs(pdftract-1g87): create mdBook scaffolding for user documentation
- book.toml with title, authors, build directory, edit-url-template
- src/SUMMARY.md with complete TOC for all planned sections
- src/introduction.md: what pdftract does and doesn't do (Non-Goals)
- src/installation.md: cargo, pip, Homebrew, Docker; KU-12 caveat verbatim
- src/quickstart.md: five-minute walkthrough with executable commands
- 39 draft placeholder files for CLI reference, schema, profiles, SDKs, advanced topics, troubleshooting, FAQ

mdbook build completes cleanly with zero warnings (linkcheck optional).

See notes/pdftract-1g87.md for verification details.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 00:38:51 -04:00
jedarden
5e66846288 docs(pdftract-147a): author SDK contract specification
Add comprehensive SDK contract specification at docs/notes/sdk-contract.md.
This document serves as the constitutional specification for all pdftract
SDK implementations across all languages.

The contract defines:
- Method surface (9 methods mirroring CLI/MCP tools)
- Error mapping (CLI exit codes → native exceptions)
- Versioning compatibility rules (MAJOR lock, MINOR flexibility)
- Option-naming conventions (CLI flag → language-native case)
- Native type-mapping requirements (Document, Page, Span, Block, Match, Fingerprint, Classification)
- Async conventions per language
- Conformance enforcement (100% pass required)
- Change policy (ADR required for contract changes)

Verification note: notes/pdftract-147a.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:13:55 -04:00
jedarden
9f27d16f25 docs(phase-0.1): verify pdftract-ci scaffolding complete
Verified the pdftract-ci WorkflowTemplate exists in declarative-config
and is correctly synced to the iad-ci cluster. All scaffolding
requirements met for Phase 0.1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 03:24:36 -04:00
jedarden
7035706068 docs(plan): fix 3 HIGH gaps + 3 LOW items from Round 5 gap review
HIGH:
- Add outline/bookmark traversal spec to Phase 1.4 (linked list walk, PDFDocEncoding vs UTF-16BE)
- Specify base64 encoding for attachment data field in Phase 7.5
- Move decompression limit to ExtractionOptions.max_decompress_bytes (universal, not serve-only);
  add max_decompress_gb to CLI/Python/HTTP API surfaces

LOW:
- Split log+env_logger into two dep matrix rows for accurate crate count
- Add full_render to Python keyword args and HTTP form fields (with no-op note)
- Clarify v0.1.0 milestone: "all applicable" targets (OCR speed target excluded until v0.2.0)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:30:02 -04:00
jedarden
2ba51a8a73 docs(plan): fix 4 gaps from Round 4 gap review
- Fix quick-xml feature gate: move from ocr to default (XMP conformance detection)
- Make page_number schema update an explicit Phase 6.1 deliverable
- Add PageClass → page_type mapping table; define broken_vector as valid output value
- Fix CI test matrix: musl target excludes ocr/python features; glibc runs --all-features

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:24:12 -04:00
jedarden
2d194a4b1b docs(plan): fix 15 gaps from Round 3 gap review
HIGH:
- Add fontdue crate for glyph rasterization (ttf-parser is a parser, not rasterizer)
- Remove num_cpus reference (rayon default pool sizing is sufficient)
- Update dep count target to < 30 direct crates (< 20 was violated by plan's own list)
- Fix watermark deferral: Phase 7 not Phase 6; no kind:'watermark' until Phase 7
- Add Phase 7.6 (Hyperlink/Annotation Extraction) and 7.7 (Article Thread Chains)

MEDIUM:
- Document header/footer streaming mode limitation: first 3 pages emit as paragraph
- Add conformance/XFA detection spec to Phase 1.4; move quick-xml to default feature
- Clarify pdftract-py-ci is Phase 0 stub, filled in during Phase 6.3
- Specify /Contents array concatenation in Phase 1.4 page tree
- Add page rotation un-rotation step after Phase 3 glyph bbox computation
- Add password delivery: ExtractionOptions.password, --password CLI, HTTP form, Python kwarg
- Fix glyph shape DB: phf::Map → sorted &'static [(u64,char)] slice for Hamming nearest-neighbor
- Add Python benchmark runner infrastructure (python:3.11-slim, requirements.txt, hyperfine)
- Add wordlist-bloom to Feature flags bullet list

LOW:
- Clarify extract_stream() yields page dicts only, not header/footer frames

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:18:33 -04:00
jedarden
eb799c0956 docs(plan): fix 21 gaps from Round 2 gap review
CRITICAL:
- Fix deskew step: pixDeskew operates on grayscale, not binarized image

HIGH:
- Add sha2 crate to dep matrix (needed for font fingerprint hashing)
- Fix bloomfilter feature: wordlist-bloom (optional), not default conditional
- Add build-dependencies subsection (phf_codegen, serde_json)
- Add v0.1.0 fallback for tagged PDFs: XY-cut with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic
- Add v0.1.0 fallback for BrokenVector: emit broken_vector page_type when ocr feature absent
- Add strsim crate for Levenshtein in header/footer deduplication
- Add tokio::task::spawn_blocking bridge for axum→rayon hand-off
- Fix JPX/CCITT OCR path: document full-render requirement; add OCR_JPX/CCITT_UNSUPPORTED diagnostics
- Fix OCG default visibility: /D/BaseState + /D/ON + /D/OFF (was incorrectly /D/AS)

MEDIUM:
- Add JBIG2 OCR limitation: requires full-render; OCR_JBIG2_UNSUPPORTED diagnostic
- Add Standard-14 font skip for Level 3 fingerprinting (no embedded program)
- Change flags field from EnumSet<SpanFlag> to u8 bitmask (removes undocumented enumset dep)
- Add tests/fixtures/encoding/ and tests/fixtures/perf/ to Tier 2 fixture list
- Add ocg_present to Phase 6.1 metadata field list
- Add "Phase 7 feature; empty in Phase 6" notes to links and annotations fields
- Add include_invisible/extract_forms/extract_attachments to HTTP serve form fields
- Clarify Phase 4.7 lang source: document-level /Lang, not per-span (Phase 7)
- Fix Stage 1-2 → Phases 1-2 in architectural decisions (stale draft terminology)
- Remove frame-index notation from NDJSON streaming critical test
- Define Hybrid threshold (≥15% each) and Hybrid merge strategy (vector wins on overlap)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:05:26 -04:00
jedarden
bcccc98fd7 docs(plan): fix 30 gaps from Round 1 gap review
CRITICAL fixes:
- Remove jpeg-decoder from Phase 1.5 crates (contradicted dep matrix)
- Specify word boundary adaptive threshold: text space, per-font-switch window, 20-glyph seed
- Add page_number (1-based) alongside page_index (0-based) to resolve SDK/schema mismatch
- Add mcid: Option<u32> to Glyph struct (was defined in 3.4 but missing from 3.2)
- Add aes + rc4 crates under new decrypt feature; document crypto dependency

HIGH fixes:
- Specify font fingerprint database format (phf::Map, SHA-256, ~500KB, JSON source)
- Fix Level 4 shape DB cross-ref (was "Phase 2.3", corrected to research doc); add Phase 2.5 definition
- Document header/footer cross-page pass as sequential post-rayon with Levenshtein matching
- Replace Tesseract box-file hint approach with PSM_SPARSE_TEXT + post-OCR validation
- Add HTTP serve security constraints: decompression bomb limit, auth guidance, no path params
- Add JavaScript detection spec to Phase 1.4 (all four JS action locations)
- Align CI benchmark gate to 10x pdfminer.six (was 5x, contradicted primary objectives)
- Add cargo bloat CI gate for phf word list size; bloomfilter fallback if >250KB
- Add pdftract-py-ci WorkflowTemplate note with manylinux/osxcross/cross approach
- Add ConfidenceSource enum → schema string mapping table in Phase 4.1

MEDIUM fixes:
- Define docs/schema/v1.0/pdftract.schema.json as Phase 6.1 deliverable
- Add unicode-bidi crate to dep matrix and Phase 4.2 for RTL detection
- Define Color enum with CSS hex conversion rules in Phase 3.1
- Remove bytes crate from Phase 1.2 (belongs in serve feature only; use Arc<[u8]>)
- Specify NDJSON buffer Condvar blocking behavior at window saturation
- Clarify pdftract:ocr vs pdftract:full Docker image tags and size budgets
- Add Docstrum parameters: k=5, Euclidean, ±30° constraints, root node definition
- Add code and formula block kind detection heuristics to Phase 4.4
- Add OCG visibility handling to Phase 1.4 (ON/OFF from /OCProperties /D /AS)
- Add linearized PDF detection and dual-xref merge to Phase 1.3
- Add HTTP 413 to error table with custom JSON rejection handler
- Add Phase 0: CI Infrastructure section (pdftract-ci WorkflowTemplate)

LOW fixes:
- Clarify Name length limit: 127 bytes pre-expansion, matching PDF spec 7.3.5
- Reorder preprocessing pipeline: contrast normalization before binarization (was after)
- Add CIDToGIDMap stream form: 2-byte big-endian GID array

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 17:45:04 -04:00
jedarden
d161d109b3 docs(plan): revise plan to center accuracy/speed/weight as hard targets
- Add Primary Objectives section with CI-gated measurable targets:
  accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s,
  10x vs pdfminer), weight (<4MB default binary, <20 default deps)
- Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional;
  default build is core extraction + CLI only
- Add Phase 4.7: text readability validation and correction pipeline
  (ligature repair, hyphenation, mojibake detection, readability scoring)
- Make pdfium-render explicitly optional (full-render feature) vs. the
  always-present direct image compositing path
- Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber)
- Remove jpeg-decoder and whichlang from dependency matrix (unnecessary)
- Rename implementation-plan.md → plan.md (matches CLAUDE.md reference)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 17:07:48 -04:00
jedarden
8753630bc3 Add parallel extraction research and comprehensive research index
New research document covering parallel extraction architecture:
rayon page-level parallelism, Arc<> shared xref/font/object-stream
caches, RwLock font cache design, Tesseract thread-local OCR pool,
semaphore memory budget, ordered NDJSON streaming slot array, and
catch_unwind error isolation per page.

Also adds docs/research-index.md: a 622-line navigable index of all
83 research documents grouped into 9 thematic categories, with a
"Start Here" reading path, per-phase implementation reading tables,
and an alphabetical lookup table covering every document.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:30:35 -04:00
jedarden
92e6196ac5 Add research: Ruby/furigana typography, PDF/VT variable printing
Two new research documents covering Japanese Ruby text and East Asian
typography (tagged/untagged furigana extraction, Kinsoku Shori spacing,
full-width normalization, tate-chu-yoko, CJK/Latin boundary detection,
ruby_text output field) and PDF/VT variable and transactional printing
(DPart hierarchy traversal, per-record extraction model, DPM metadata,
variable vs. static content classification, postal address extraction,
records array output schema).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:24:21 -04:00
jedarden
e3b72efc83 Add research: Southeast Asian scripts, OpenType MATH formula extraction
Two new research documents covering Southeast Asian script extraction
(Thai/Khmer/Myanmar/Lao/Tibetan/Ethiopic — cluster structure, no-space
word boundary policy for Thai/Lao, Zawgyi vs Unicode detection for
Myanmar, USE shaping, Tesseract fallback) and OpenType MATH table
exploitation for formula extraction (MathConstants for fraction/
subscript/radical layout, TeX OML/OMS/OMX encoding tables, MathML
output generation, GlyphAssembly reconstruction, alternative text
and MathJax XMP source recovery).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:21:48 -04:00
jedarden
4e72c66763 Add research: Indic scripts, adversarial parser security
Two new research documents covering Indic script extraction (abugida
structure, ToUnicode CMap failures for shaped glyphs, ActualText
fast-path, GSUB lookup reversal, pre-base matra reordering, virama
placement, Tesseract fallback with script-specific models) and
adversarial input handling (decompression bombs, circular references,
malformed stream lengths, path traversal in attachments, content stream
loop detection, O(n log n) algorithm requirements, output sanitization).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:18:03 -04:00
jedarden
12fad41596 Add research: span merging, Unicode normalization, implementation plan
Two new research documents covering the glyph-to-span-to-block assembly
pipeline (inter-operator merging, adaptive word gap threshold, column
detection, ligature bbox splitting, multi-granularity output) and
Unicode post-processing (NFC normalization, selective NFKC decomposition
for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ
handling, combining character reordering).

Also adds docs/plan/implementation-plan.md: the full 7-phase Rust
implementation roadmap covering core parser, font/encoding pipeline,
content stream processing, text assembly, OCR integration, API surface,
and advanced features — with crate selections, complexity ratings,
test strategy, and v0.1–v1.0 release milestones.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:15:14 -04:00
jedarden
6b96d8d637 Add research: error handling, PDF/A guarantees, output schema, generator quirks
Four new extraction research documents covering permissive error handling
with extraction quality signaling (five error classes, circular reference
detection, memory limits), PDF/A conformance level guarantees and
fast-path optimization (Level A skips OCR and layout heuristics), the
complete extraction output schema (span/block/table/NDJSON streaming/
versioning), and per-generator extraction quirks (Word/LibreOffice/
InDesign/LaTeX/Chrome/Ghostscript/scanners).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:07:13 -04:00
jedarden
a89fef64fc Add research: article threads, resource dictionaries, catalog, hyperlinks
Four new extraction research documents covering PDF article thread
traversal for multi-flow magazine layouts, resource dictionary
inheritance and ResourceStack semantics for nested Form XObjects,
document catalog and page tree structure (UserUnit, Contents array,
page inheritance), and hyperlink/named destination extraction with
QuadPoints anchor text and link density classification.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:04:00 -04:00
jedarden
16cb1bd61d Add research: xref parsing, object model, font descriptors, PDF/UA-2
Four new extraction research documents covering cross-reference table
and xref stream parsing with error recovery, PDF object model and lexer
correctness (all 8 types, string escapes, stream /Length recovery),
FontDescriptor fields and embedded font data (Type1/TrueType/CFF/OT),
and PDF/UA-2 / PDF 2.0 structure changes (MathML, NFC normalization,
new structure types, artifact classification improvements).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:01:34 -04:00
jedarden
6c6ec6a4ca Add research: color management, text metrics, PDF/X, content stream operators
Four new extraction research documents covering ICC profile and color
space luminance estimation for text visibility, precise text state
tracking and bounding box computation (Tc/Tw/Tz/TL, font units, TJ
kerning, baseline clustering), PDF/X prepress handling (OutputIntent,
TrimBox, spot colors, article threading), and a complete content stream
operator reference (BT/ET, Tj/TJ/'/", BI/ID/EI, BX/EX, marked content).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:59:02 -04:00
jedarden
516ca154aa Add research: page labels, government forms, book publishing, filter decoding
Four new extraction research documents covering page label/PageLabels
number tree and outline/bookmark tree extraction, government form PDF
patterns (IRS, USCIS, court filings, classification markings), book and
publishing PDF structure (running heads, footnotes, index extraction),
and PDF stream filter pipeline (FlateDecode/LZW predictors, JBIG2 global
segments, CCITTFax, JPX, error boundaries).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:55:08 -04:00
jedarden
5ff918b178 Add research: portfolios, incremental updates, tagged PDF, JavaScript/forms
Four new extraction research documents covering PDF portfolio and
attachment enumeration (ZUGFeRD, PDF/A-3 AFRelationship), incremental
update structure and xref chaining, PDF/UA tagged PDF deep dive with
all 36 structure types and MCID mechanics, and JavaScript/AcroForm/XFA
field extraction without script execution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:45:59 -04:00
jedarden
006dfb286c Add research: color visibility, medical/scientific, multilingual, digital signatures
Four new extraction research documents covering color space and contrast
analysis for text visibility, medical/scientific document structure
(ICH E3, IMRaD, FDA labeling, eCTD), multilingual mixed-script extraction
with UBA bidi handling and CJK vertical text, and digital signature
metadata extraction with DocMDP integrity context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:41:43 -04:00
jedarden
eac3235291 Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs
Four new extraction research documents covering text rendering modes
(Tr 0-7 including invisible OCR layers), legal/financial document
extraction patterns, character-level confidence aggregation with output
schema, and PDF/E engineering document handling (CAD, GD&T, schematics).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:35:48 -04:00
jedarden
8f8138a65e Add research: font subsetting, LaTeX patterns, redaction detection
Three new extraction research documents covering subset font Unicode
recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and
proper vs. improper redaction detection with output schema.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:30:52 -04:00
jedarden
04b60a1cf7 Add three research documents: CJK encoding, pipeline synthesis, linearization
- cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0
  composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1,
  Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via
  Adobe CID tables, full-width normalization, vertical text detection
- extraction-pipeline-overview: end-to-end 9-stage synthesis referencing
  all 36 research documents; stages: file open, metadata, page classification,
  content extraction (4 sub-paths), font pipeline, span assembly, normalization
  and quality, supplementary content, output serialization; ASCII data-flow
  diagram
- linearized-pdf-and-streaming: linearization dict keys, hint stream
  bitfield tables, first-page xref lazy parsing, HTTP range request pattern,
  staleness validation, incremental update interaction, NDJSON streaming,
  partial file extraction, lazy PageIter API with rayon par_bridge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:26:36 -04:00
jedarden
116db89c95 Add three research documents on routing and text reconstruction
- word-boundary-reconstruction: expected position formula with Tc/Tw/Tz,
  TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold
  strategies including adaptive histogram, multi-column gap discrimination
- scanned-vs-vector-page-classification: four-category taxonomy, fast
  pre-checks, image coverage AABB computation, character density ratio,
  validity rate, glyph bbox plausibility, region routing map, confidence
  scoring with cost-aware OCR threshold
- pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP
  pdfaid detection, Level B/U/A guarantee implications for extraction,
  font embedding requirements, artifact tagging, PDF/A-3 embedded files,
  PdfaLevel enum with per-level fast-path branching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:22:08 -04:00
jedarden
9420964b73 Add three research documents on parser correctness fundamentals
- graphics-state-tracking: full q/Q stack, text state operators, color
  space tracking, ExtGState keys, clip path management, CTM concatenation,
  blend mode/soft mask visibility, Form XObject isolation, GraphicsState
  Rust struct with is_text_visible implementation
- cmap-format-and-cid-encoding: CMap file structure, codespace range
  scan grammar, bfchar/bfrange/cidchar/cidrange semantics, usecmap
  inheritance with predefined CJK CMap inventory, mixed-length parsing
  state machine, ToUnicode defect handling, Rust CMap struct design
- content-stream-concatenation: multi-stream concatenation with 0x0A
  injection, continuous graphics state across boundaries, resource
  inheritance page-tree walk, Form XObject and Type 3 resource isolation,
  ResourceStack design, EI disambiguation in binary data, lazy decompression

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:16:41 -04:00
jedarden
f805e52fa3 Add four research documents focused on readable text production
- type3-font-extraction: CharProcs stream parsing, TeX/dvips naming
  conventions, dHash shape fingerprinting, nested font stacks, OCR fallback
- watermark-and-background-separation: five PDF watermark mechanisms,
  transparency tracking, cross-page repetition, WCAG contrast detection,
  raster inpainting, diagonal watermark removal pipeline
- historical-and-degraded-document-extraction: eight degradation categories,
  bleed-through removal, illumination correction, Sauvola binarization,
  stroke reconstruction, Fraktur/long-s handling, confidence-gated output
- complex-layout-reading-order: baseline clustering, XY-cut, Docstrum,
  RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering,
  perplexity-based confidence with natural_order fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:13:10 -04:00
jedarden
31e715633d Add four research documents on text quality and document-type handling
- text-readability-validation: character/word/entropy/perplexity checks,
  symbol font detection, remediation decision tree, span quality metadata
- post-ocr-text-correction: error taxonomy, confusable tables, noisy channel
  n-gram model, regex patterns, hyphenation, layout-based correction pipeline
- presentation-and-spreadsheet-pdfs: detection heuristics, slide structure,
  bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries,
  cell type inference, Rust output schema
- semantic-text-reconstruction: beam search n-gram reconstruction, NER
  validation, domain lexicons, cross-span consistency, abbreviation expansion,
  citation repair, coherence scoring, ReconstructedSpan output schema

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:07:30 -04:00
jedarden
a7673c906f Add 12 research documents covering full PDF extraction surface
Infrastructure and parsing:
- raster-ocr-pipeline: trigger detection, preprocessing, Tesseract integration,
  assisted OCR, HOCR alignment, multi-language, performance
- image-and-figure-extraction: XObjects, inline images, filter decoding,
  color spaces, geometry, form XObjects, transparency, figure detection
- form-fields-and-annotations: AcroForm types, XFA, widget appearance
  streams, rich text, annotation text, output schema
- pdf-encryption-and-security: R2-R6 key derivation, object-level
  decryption, permission flags, RustCrypto implementation approach
- page-geometry-and-document-structure: page tree, all five page boxes,
  rotation, coordinate inversion, page labels, outlines, named destinations
- optional-content-groups: OCG/OCMD visibility, usage dictionary, default
  state resolution, content stream marking, multilingual layer patterns
- invisible-and-hidden-text: all 8 Tr modes, PDF/A invisible layer pattern,
  white-on-white, zero-opacity, clipped text, color tracking
- malformed-pdf-repair-and-recovery: xref recovery, stream length repair,
  syntax tolerance, partial extraction, structured warnings

Quality and metadata:
- xmp-and-document-metadata: /Info vs XMP, all namespaces, RDF/XML
  parsing, conflict resolution, encrypted metadata, thumbnails
- embedded-files-and-portfolios: EmbeddedFile streams, Filespec,
  AF relationships, Portfolio detection, ZUGFeRD/Factur-X, security
- performance-and-streaming-architecture: mmap, lazy loading, NDJSON
  streaming, rayon parallelism, font caching, axum HTTP server
- benchmark-and-test-methodology: CER/WER/TEDS metrics, corpus
  categories, reading order scoring, regression CI, public datasets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:05:42 -04:00
jedarden
b805593973 Add six research documents covering output-side extraction topics
- table-structure-reconstruction: line detection, gap analysis, Hough
  transform, graph-based cell reconstruction, merged cells, multi-page tables
- mathematical-expression-handling: five encoding cases, OpenType MATH table,
  symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers
- language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi,
  CJK vertical text, ligature normalization, whatlang/lingua integration
- document-classification-and-zone-labeling: margin heuristics, font
  clustering, cross-page recurrence, footnote/caption/sidebar detection
- post-extraction-normalization: hyphen handling, ligature expansion,
  paragraph reconstruction, Unicode normalization, pipeline ordering
- chunking-for-llm-consumption: semantic snapping, heading hierarchy,
  sliding window overlap, table chunking strategies, token budget, late chunking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:56:25 -04:00
jedarden
ef9c03095d Add SDK architecture notes covering top 10 languages
Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples
for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI
options, and implementation sequencing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:51:25 -04:00
jedarden
c2870e6640 Add research docs and SDK invocation notes
Four research documents covering PDF spec fundamentals, font types and
encoding, glyph Unicode recovery, and tagged PDF structure/reading order.
SDK invocation notes with subprocess and HTTP examples for Python, Node.js,
Go, Ruby, Java, Rust, and Bash.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:33:34 -04:00
jedarden
4ae798c8b1 Initial repo scaffold with README and docs structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:26:16 -04:00