71 commits

Author SHA1 Message Date
jedarden
0dbbbf967f feat(pdftract-30ahi): configure maturin for 5-target wheel builds
Configure maturin to build Python wheels for 5 target triples using
cross-compilation from a single Linux runner. Enable ABI3 for forward
compatibility across Python 3.10+.

Changes:
- pyproject.toml: Set requires-python = ">=3.10" (down from 3.11)
- pyproject.toml: Add Python 3.10 classifier
- pyproject.toml: Update comment to reflect 3.10+ compatibility
- Cargo.toml: Add pyo3 abi3-py310 feature
- docs/operations/build-wheels.md: Document cross-compilation setup

Target triples:
- x86_64-unknown-linux-gnu (manylinux_2_28_x86_64)
- aarch64-unknown-linux-gnu (manylinux_2_28_aarch64)
- x86_64-apple-darwin (macosx_11_0_x86_64)
- aarch64-apple-darwin (macosx_11_0_arm64)
- x86_64-pc-windows-gnu (win_amd64)

All wheels will be ABI3 (cp310-abi3) compatible, producing a single
wheel per platform instead of N versions × 5 platforms.

Refs: pdftract-30ahi, Phase 6.3.4
2026-05-28 08:04:32 -04:00
jedarden
23322f79d1 feat(pdftract-2qw5j): add explicit enum constraints to JSON Schema
Add explicit enum constraints to page_type, severity, and confidence_source
fields in the generated JSON Schema for better validation.

Changes:
- Modified xtask/src/bin/gen_schema.rs to add explicit enum constraints
  during schema generation via add_enum_constraints() function
- page_type enum: ["text", "scanned", "mixed", "broken_vector", "blank", "figure_only"]
- severity enum: ["info", "warning", "error", "fatal"]
- confidence_source enum: ["native", "heuristic", "ocr"]
- Regenerated docs/schema/v1.0/pdftract.schema.json with enum constraints
- Added .github/workflows/schema-gen.yml CI workflow for schema validation

The CI workflow validates:
1. Generated schema matches committed file (fails on diff)
2. JSON syntax is valid
3. Schema structure is correct ($id, $schema, title, $defs)
4. Enum constraints are present and have correct values

This ensures schema changes are reviewable in PRs and forces
developers to commit the updated schema when type definitions change.
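
As a rough illustration of what the enum constraints enforce, the sketch below checks a value against the three enum lists quoted in this commit. The helper name `is_allowed_enum_value` is hypothetical and not part of pdftract; the actual validation is done through the generated JSON Schema.

```rust
// Allowed values copied from the commit message above.
const PAGE_TYPES: [&str; 6] = ["text", "scanned", "mixed", "broken_vector", "blank", "figure_only"];
const SEVERITIES: [&str; 4] = ["info", "warning", "error", "fatal"];
const CONFIDENCE_SOURCES: [&str; 3] = ["native", "heuristic", "ocr"];

/// Hypothetical helper: returns `None` for a field without an enum constraint,
/// otherwise whether `value` is one of the allowed members.
pub fn is_allowed_enum_value(field: &str, value: &str) -> Option<bool> {
    let allowed: &[&str] = match field {
        "page_type" => &PAGE_TYPES,
        "severity" => &SEVERITIES,
        "confidence_source" => &CONFIDENCE_SOURCES,
        _ => return None,
    };
    Some(allowed.contains(&value))
}
```

For example, `is_allowed_enum_value("severity", "critical")` yields `Some(false)`, which is the kind of value the schema would now reject.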

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:47:54 -04:00
jedarden
ae9e478405 docs(pdftract-2qw5j): regenerate JSON schema from updated Rust types
The schema now reflects the latest doc comments from the Rust types,
including updated descriptions for annotations and other fields.

Changes:
- AnnotationJson description updates (phase 7.6.4 reference)
- Format consistency updates (float vs double)
- Subtype-specific field documentation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 02:25:00 -04:00
jedarden
43e2e5a399 docs(pdftract-2bfgc): add sample nginx and Traefik reverse-proxy configs
Add two example reverse-proxy configuration files to help operators
deploy pdftract serve with TLS and authentication in front of the
no-auth pdftract server.

- docs/operations/serve-nginx-example.conf: nginx config with Basic Auth,
  proxy_pass to localhost:8080, /extract and /health endpoints
- docs/operations/serve-traefik-example.yaml: Traefik dynamic config with
  BasicAuth middleware, buffering limits, separate health router

Both configs include top comments explaining the deployment model:
pdftract serve binds to 127.0.0.1:8080 with no auth; the reverse
proxy provides TLS termination and authentication.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:37:34 -04:00
jedarden
90d1b9a83d test(pdftract-4c8qu): add page_label tests and fix JSON schema
- Add test_page_json_with_page_labels_roman_numerals: verifies page_label
  serialization with roman numeral values (i, ii, iii, etc)
- Add test_page_json_without_page_labels_absent: verifies page_label is
  absent (null) when PDF has no /PageLabels
- Add test_page_json_page_index_and_page_number_both_present: verifies
  both page_index and page_number are always present and page_number = page_index + 1
- Add test_page_json_roundtrip_with_all_fields: verifies full roundtrip
  serde preservation of all PageJson fields

- Update docs/schema/v1.0/pdftract.schema.json PageResult definition:
  - Add page_number field (1-based, = page_index + 1)
  - Add page_label field (optional, from /PageLabels number tree)
  - Add width and height fields (page geometry in points)
  - Add rotation field (0, 90, 180, 270 degrees)
  - Add type field with enum: text, scanned, mixed, broken_vector, blank, figure_only
  - Update required fields to include all page-level fields

Acceptance criteria:
- Page serializes with both page_index AND page_number
- PDF with /PageLabels [{S: "r"}] produces page_label "i", "ii", "iii" etc
- PDF without /PageLabels -> page_label absent
- JSON Schema enum for page_type includes all values
- Roundtrip serde Page test passes
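
The relationship between page_index, page_number, and a lowercase-roman page_label can be sketched as follows. This is a minimal illustration assuming a single /PageLabels range with style `r` that starts at 1; the function names are hypothetical and not taken from the pdftract source.

```rust
/// Convert a positive number to a lowercase roman numeral (label style `r`).
pub fn to_lower_roman(mut n: u32) -> String {
    const TABLE: [(u32, &str); 13] = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ];
    let mut out = String::new();
    for (value, symbol) in TABLE {
        // Greedily subtract the largest value that still fits.
        while n >= value {
            out.push_str(symbol);
            n -= value;
        }
    }
    out
}

/// 1-based page number derived from the 0-based page index.
pub fn page_number(page_index: u32) -> u32 {
    page_index + 1
}

/// Hypothetical label helper: assumes one roman-numbered range starting at 1.
pub fn roman_page_label(page_index: u32) -> String {
    to_lower_roman(page_index + 1)
}
```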

Closes: pdftract-4c8qu
2026-05-25 14:43:31 -04:00
jedarden
9abc386cce feat(pdftract-3h9xo): implement threads JSON output + schema integration
Phase 7.7.3: Add threads field to ExtractionResult with ThreadJson schema integration.

Changes:
- Added ThreadJson and BeadJson structs to schema/mod.rs
- Added thread_to_json() function to threads/mod.rs
- Added build_page_ref_to_index() helper to parser/pages.rs
- Added threads field to ExtractionResult in extract.rs
- Implemented Phase 7.7 extraction logic with discover_threads/walk_beads
- Added threads_to_markdown() and collapse_page_ranges() to markdown.rs
- Updated JSON schema with ThreadJson and BeadJson definitions
- Added thread_to_py() and bead_to_py() conversions in pdftract-py
- Exported ThreadJson, BeadJson from lib.rs

All 32 threads module tests pass. All 35 markdown tests pass.

Verification: notes/pdftract-3h9xo.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 13:40:15 -04:00
jedarden
2be802aca5 feat(pdftract-2u6q2): implement diagnostic infrastructure
Add DiagnosticsCollector type for thread-safe diagnostic aggregation,
add hint field to DiagnosticJson, add missing error codes
(IMG_SOURCE_MIXED, PROFILE_INVALID, REPAIR_RESCUED_FROM_BACKWARDS_XREF),
and create comprehensive diagnostics documentation.

Changes:
- DiagnosticsCollector: Arc<Mutex<Vec<Diagnostic>>> wrapper with emit()
  helpers for emitting diagnostics from multiple threads
- DiagnosticJson: add hint: Option<String> field for suggested actions
- DiagCode: add ImgSourceMixed, ProfileInvalid, RepairRescuedFromBackwardsXref
- docs/integrations/diagnostics-codes.md: comprehensive code catalog
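
A minimal sketch of the collector shape described above, assuming a simplified `Diagnostic` struct. The field names and the `snapshot()` helper are illustrative assumptions; only the `Arc<Mutex<Vec<Diagnostic>>>` wrapper and `emit()` come from the commit message.

```rust
use std::sync::{Arc, Mutex};

/// Simplified diagnostic record (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct Diagnostic {
    pub code: String,
    pub message: String,
    pub hint: Option<String>,
}

/// Cloneable handle; all clones share the same underlying Vec.
#[derive(Debug, Clone, Default)]
pub struct DiagnosticsCollector {
    inner: Arc<Mutex<Vec<Diagnostic>>>,
}

impl DiagnosticsCollector {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append one diagnostic; safe to call from multiple threads.
    pub fn emit(&self, code: &str, message: &str, hint: Option<&str>) {
        // Recover the data even if another thread panicked while holding the lock.
        let mut guard = self.inner.lock().unwrap_or_else(|p| p.into_inner());
        guard.push(Diagnostic {
            code: code.to_string(),
            message: message.to_string(),
            hint: hint.map(str::to_string),
        });
    }

    /// Hypothetical helper: copy out everything collected so far.
    pub fn snapshot(&self) -> Vec<Diagnostic> {
        self.inner.lock().unwrap_or_else(|p| p.into_inner()).clone()
    }
}
```

Each worker thread would hold its own clone of the collector and call `emit()`; the order of entries across threads is not deterministic in this sketch.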

Closes: pdftract-2u6q2
2026-05-25 13:16:38 -04:00
jedarden
6000c654ce fix: resolve compilation errors across codebase
- Fixed missing fields in BlockJson, SpanJson, ExtractionOptions initializations
- Added feature gates to ocr_integration tests for conditional compilation
- Fixed McpServerState::new calls to include audit writer argument
- Fixed CCITTFaxDecoder::decode calls to use instance method
- Fixed type casts for ObjRef::new calls
- Fixed serde_json::Value method calls (is_some -> !is_null)
- Fixed ProfileType test feature gates
- Worked around lifetime issues in schema roundtrip tests

These changes fix numerous compilation errors that were blocking the
codebase from building. The main library and tests now compile successfully.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 08:38:04 -04:00
jedarden
b7851b9d92 feat(pdftract-4hle): implement 7.6.4 links + annotations JSON output
Add JSON conversion functions, schema integration, and extraction
pipeline wiring for Phase 7.6 hyperlink and annotation extraction.

Changes:
- Create annotation/json.rs with conversion functions (link_to_json,
  annotation_to_json, fit_type_to_json, sort_links, sort_annotations)
- Add 13 comprehensive tests covering all link/annotation types
- Wire Phase 7.6 annotation extraction into main extract.rs pipeline
- Update docs/schema/v1.0/pdftract.schema.json with LinkJson,
  AnnotationJson, DestArrayJson, DestTypeJson, AnnotationSpecificJson
- Add links to root schema properties and required fields
- Add annotations array to PageResult

Schema definitions include all 8 PDF fit types (XYZ, Fit, FitH, FitV,
FitR, FitB, FitBH, FitBV) and all major annotation subtypes (TextMarkup,
Stamp, FreeText, Text, Ink, Line, Polygon, FileAttachment).

Closes pdftract-4hle (7.6.4)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 07:44:12 -04:00
jedarden
4ec9ff7470 docs(pdftract-5boam): add JSON schema reference page
- Created comprehensive json-schema-reference.md with:
  - Top-level structure documentation
  - Document metadata, page result, span, block fields
  - Table structure (row/cell) with examples
  - Form fields and signatures (Phase 7 placeholders)
  - Receipts and coordinate system docs
  - Cross-references to plan sections (INV-11, Phase 6.1, etc.)
- Added to mdBook SUMMARY.md as top-level reference page
- All examples use real JSON from the schema
- Builds successfully (46KB HTML output)

Acceptance criteria:
- PASS: docs/user-docs/src/json-schema-reference.md exists
- PASS: Covers all top-level types and enums (Document, Page, Span, Block, Table, FormField, Signature, Receipt)
- PASS: Examples for each major type
- PASS: mdBook renders cleanly (verified)
- PASS: Cross-references to plan sections included

Closes: pdftract-5boam
2026-05-25 05:18:53 -04:00
jedarden
85863a244b docs(manual-release): add PB-13 fallback release runbook
Implement the manual release procedure for reproducing milestone
releases locally when Argo Workflows in iad-ci is degraded or
unavailable. This is the PB-13 fallback documented in the plan
(line 567) for the R13 risk register entry.

The runbook includes:
- Prerequisites (hardware, tools, cross-compilation toolchains)
- OpenBao secret paths for all release credentials
- 13-step release procedure covering:
  1. Tag verification
  2. Full CI suite run
  3. Cross-compilation for 5 target triples × 2 feature variants
  4. Binary verification
  5. SHA-256 checksum generation
  6. GPG signing of checksums
  7. Python wheel building (maturin)
  8. PyPI upload
  9. crates.io publishing (pdftract-core → pdftract-cli order)
  10. GitHub Release creation
  11. mdBook building
  12. Cloudflare Pages deployment
  13. SLSA Level 2 attestation generation
- Failure mode recovery procedures (triple build failure,
  PyPI upload failure, SLSA attestation failure)
- Idempotency and safe re-run rules per step
- Completion criteria (all channels must succeed)
- Continuity plan (written for a stranger)

Acceptance criteria:
- docs/operations/manual-release.md exists with all required sections
- Step-by-step procedure complete (all 13 steps)
- Manual release CHANGELOG record template present
- Failure modes documented for the three most likely partial failures
- Runbook is verbatim-executable by a non-author release lead

Closes: pdftract-4sj0
2026-05-25 03:23:29 -04:00
jedarden
47df769e4b feat(pdftract-5ls35): implement JSON-Lines output sink for grep
Implement the --json output sink for pdftract grep with JSON-Lines
format (one match per line). Includes MatchEvent, FileOnlyEvent,
CountEvent structs and JsonSink line-buffered writer.

Key features:
- MatchEvent with all fields (path, page_index, bbox, match_text,
  span_text, span_confidence, pdf_fingerprint, crosses_spans)
- crosses_spans omitted when false via skip_serializing_if
- NaN/Infinity in span_confidence replaced with null
- page_index is 0-based (machine convention)
- FileOnlyEvent for -l mode, CountEvent for -c mode
- Line-buffered writes with immediate flush
- JSON schema at docs/schema/v1.0/grep-jsonl.schema.json
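
The null-for-NaN rule and the omitted crosses_spans field can be illustrated with a reduced, hand-formatted event. This is only a sketch: the commit indicates the real structs use serde (skip_serializing_if), whereas the functions below are hypothetical and cover just three of the fields.

```rust
/// JSON has no NaN or Infinity literals, so non-finite values become `null`.
pub fn confidence_token(value: f64) -> String {
    if value.is_finite() {
        value.to_string()
    } else {
        "null".to_string()
    }
}

/// Hypothetical, reduced match event: one JSON object terminated by a newline.
pub fn match_event_line(page_index: usize, span_confidence: f64, crosses_spans: bool) -> String {
    let mut line = format!(
        "{{\"page_index\":{},\"span_confidence\":{}",
        page_index,
        confidence_token(span_confidence)
    );
    // Emit the flag only when true, mirroring the skip-when-false behavior.
    if crosses_spans {
        line.push_str(",\"crosses_spans\":true");
    }
    line.push_str("}\n");
    line
}
```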

Closes: pdftract-5ls35
2026-05-25 02:05:17 -04:00
jedarden
2ccdaecda1 docs(pdftract-5nare): add comprehensive FAQ with 24 questions
Added docs/user-docs/src/faq.md with 24 FAQ entries covering:
- General questions (what is pdftract, extract vs extract_text, JS execution)
- Installation and setup (proxy, system requirements)
- Usage (broken_vector, OCR speed, page ranges, images, batch processing)
- Configuration (custom profiles, OCR accuracy, confidence scores)
- Output formats (Markdown, tables, metadata, passwords)
- Troubleshooting (errors, empty output, debugging, memory usage)

Each answer is 1-3 paragraphs with cross-links to fuller docs.
mdBook builds successfully.

Acceptance criteria:
- PASS: docs/user-docs/src/faq.md exists
- PASS: 24 questions covered (target: 15-25)
- PASS: Each answer is 1-3 paragraphs
- PASS: Cross-links work
- PASS: mdBook renders cleanly

Closes: pdftract-5nare
2026-05-25 00:22:48 -04:00
jedarden
016c738188 feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata
Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that
derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via
the schemars crate.

Changes:
- Add stable key sorting (sort_keys_recursive) for byte-identical output
- Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json
- Set title to "pdftract Output v1.0"
- Add cargo alias `gen-schema` for convenient invocation
- Emit schema to docs/schema/v1.0/pdftract.schema.json

The schema is generated from the Rust types with schemars derives, ensuring
the JSON schema is always in sync with the source types.

Acceptance criteria:
- cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json
- Generated schema validates against JSON Schema Draft 2020-12
- Schema $id is the stable URL
- Title is "pdftract Output v1.0"
- Stable ordering: regenerating twice produces byte-identical output
- All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.)

Note: page_type and confidence_source enums are not yet implemented in the
Rust types (marked as TODO in schema/mod.rs). These will be added by sibling
beads pdftract-1ob and pdftract-1f8we respectively.

Closes: pdftract-5nv9h
2026-05-24 17:31:16 -04:00
jedarden
84b4448648 feat(pdftract-5qca): implement form_fields JSON output + schema integration
Phase 7.4.5 implementation: Wire combined Vec<(String, FormFieldValue)> from
combiner into document-level /form_fields JSON output with tagged union schema.

- Add FormFieldJson, FormFieldTypeJson, FormFieldValueJson, ChoiceValueJson to schema
- Add form_fields: Vec<FormFieldJson> to ExtractionResult (always emitted, empty when none)
- Implement acro_field_to_value() converter for Phase 7.4.2 type-specific extraction
- Wire form field extraction in extract_pdf(): walk AcroForm, extract XFA, combine with XFA-wins
- Add convert_form_field_to_json() helper for FormFieldValue → FormFieldJson conversion
- Update docs/schema/v1.0/pdftract.schema.json with form_fields $defs and required field
- Add form_fields_to_markdown() to markdown module for Form Fields footer table

Schema shape: /form_fields is array of {name, type, value, default?, page_index?, rect?,
required, read_only, multiline?, max_length?, options?, multi_select?, selected?,
state_name?, pushbutton?, radio?}. Type field is tagged enum: "text", "button", "choice",
"signature". Value field varies by type (string|boolean|string|array|uint|null).

Closes: pdftract-5qca

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 14:36:03 -04:00
jedarden
d9d21df157 docs(pdftract-653ah): add runbook integration for pdftract doctor
- Created docs/operations/manual-platform-smoke.md with comprehensive
  smoke test runbook for KU-12 quarterly manual platform testing
- Added troubleshooting table covering all 14 doctor checks
- Cross-referenced runbook from installation.md and quickstart.md
- Added CI gate test (doctor_runbook_coverage.rs) to verify
  troubleshooting table completeness

Acceptance criteria:
✓ Step 1: pdftract doctor as first section in runbook
✓ Troubleshooting table covers all FAIL-capable checks
✓ installation.md mentions pdftract doctor with runbook link
✓ quickstart.md uses pdftract doctor as first example command
✓ CI gate parses runbook and asserts all checks are present
✓ mdBook build succeeds
✓ No broken internal links

Closes: pdftract-653ah
2026-05-24 13:26:31 -04:00
jedarden
b6b9ed74a2 docs(pdftract-3om3): add MCP client configuration guide
Add docs/integrations/mcp-clients.md with copy-paste-ready configuration
snippets for Claude Desktop, Cursor, Continue, and a custom SDK template.

Each section includes:
- Per-OS config file locations
- JSON/YAML snippets
- Validation steps
- Minimum client version verified

Also includes:
- Multi-client HTTP mode setup
- TH-03 compliance note (auth required for public binds)
- Troubleshooting for common failure modes
- Cross-references to sdk-invocation.md, KU-5, OQ-07

Closes: pdftract-3om3

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 13:10:33 -04:00
jedarden
eb025f7b1a docs(pdftract-3wrx): add release signing strategy note
Resolves OQ-10: document v1.0.0 stance on binary signing.
- Linux: GPG-signed (implemented)
- macOS: Deferred to v1.1+ ($99/yr Apple Developer Program)
- Windows: Deferred to v1.1+ ($200-400/yr Authenticode cert)
- All platforms: SLSA Level 2 attestation (already committed)

Closes: pdftract-3wrx

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 11:12:56 -04:00
jedarden
94b02dedfe docs(pdftract-1tjn): finalize OpenType MATH and formula extraction research note v1.0
- Add Section 11: Formula-Region Detection Algorithm with pseudo-code
- Add Section 12: Inline vs Display Formula Classification rules
- Add Section 13: LaTeX-Like Reconstruction (Best-Effort) with feature-flag guidance
- Add Section 14: Profile Classifier Signal `structural.has_math` definition
- Add Section 15: Validation Methodology with arXiv fixture corpus strategy

File grows from 168 to 426 lines. All acceptance criteria PASS.

Closes: pdftract-1tjn
2026-05-24 10:41:39 -04:00
jedarden
8d6a1a07df docs(pdftract-372e): finalize watermark and background separation research note v1.0
- Added Section 2: Combined Watermark Scoring Algorithm with signal definitions, pseudo-code, threshold tuning, and weight overrides
- Added Section 4: Font-Based Signals (font size, color, weight/family)
- Added Section 11: Text Output Mode behavior (pre/post Phase 7)
- Added Section 12: Edge Cases (stamps vs watermarks, raster watermarks, form profile override, reading-order interaction)
- Added Section 13: Validation Corpus with empirical baseline results
- Expanded Section 10 with WatermarkSignals struct containing individual signal scores
- File grows from 198 to 546 lines

Closes: pdftract-372e

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 10:33:37 -04:00
jedarden
e25a4fc78d docs(pdftract-10cf): finalize table structure reconstruction research note v1.0
Added complete pseudo-code listings for:
- Line-based grid reconstruction algorithm (path segment collection,
  collinear merging, intersection finding, cell synthesis)
- Borderless table detection via vertical projection profiles
  and column separator inference
- Cell content assignment via centroid containment
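
Centroid containment can be sketched as below: a span is assigned to the first cell whose bounding box contains the span's center point. The `Rect` type, the half-open edge handling, and the function name are assumptions made for illustration, not the research note's pseudo-code.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl Rect {
    pub fn centroid(&self) -> (f64, f64) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }

    /// Half-open containment so a point on a shared edge belongs to one cell only.
    pub fn contains(&self, (x, y): (f64, f64)) -> bool {
        x >= self.x0 && x < self.x1 && y >= self.y0 && y < self.y1
    }
}

/// Index of the first cell containing the span's centroid, if any.
pub fn assign_to_cell(cells: &[Rect], span: &Rect) -> Option<usize> {
    let center = span.centroid();
    cells.iter().position(|cell| cell.contains(center))
}
```

Using the centroid rather than full overlap means a span that slightly overhangs a cell border is still assigned to the cell holding most of it.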

Also added version history section documenting v0.9 -> v1.0 changes.

Closes: pdftract-10cf
2026-05-24 09:58:03 -04:00
jedarden
57df42f478 docs(pdftract-3b1x): finalize sdk-invocation.md with subprocess contract and TH-07 compliance
Add comprehensive "Subprocess Contract" section documenting:
- argv layout with canonical form
- stdin discipline (password ingress, PDF bytes from stdin)
- stdout/stderr discipline (what goes where, what never gets logged)
- Exit code taxonomy (0, 64-78) with TH-03 (exit 78) and TH-07 (exit 64) refs
- Environment variable pass-through (PDFTRACT_PASSWORD, PDFTRACT_MCP_TOKEN, etc.)
- --progress-json event schema (ndjson format, all event types)
- --capture-diagnostics archive layout (zip/tar, contained files, scrubbing rules)

Update all language examples (Python, Node.js, Go, Ruby, Java, Rust) with
TH-07-compliant password handling:
- Pass password via PDFTRACT_PASSWORD env var (subprocess)
- Pass password via multipart form field (HTTP)
- Never use --password VALUE flag (rejected unless opt-in)

Add progress JSON parsing examples for Python, Node.js, and Rust showing
real-world event-driven progress tracking.

File grows from 1100 to 1837 lines (+737 lines, ~67%).

Closes: pdftract-3b1x
2026-05-24 07:48:09 -04:00
jedarden
1791bb6d80 docs(pdftract-32y9): finalize SDK architecture note with workspace layout, cross-compile matrix, and KU-12 alignment
- Add workspace layout section documenting pdftract-core as the only direct dependency,
  with pdftract-cli, pdftract-py, and pdftract-inspector-ui as siblings
- Update binary distribution table with correct target triples (musl not gnu for Linux)
- Add KU-12 cross-platform test limitation section with verbatim wording from plan:
  "Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release"
- Add Argo CI templates section (pdftract-cargo-build, pdftract-maturin-build)
- Add feature flag composition section with tiers, dependencies, and binary size budgets
- Add cross-references to sdk-invocation.md, sdk-contract.md, ocr-language-packs.md
- Fix clippy warnings in build.rs files (expect_fun_call, get_first, manual_strip, unused imports)

Closes: pdftract-32y9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 06:38:23 -04:00
jedarden
67b3fde4d6 feat(pdftract-j6yd): implement signatures array output + validation_status enum + schema integration
Add document-level /signatures array output per Phase 7.3 of the plan.

Changes:
- Add SignatureJson struct to schema module with all signature metadata fields
- Update ExtractionResult to include signatures: Vec<SignatureJson>
- Integrate signature extraction into extract_pdf() pipeline
- Update result_to_json() to include signatures in JSON output
- Update JSON schema with signatures array and SignatureJson definition
- Add markdown sink signatures footer when signatures are present
- Add comprehensive tests for signature JSON serialization and validation

Acceptance criteria:
- Schema tests: 5/5 signature JSON tests pass
- Markdown sink emits Signatures footer when count > 0
- PyO3 binding automatically handles Vec<SignatureJson> via serde
- docs/schema/v1.0/pdftract.schema.json updated with signatures shape

Verification note: notes/pdftract-j6yd.md

Closes: pdftract-j6yd
2026-05-24 04:05:34 -04:00
jedarden
d174725241 docs(pdftract-5vhp): bring word-boundary-reconstruction.md to v1.0 final-pass
Complete documentation of the adaptive word-boundary algorithm including:
- Initial threshold = 0.25 * font_size
- 20-glyph median adjustment
- 1.5x median formula
- Full Tc/Tw/Tz (character-spacing, word-spacing, horizontal-scaling) corrections

Expanded from 202 lines to 899 lines with:
- Section 3.1: Tc/Tw/Tz formula with explicit parameter table
- Section 3.2: Text-space vs. device-space comparison per plan line 1550
- Section 4: Adaptive algorithm specification (20-glyph window, 1.5× median, outlier exclusion)
- Section 11: Complete pseudo-code (data structures, main loop, detection, threshold computation)
- Section 12: Edge cases (ZWJ, combining marks, CJK, justified text, monospaced, RTL, ligatures, soft hyphens, tabs)
- Section 13: Validation methodology (corpus at tests/fixtures/word-boundary-corpus/, 141 PDFs, 8 categories)
- Section 14: Implementation checklist and references
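
One possible reading of the adaptive threshold, shown as a sketch: use 0.25 * font_size until 20 samples are available, then 1.5 times the median of the last 20. Which per-glyph quantity the median is taken over is an assumption here (inter-glyph gaps), and the Tc/Tw/Tz corrections and outlier exclusion are left out; the function names are hypothetical.

```rust
const WINDOW: usize = 20;

/// Median of a non-empty slice (callers below always pass WINDOW samples).
fn median(values: &[f64]) -> f64 {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Gap threshold: initial rule until the window is full, then 1.5x the median.
pub fn gap_threshold(font_size: f64, recent_samples: &[f64]) -> f64 {
    if recent_samples.len() < WINDOW {
        0.25 * font_size
    } else {
        let window = &recent_samples[recent_samples.len() - WINDOW..];
        1.5 * median(window)
    }
}

/// A gap wider than the current threshold is treated as a word boundary.
pub fn is_word_boundary(gap: f64, font_size: f64, recent_samples: &[f64]) -> bool {
    gap > gap_threshold(font_size, recent_samples)
}
```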

Closes: pdftract-5vhp
2026-05-24 03:55:43 -04:00
jedarden
28c31ba0a1 feat(pdftract-vk0gc): implement markdown anchors with parser regex
Add --md-anchors flag that emits HTML comment markers before each block
in Markdown output, allowing downstream tools to map excerpts back to
precise PDF locations.

Changes:
- Add markdown module with Anchor struct and parse_anchors() function
- Regex: <!-- pdftract: page=(\d+) block=(\d+) bbox=\[([\d.,]+)\] kind=(\w+) -->
- Add markdown_anchors: bool to ExtractionOptions
- Add --md-anchors CLI flag
- Implement block_to_markdown() and page_to_markdown() functions
- Add comprehensive documentation in docs/integrations/markdown-anchors.md
- 16 unit tests pass, including roundtrip test
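
For downstream tools, reading an anchor back might look like the sketch below. It uses plain string operations from the standard library instead of the regex named in the commit (in regex form the literal brackets around bbox must be escaped), and the `Anchor` field names and `parse_anchor` signature are assumptions rather than the module's actual API.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Anchor {
    pub page: u32,
    pub block: u32,
    pub bbox: Vec<f64>,
    pub kind: String,
}

/// Parse one `<!-- pdftract: ... -->` comment line; `None` if it is not an anchor.
pub fn parse_anchor(line: &str) -> Option<Anchor> {
    let body = line
        .trim()
        .strip_prefix("<!-- pdftract:")?
        .strip_suffix("-->")?
        .trim();

    let mut page: Option<u32> = None;
    let mut block: Option<u32> = None;
    let mut bbox: Option<Vec<f64>> = None;
    let mut kind: Option<String> = None;

    // Fields are whitespace-separated `key=value` pairs.
    for field in body.split_whitespace() {
        let (key, value) = field.split_once('=')?;
        match key {
            "page" => page = value.parse().ok(),
            "block" => block = value.parse().ok(),
            "bbox" => {
                let inner = value.strip_prefix('[')?.strip_suffix(']')?;
                let nums: Result<Vec<f64>, _> = inner.split(',').map(str::parse::<f64>).collect();
                bbox = nums.ok();
            }
            "kind" => kind = Some(value.to_string()),
            _ => {} // Ignore unknown keys in this sketch.
        }
    }

    Some(Anchor { page: page?, block: block?, bbox: bbox?, kind: kind? })
}
```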

Closes: pdftract-vk0gc
2026-05-24 02:49:16 -04:00
jedarden
cf8f04e3ec docs(pdftract-26r8): finalize glyph recognition research note v1.0
- Reorganize around the four-level Unicode recovery cascade from plan
- Document all cascade levels with confidence scores:
  - Level 1: ToUnicode CMap (1.0)
  - Level 2: Encoding + AGL (0.9)
  - Level 3: Font fingerprint cache (0.85)
  - Level 4: Glyph shape recognition (0.7)
- Add shape database design (pHash algorithm, query, format)
- Document pHash collision tie-break rules (frequency-based)
- Add Type 3 font handling section
- Cross-reference Phase 2.2, 2.4, 2.5 and OQ-02
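
The cascade order and confidence scores listed above can be expressed as a small lookup. In this sketch the per-level lookups are stubbed as an array of optional results; the type and function names are hypothetical, and only the four levels and their scores come from the commit message.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RecoveryLevel {
    ToUnicodeCMap,
    EncodingAgl,
    FontFingerprint,
    GlyphShape,
}

impl RecoveryLevel {
    /// Levels in the order they are tried.
    pub const CASCADE: [RecoveryLevel; 4] = [
        RecoveryLevel::ToUnicodeCMap,
        RecoveryLevel::EncodingAgl,
        RecoveryLevel::FontFingerprint,
        RecoveryLevel::GlyphShape,
    ];

    /// Confidence score attached to a character recovered at this level.
    pub fn confidence(self) -> f64 {
        match self {
            RecoveryLevel::ToUnicodeCMap => 1.0,
            RecoveryLevel::EncodingAgl => 0.9,
            RecoveryLevel::FontFingerprint => 0.85,
            RecoveryLevel::GlyphShape => 0.7,
        }
    }
}

/// `candidates[i]` is the (stubbed) result of cascade level `i`;
/// the first level that produced a character wins.
pub fn recover(candidates: [Option<char>; 4]) -> Option<(char, RecoveryLevel)> {
    RecoveryLevel::CASCADE
        .iter()
        .copied()
        .zip(candidates.iter().copied())
        .find_map(|(level, ch)| ch.map(|c| (c, level)))
}
```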

File grows from 112 to 210 lines. Covers all acceptance criteria.

Closes: pdftract-26r8
2026-05-24 02:10:06 -04:00
jedarden
92e90af0b0 feat(pdftract-zy2jx): generate JSON Schema from Rust output types
- Add schemars dependency to pdftract-core (v1.2)
- Add JsonSchema derives to output types (ExtractionResult, PageResult, ExtractionMetadata, SpanJson, BlockJson, CellJson, RowJson, TableJson, ExtractionQuality, Receipt, ReceiptsMode)
- Create xtask/src/bin/gen_schema.rs for schema generation
- Add gen-schema command to xtask main.rs
- Generate docs/schema/v1.0/pdftract.schema.json using Draft 2020-12

Schema includes:
- $schema: "https://json-schema.org/draft/2020-12/schema"
- $defs with all output type definitions
- Proper type annotations for all fields

Closes: pdftract-zy2jx
2026-05-24 01:29:14 -04:00
jedarden
bf37f0f05f docs(pdftract-645y): finalize extraction-output-schema.md v1.0 with all Phase 6.1 fields
This commit brings docs/research/extraction-output-schema.md to v1.0 final-pass
specification, aligning with Phase 6.1 deliverables and plan requirements.

**Key additions:**
- page_number field documented with page_index relationship (1-based vs 0-based)
- page_type enum expanded with all six values: text, scanned, mixed, broken_vector,
  blank, figure_only — with broken_vector cross-referenced to Phase 5.5
- Block kind enum fully documented: paragraph, heading, list, table, figure, caption,
  code, formula, watermark, header, footer
- Attachments schema with base64 contentEncoding and 50MB truncation rule
- Profile-based classification fields (document_type, document_type_confidence,
  document_type_reasons, profile_name, profile_version, profile_fields)
- Schema Version Compatibility section with additive-evolution rules
- JSON Schema cross-reference throughout

**Format changes:**
- Restructured with ATX headings (## for sections)
- Added explicit field tables for each major schema section
- Cross-linked to machine-readable JSON Schema at docs/schema/v1.0/pdftract.schema.json
- Grew from 81 lines to 304 lines per acceptance criteria

**Plan references:**
- Lines 97, 2002-2030, 2017, 1836, 2640, 1709, 1752, 2989-3006, 3659
- INV-9 page_type taxonomy stability

Co-Authored-By: Claude Code (GLM-4.7) <noreply@anthropic.com>
2026-05-24 00:59:23 -04:00
jedarden
d14ec92fcb feat(pdftract-3zhf): add unified TableDetector::detect entry point
Add unified detect() method to TableDetector that combines both
line-based and borderless table detection pipelines. This completes
the coordinator bead for Phase 7.2: Table Detection and Structure
Reconstruction.

All child beads (7.2.1-7.2.6) are closed:
- 7.2.1: Line-based detection (path segment clustering)
- 7.2.2: Borderless detection (x0 alignment heuristic)
- 7.2.3: Span-to-cell assignment (centroid containment)
- 7.2.4: Header row detection (bold + StructTree TH)
- 7.2.5: Merged cell detection (missing interior edges)
- 7.2.6: Table JSON output schema integration

Critical tests pass:
- 5x3 bordered table (15 cells extracted)
- Merged header cell colspan=3
- Borderless 3-column table detection
- Two-page table continuation detection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:51:59 -04:00
jedarden
33372c23ae fix(pdftract-3c4i): export detect_merged_cells from table module
The detect_merged_cells function was implemented but not exported from
the table module, making it inaccessible to library users. This commit
adds the function to the public API exports.

Also adds a verification note documenting the complete implementation
and the export fix.

Acceptance criteria status:
- All 6 merged cell detection tests pass
- Public Cell.rowspan/colspan fields exist with default 1
- Absorbed cells are excluded from output
- Bbox of merged cell covers absorbed cells
- Borderless tables NO-OP with diagnostic

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:23:14 -04:00
jedarden
26bdd255c8 feat(pdftract-ilen): implement header row detection with bold+TH support
Implement header row detection for tables using two signals:
1. Bold font detection (fully implemented)
2. StructTree TH detection (stub pending MCID tracking)

Bold detection:
- is_bold_font(): detects bold fonts from PostScript name patterns
- is_cell_bold(): checks if all non-whitespace content in a cell is bold
- is_bold_header_row(): validates rows with >=2 bold cells
- count_header_rows(): counts contiguous bold headers from top
- Cell::mark_header_rows(): sets is_header_row flag on cells
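
A simplified sketch of the bold signal, under several assumptions: a font counts as bold when its PostScript name contains "bold" (case-insensitive, after any subset prefix), an empty cell is not bold, and "rows with >=2 bold cells" is read as at least two bold cells in the row. Cells are modeled as lists of (font name, text) pairs, which is not the project's actual data model.

```rust
/// Assumption: the name pattern is just the substring "bold".
pub fn is_bold_font(postscript_name: &str) -> bool {
    // Drop a subset prefix such as "ABCDEF+" if present.
    let name = postscript_name
        .split_once('+')
        .map_or(postscript_name, |(_, rest)| rest);
    name.to_ascii_lowercase().contains("bold")
}

/// True when the cell has non-whitespace content and all of it is bold.
pub fn is_cell_bold(cell: &[(&str, &str)]) -> bool {
    let mut content = cell
        .iter()
        .filter(|(_, text)| !text.trim().is_empty())
        .peekable();
    content.peek().is_some() && content.all(|(font, _)| is_bold_font(font))
}

/// A header row needs at least two bold cells (single-cell rows are excluded).
pub fn is_bold_header_row(row: &[Vec<(&str, &str)>]) -> bool {
    row.iter().filter(|cell| is_cell_bold(cell)).count() >= 2
}

/// Count contiguous header rows starting from the top of the table.
pub fn count_header_rows(rows: &[Vec<Vec<(&str, &str)>>]) -> usize {
    rows.iter().take_while(|row| is_bold_header_row(row)).count()
}
```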

TH detection (stub):
- is_th_header_row(): placeholder for StructTree TH detection
  Requires MCID tracking on TableSpan (future work)
  Will use ParentTree to map MCIDs to StructElems
  Will verify TR > TH chain structure

Combined detection:
- is_header_row(): combines bold and TH signals
- Bold wins on conflict per body data design principle

Documentation:
- Updated table-structure-reconstruction.md with full header detection spec
- Documented implemented vs pending signals
- Added implementation notes for TH detection

Tests:
- 45 tests covering all bold detection scenarios
- Tests for multi-row headers (contiguous from top)
- Tests for single-cell row exclusion
- Tests for empty/whitespace cell handling
- Placeholder tests for TH detection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:32:54 -04:00
jedarden
9b5fbc9b5e feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction
- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()

This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 12:30:26 -04:00
jedarden
9fca24c77a docs(plan): SDKs are monorepo members, not separate repos
Add a Repository Layout subsection: SDK source lives at root-level pdftract-<lang>/
in this monorepo (single source of truth), generated via pdftract sdk codegen and
published to language registries from here. Retire the legacy standalone repos.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-22 07:21:45 -04:00
jedarden
2251f8a9c0 docs(plan): make bounded peak-RSS a CI-gated target; default max_decompress_bytes 2GB->512MB
Add a Memory targets table as a first-class acceptance criterion alongside
Accuracy/Speed/Weight, with a hard per-document peak-RSS ceiling that must not
scale with input/payload. Promote OOM-safety to a Tier-1 hard gate. Reconcile
the contradictory 2 GB max_decompress_bytes default to the research-backed 512 MB
(root cause of an observed multi-GB OOM via the unbounded PNG-predictor pre-alloc
under rayon page parallelism).
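
The general idea of a decompression cap can be sketched with the standard library alone: read at most one byte past the limit and fail if the limit is exceeded, so output never grows without bound. This is an illustrative helper, not the project's decoder; `read_bounded` and the constant name are hypothetical, and only the 512 MB default comes from the commit message.

```rust
use std::io::{self, Read};

/// Default cap taken from the commit message (512 MB).
pub const DEFAULT_MAX_DECOMPRESS_BYTES: u64 = 512 * 1024 * 1024;

/// Read everything from `reader`, failing once more than `max_bytes` arrive.
pub fn read_bounded<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // Allow one extra byte so an over-limit stream is detectable.
    let read = reader
        .take(max_bytes.saturating_add(1))
        .read_to_end(&mut out)?;
    if read as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decoded stream exceeds max_decompress_bytes",
        ));
    }
    Ok(out)
}
```

In practice `reader` would be the decoding stream for a single filter, so the cap applies to decoded output rather than to the compressed input.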

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 23:25:50 -04:00
jedarden
bb5346b305 docs(pdftract-58kz): add security policy documentation
Add comprehensive SECURITY.md covering:
- Supported versions policy
- Private vulnerability reporting (email + GitHub)
- 90-day disclosure window with timelines
- CVE assignment via GitHub Security Advisories
- In-scope and out-of-scope vulnerability classes
- Safe harbor policy for good-faith researchers

Add security issue template redirecting users to private reporting.
Add Security section to CONTRIBUTING.md and README.md with links to SECURITY.md.
Add docs/security/pgp-public-key.asc placeholder with generation instructions.

References: bead pdftract-58kz, plan line 3433

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 19:39:24 -04:00
jedarden
9456d8e231 feat(pdftract-5omc): implement per-language conformance test runner pattern
Implements the conformance test runner pattern for all 10 SDKs as specified
in the plan (line 3547). Each SDK now has a dedicated conformance test runner.

Created:
- tests/sdk-conformance/report-schema.json: JSON schema for conformance reports
- docs/notes/sdk-conformance-runner.md: Pattern documentation and reference
- crates/pdftract-cli/tests/conformance.rs: Rust cargo test target
- tests/conformance/test_conformance.py: Python pytest harness
- tests/conformance/conformance.test.ts: Node.js vitest runner
- tests/conformance/conformance_test.go: Go go test runner
- tests/conformance/ConformanceTest.java: Java JUnit 5 runner
- tests/conformance/ConformanceTests.cs: .NET xUnit runner
- tests/conformance/conformance.c: C standalone binary
- tests/conformance/conformance_test.rb: Ruby minitest runner
- tests/conformance/ConformanceTest.php: PHP PHPUnit runner
- tests/conformance/ConformanceTests.swift: Swift XCTest runner

All runners implement:
- Loading of tests/sdk-conformance/cases.json
- Execution of test cases with language-native method invocations
- Comparison of results against expected values with numeric tolerances
- Emission of machine-readable conformance-report.json
- Non-zero exit on failures/errors for CI gating
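
The comparison and exit-code steps above can be sketched roughly as follows. This is an illustration only: `Case`, `Outcome`, `check`, `report_line`, and `exit_code` are hypothetical names, the case is reduced to a single numeric expectation, and loading cases.json and writing conformance-report.json are left out.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Pass,
    Fail,
}

/// Hypothetical, simplified test case: one numeric expectation with a tolerance.
struct Case {
    name: &'static str,
    expected: f64,
    tolerance: f64,
}

/// Compares an actual value against the expected one within the case's tolerance.
fn check(case: &Case, actual: f64) -> Outcome {
    if (actual - case.expected).abs() <= case.tolerance {
        Outcome::Pass
    } else {
        Outcome::Fail
    }
}

/// One human-readable report line per case.
fn report_line(case: &Case, outcome: &Outcome) -> String {
    format!("{}: {:?}", case.name, outcome)
}

/// Process exit code for CI gating: 0 only if every case passed.
fn exit_code(outcomes: &[Outcome]) -> i32 {
    if outcomes.iter().all(|o| *o == Outcome::Pass) {
        0
    } else {
        1
    }
}
```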

Acceptance criteria:
- PASS: All 10 SDKs have language-specific runners
- PASS: Runners consume shared cases.json
- PASS: Runners emit JSON reports matching schema
- PASS: Runners exit non-zero on failure
- WARN: README integration pending SDK repo creation
- WARN: Stub implementations return placeholder results

References:
- Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner"
- Plan line 3589: "Conformance suite results published as Argo artifact"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5omc
2026-05-18 01:32:24 -04:00
jedarden
857f928732 feat(pdftract-5omc): implement SDK conformance test runner pattern
Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.

- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
  * Full test suite loader and executor
  * Comparison engine with min/max, string constraints, tolerances
  * Skip logic for unsupported features and schema versions
  * Report generation in JSON format

- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
  * pdftract compare - Compare actual vs expected with tolerances
  * Cross-language comparison tool to avoid reimplementations

- Documentation (docs/conformance/sdk-contract.md)
  * Complete pattern specification with pseudocode
  * Per-language runner locations
  * CI integration requirements

- Python reference stub (tests/python-conformance/test_conformance.py)
  * Full pytest-based implementation following the pattern

Closes: pdftract-5omc
2026-05-18 01:22:23 -04:00
jedarden
a34f9c18d0 docs(pdftract-1g87): create mdBook scaffolding for user documentation
- book.toml with title, authors, build directory, edit-url-template
- src/SUMMARY.md with complete TOC for all planned sections
- src/introduction.md: what pdftract does and doesn't do (Non-Goals)
- src/installation.md: cargo, pip, Homebrew, Docker; KU-12 caveat verbatim
- src/quickstart.md: five-minute walkthrough with executable commands
- 39 draft placeholder files for CLI reference, schema, profiles, SDKs, advanced topics, troubleshooting, FAQ

mdbook build completes cleanly with zero warnings (linkcheck optional).

See notes/pdftract-1g87.md for verification details.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 00:38:51 -04:00
jedarden
5e66846288 docs(pdftract-147a): author SDK contract specification
Add comprehensive SDK contract specification at docs/notes/sdk-contract.md.
This document serves as the constitutional specification for all pdftract
SDK implementations across all languages.

The contract defines:
- Method surface (9 methods mirroring CLI/MCP tools)
- Error mapping (CLI exit codes → native exceptions)
- Versioning compatibility rules (MAJOR lock, MINOR flexibility)
- Option-naming conventions (CLI flag → language-native case)
- Native type-mapping requirements (Document, Page, Span, Block, Match, Fingerprint, Classification)
- Async conventions per language
- Conformance enforcement (100% pass required)
- Change policy (ADR required for contract changes)

Verification note: notes/pdftract-147a.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:13:55 -04:00
jedarden
9f27d16f25 docs(phase-0.1): verify pdftract-ci scaffolding complete
Verified the pdftract-ci WorkflowTemplate exists in declarative-config
and is correctly synced to the iad-ci cluster. All scaffolding
requirements met for Phase 0.1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 03:24:36 -04:00
jedarden
7035706068 docs(plan): fix 3 HIGH gaps + 3 LOW items from Round 5 gap review
HIGH:
- Add outline/bookmark traversal spec to Phase 1.4 (linked list walk, PDFDocEncoding vs UTF-16BE)
- Specify base64 encoding for attachment data field in Phase 7.5
- Move decompression limit to ExtractionOptions.max_decompress_bytes (universal, not serve-only);
  add max_decompress_gb to CLI/Python/HTTP API surfaces

LOW:
- Split log+env_logger into two dep matrix rows for accurate crate count
- Add full_render to Python keyword args and HTTP form fields (with no-op note)
- Clarify v0.1.0 milestone: "all applicable" targets (OCR speed target excluded until v0.2.0)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:30:02 -04:00
jedarden
2ba51a8a73 docs(plan): fix 4 gaps from Round 4 gap review
- Fix quick-xml feature gate: move from ocr to default (XMP conformance detection)
- Make page_number schema update an explicit Phase 6.1 deliverable
- Add PageClass → page_type mapping table; define broken_vector as valid output value
- Fix CI test matrix: musl target excludes ocr/python features; glibc runs --all-features

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:24:12 -04:00
jedarden
2d194a4b1b docs(plan): fix 15 gaps from Round 3 gap review
HIGH:
- Add fontdue crate for glyph rasterization (ttf-parser is a parser, not rasterizer)
- Remove num_cpus reference (rayon default pool sizing is sufficient)
- Update dep count target to < 30 direct crates (< 20 was violated by plan's own list)
- Fix watermark deferral: Phase 7 not Phase 6; no kind:'watermark' until Phase 7
- Add Phase 7.6 (Hyperlink/Annotation Extraction) and 7.7 (Article Thread Chains)

MEDIUM:
- Document header/footer streaming mode limitation: first 3 pages emit as paragraph
- Add conformance/XFA detection spec to Phase 1.4; move quick-xml to default feature
- Clarify pdftract-py-ci is Phase 0 stub, filled in during Phase 6.3
- Specify /Contents array concatenation in Phase 1.4 page tree
- Add page rotation un-rotation step after Phase 3 glyph bbox computation
- Add password delivery: ExtractionOptions.password, --password CLI, HTTP form, Python kwarg
- Fix glyph shape DB: phf::Map → sorted &'static [(u64,char)] slice for Hamming nearest-neighbor
- Add Python benchmark runner infrastructure (python:3.11-slim, requirements.txt, hyperfine)
- Add wordlist-bloom to Feature flags bullet list

LOW:
- Clarify extract_stream() yields page dicts only, not header/footer frames
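
To illustrate the glyph shape DB item above: a perfect-hash map can only answer exact-key lookups, whereas a plain slice can also be scanned for the entry with the smallest Hamming distance. The sketch below assumes 64-bit shape hashes; the table contents, the `nearest` function, and the `max_distance` cutoff are hypothetical.

```rust
/// Hypothetical shape table, sorted by hash so exact hits can use binary search.
static SHAPES: &[(u64, char)] = &[
    (0x0000_0000_0000_00FF, 'a'),
    (0x0000_0000_FFFF_0000, 'b'),
    (0xFFFF_0000_0000_0000, 'c'),
];

/// Returns the character whose shape hash is closest to `hash` in Hamming
/// distance, or `None` if no entry is within `max_distance` bits.
fn nearest(db: &[(u64, char)], hash: u64, max_distance: u32) -> Option<char> {
    // Fast path: exact match via binary search on the sorted slice.
    if let Ok(i) = db.binary_search_by_key(&hash, |&(h, _)| h) {
        return Some(db[i].1);
    }
    // Slow path: linear scan, Hamming distance = popcount of the XOR.
    db.iter()
        .map(|&(h, c)| ((h ^ hash).count_ones(), c))
        .filter(|&(d, _)| d <= max_distance)
        .min_by_key(|&(d, _)| d)
        .map(|(_, c)| c)
}
```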

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:18:33 -04:00
jedarden
eb799c0956 docs(plan): fix 21 gaps from Round 2 gap review
CRITICAL:
- Fix deskew step: pixDeskew operates on grayscale, not binarized image

HIGH:
- Add sha2 crate to dep matrix (needed for font fingerprint hashing)
- Fix bloomfilter feature: wordlist-bloom (optional), not default conditional
- Add build-dependencies subsection (phf_codegen, serde_json)
- Add v0.1.0 fallback for tagged PDFs: XY-cut with TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic
- Add v0.1.0 fallback for BrokenVector: emit broken_vector page_type when ocr feature absent
- Add strsim crate for Levenshtein in header/footer deduplication
- Add tokio::task::spawn_blocking bridge for axum→rayon hand-off
- Fix JPX/CCITT OCR path: document full-render requirement; add OCR_JPX/CCITT_UNSUPPORTED diagnostics
- Fix OCG default visibility: /D/BaseState + /D/ON + /D/OFF (was incorrectly /D/AS)

MEDIUM:
- Add JBIG2 OCR limitation: requires full-render; OCR_JBIG2_UNSUPPORTED diagnostic
- Add Standard-14 font skip for Level 3 fingerprinting (no embedded program)
- Change flags field from EnumSet<SpanFlag> to u8 bitmask (removes undocumented enumset dep)
- Add tests/fixtures/encoding/ and tests/fixtures/perf/ to Tier 2 fixture list
- Add ocg_present to Phase 6.1 metadata field list
- Add "Phase 7 feature; empty in Phase 6" notes to links and annotations fields
- Add include_invisible/extract_forms/extract_attachments to HTTP serve form fields
- Clarify Phase 4.7 lang source: document-level /Lang, not per-span (Phase 7)
- Fix Stage 1-2 → Phases 1-2 in architectural decisions (stale draft terminology)
- Remove frame-index notation from NDJSON streaming critical test
- Define Hybrid threshold (≥15% each) and Hybrid merge strategy (vector wins on overlap)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 18:05:26 -04:00
jedarden
bcccc98fd7 docs(plan): fix 30 gaps from Round 1 gap review
CRITICAL fixes:
- Remove jpeg-decoder from Phase 1.5 crates (contradicted dep matrix)
- Specify word boundary adaptive threshold: text space, per-font-switch window, 20-glyph seed
- Add page_number (1-based) alongside page_index (0-based) to resolve SDK/schema mismatch
- Add mcid: Option<u32> to Glyph struct (was defined in 3.4 but missing from 3.2)
- Add aes + rc4 crates under new decrypt feature; document crypto dependency

HIGH fixes:
- Specify font fingerprint database format (phf::Map, SHA-256, ~500KB, JSON source)
- Fix Level 4 shape DB cross-ref (was "Phase 2.3", corrected to research doc); add Phase 2.5 definition
- Document header/footer cross-page pass as sequential post-rayon with Levenshtein matching
- Replace Tesseract box-file hint approach with PSM_SPARSE_TEXT + post-OCR validation
- Add HTTP serve security constraints: decompression bomb limit, auth guidance, no path params
- Add JavaScript detection spec to Phase 1.4 (all four JS action locations)
- Align CI benchmark gate to 10x pdfminer.six (was 5x, contradicted primary objectives)
- Add cargo bloat CI gate for phf word list size; bloomfilter fallback if >250KB
- Add pdftract-py-ci WorkflowTemplate note with manylinux/osxcross/cross approach
- Add ConfidenceSource enum → schema string mapping table in Phase 4.1

MEDIUM fixes:
- Define docs/schema/v1.0/pdftract.schema.json as Phase 6.1 deliverable
- Add unicode-bidi crate to dep matrix and Phase 4.2 for RTL detection
- Define Color enum with CSS hex conversion rules in Phase 3.1
- Remove bytes crate from Phase 1.2 (belongs in serve feature only; use Arc<[u8]>)
- Specify NDJSON buffer Condvar blocking behavior at window saturation
- Clarify pdftract:ocr vs pdftract:full Docker image tags and size budgets
- Add Docstrum parameters: k=5, Euclidean, ±30° constraints, root node definition
- Add code and formula block kind detection heuristics to Phase 4.4
- Add OCG visibility handling to Phase 1.4 (ON/OFF from /OCProperties /D /AS)
- Add linearized PDF detection and dual-xref merge to Phase 1.3
- Add HTTP 413 to error table with custom JSON rejection handler
- Add Phase 0: CI Infrastructure section (pdftract-ci WorkflowTemplate)

LOW fixes:
- Clarify Name length limit: 127 bytes pre-expansion, matching PDF spec 7.3.5
- Reorder preprocessing pipeline: contrast normalization before binarization (was after)
- Add CIDToGIDMap stream form: 2-byte big-endian GID array
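
For the CIDToGIDMap stream form, the glyph index for CID `c` sits at bytes `2c` and `2c + 1` of the decoded stream, high-order byte first. A minimal lookup sketch (the function name `cid_to_gid` is hypothetical):

```rust
/// Looks up the glyph ID for `cid` in a decoded CIDToGIDMap stream.
/// Returns `None` if the CID lies beyond the end of the map.
fn cid_to_gid(map: &[u8], cid: u16) -> Option<u16> {
    let offset = usize::from(cid) * 2;
    let pair = map.get(offset..offset + 2)?;
    // Two bytes per entry, big-endian.
    Some(u16::from_be_bytes([pair[0], pair[1]]))
}
```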

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 17:45:04 -04:00
jedarden
d161d109b3 docs(plan): revise plan to center accuracy/speed/weight as hard targets
- Add Primary Objectives section with CI-gated measurable targets:
  accuracy (CER <0.5%, WER <3%, readability >0.85), speed (100pp <3s,
  10x vs pdfminer), weight (<4MB default binary, <20 default deps)
- Add feature-flag strategy: axum/tokio/pdfium/pyo3 are all optional;
  default build is core extraction + CLI only
- Add Phase 4.7: text readability validation and correction pipeline
  (ligature repair, hyphenation, mojibake detection, readability scoring)
- Make pdfium-render explicitly optional (full-render feature) vs. the
  always-present direct image compositing path
- Add Tier 4 competitive benchmark suite (vs. pdfminer.six, pypdf, pdfplumber)
- Remove jpeg-decoder and whichlang from dependency matrix (unnecessary)
- Rename implementation-plan.md → plan.md (matches CLAUDE.md reference)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 17:07:48 -04:00
jedarden
8753630bc3 Add parallel extraction research and comprehensive research index
New research document covering parallel extraction architecture:
rayon page-level parallelism, Arc<> shared xref/font/object-stream
caches, RwLock font cache design, Tesseract thread-local OCR pool,
semaphore memory budget, ordered NDJSON streaming slot array, and
catch_unwind error isolation per page.
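
A rough sketch of per-page error isolation with `std::panic::catch_unwind`. It runs sequentially for brevity, whereas the research document describes rayon page-level parallelism; `extract_isolated` and the `extract` callback are hypothetical names. Note that `catch_unwind` only works when the binary is built with unwinding panics, not `panic = "abort"`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs `extract` once per page and converts a panic on any single page into
/// an `Err` for that page, so the remaining pages are still processed.
fn extract_isolated<F>(page_count: usize, extract: F) -> Vec<Result<String, String>>
where
    F: Fn(usize) -> String,
{
    (0..page_count)
        .map(|i| {
            // AssertUnwindSafe: the closure only borrows `extract`, and we
            // discard any state from a page that panicked.
            panic::catch_unwind(AssertUnwindSafe(|| extract(i)))
                .map_err(|_| format!("page {i} panicked"))
        })
        .collect()
}
```

The result vector keeps one slot per page in page order, so a failed page shows up as an error entry at its own index rather than aborting the whole document.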

Also adds docs/research-index.md: a 622-line navigable index of all
83 research documents grouped into 9 thematic categories, with a
"Start Here" reading path, per-phase implementation reading tables,
and an alphabetical lookup table covering every document.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:30:35 -04:00
jedarden
92e6196ac5 Add research: Ruby/furigana typography, PDF/VT variable printing
Two new research documents covering Japanese Ruby text and East Asian
typography (tagged/untagged furigana extraction, Kinsoku Shori spacing,
full-width normalization, tate-chu-yoko, CJK/Latin boundary detection,
ruby_text output field) and PDF/VT variable and transactional printing
(DPart hierarchy traversal, per-record extraction model, DPM metadata,
variable vs. static content classification, postal address extraction,
records array output schema).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:24:21 -04:00
jedarden
e3b72efc83 Add research: Southeast Asian scripts, OpenType MATH formula extraction
Two new research documents covering Southeast Asian script extraction
(Thai/Khmer/Myanmar/Lao/Tibetan/Ethiopic — cluster structure, no-space
word boundary policy for Thai/Lao, Zawgyi vs Unicode detection for
Myanmar, USE shaping, Tesseract fallback) and OpenType MATH table
exploitation for formula extraction (MathConstants for fraction/
subscript/radical layout, TeX OML/OMS/OMX encoding tables, MathML
output generation, GlyphAssembly reconstruction, alternative text
and MathJax XMP source recovery).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:21:48 -04:00