Commit graph

10 commits

Author SHA1 Message Date
jedarden
eb025f7b1a docs(pdftract-3wrx): add release signing strategy note
Resolves OQ-10: document v1.0.0 stance on binary signing.
- Linux: GPG-signed (implemented)
- macOS: Deferred to v1.1+ ($99/yr Apple Developer Program)
- Windows: Deferred to v1.1+ ($200-400/yr Authenticode cert)
- All platforms: SLSA Level 2 attestation (already committed)

Closes: pdftract-3wrx

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 11:12:56 -04:00
jedarden
57df42f478 docs(pdftract-3b1x): finalize sdk-invocation.md with subprocess contract and TH-07 compliance
Add comprehensive "Subprocess Contract" section documenting:
- argv layout with canonical form
- stdin discipline (password ingress, PDF bytes from stdin)
- stdout/stderr discipline (what goes where, what never gets logged)
- Exit code taxonomy (0, 64-78) with TH-03 (exit 78) and TH-07 (exit 64) refs
- Environment variable pass-through (PDFTRACT_PASSWORD, PDFTRACT_MCP_TOKEN, etc.)
- --progress-json event schema (ndjson format, all event types)
- --capture-diagnostics archive layout (zip/tar, contained files, scrubbing rules)

Update all language examples (Python, Node.js, Go, Ruby, Java, Rust) with
TH-07-compliant password handling:
- Pass password via PDFTRACT_PASSWORD env var (subprocess)
- Pass password via multipart form field (HTTP)
- Never use --password VALUE flag (rejected unless opt-in)

Add progress JSON parsing examples for Python, Node.js, and Rust showing
real-world event-driven progress tracking.

File grows from 1100 to 1837 lines (+737 lines, ~67%).

Closes: pdftract-3b1x
2026-05-24 07:48:09 -04:00
jedarden
1791bb6d80 docs(pdftract-32y9): finalize SDK architecture note with workspace layout, cross-compile matrix, and KU-12 alignment
- Add workspace layout section documenting pdftract-core as the only direct dependency,
  with pdftract-cli, pdftract-py, and pdftract-inspector-ui as siblings
- Update binary distribution table with correct target triples (musl not gnu for Linux)
- Add KU-12 cross-platform test limitation section with verbatim wording from plan:
  "Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release"
- Add Argo CI templates section (pdftract-cargo-build, pdftract-maturin-build)
- Add feature flag composition section with tiers, dependencies, and binary size budgets
- Add cross-references to sdk-invocation.md, sdk-contract.md, ocr-language-packs.md
- Fix clippy warnings in build.rs files (expect_fun_call, get_first, manual_strip, unused imports)

Closes: pdftract-32y9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 06:38:23 -04:00
jedarden
d14ec92fcb feat(pdftract-3zhf): add unified TableDetector::detect entry point
Add unified detect() method to TableDetector that combines both
line-based and borderless table detection pipelines. This completes
the coordinator bead for Phase 7.2: Table Detection and Structure
Reconstruction.

All child beads (7.2.1-7.2.6) are closed:
- 7.2.1: Line-based detection (path segment clustering)
- 7.2.2: Borderless detection (x0 alignment heuristic)
- 7.2.3: Span-to-cell assignment (centroid containment)
- 7.2.4: Header row detection (bold + StructTree TH)
- 7.2.5: Merged cell detection (missing interior edges)
- 7.2.6: Table JSON output schema integration

Critical tests pass:
- 5x3 bordered table (15 cells extracted)
- Merged header cell colspan=3
- Borderless 3-column table detection
- Two-page table continuation detection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:51:59 -04:00
jedarden
33372c23ae fix(pdftract-3c4i): export detect_merged_cells from table module
The detect_merged_cells function was implemented but not exported from
the table module, making it inaccessible to library users. This commit
adds the function to the public API exports.

Also adds a verification note documenting the complete implementation
and the export fix.

Acceptance criteria status:
- All 6 merged cell detection tests pass
- Public Cell.rowspan/colspan fields exist with default 1
- Absorbed cells are excluded from output
- Bbox of merged cell covers absorbed cells
- Borderless tables NO-OP with diagnostic

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:23:14 -04:00
jedarden
9456d8e231 feat(pdftract-5omc): implement per-language conformance test runner pattern
Implements the conformance test runner pattern for all 10 SDKs as specified
in the plan (line 3547). Each SDK now has a dedicated conformance test runner.

Created:
- tests/sdk-conformance/report-schema.json: JSON schema for conformance reports
- docs/notes/sdk-conformance-runner.md: Pattern documentation and reference
- crates/pdftract-cli/tests/conformance.rs: Rust cargo test target
- tests/conformance/test_conformance.py: Python pytest harness
- tests/conformance/conformance.test.ts: Node.js vitest runner
- tests/conformance/conformance_test.go: Go go test runner
- tests/conformance/ConformanceTest.java: Java JUnit 5 runner
- tests/conformance/ConformanceTests.cs: .NET xUnit runner
- tests/conformance/conformance.c: C standalone binary
- tests/conformance/conformance_test.rb: Ruby minitest runner
- tests/conformance/ConformanceTest.php: PHP PHPUnit runner
- tests/conformance/ConformanceTests.swift: Swift XCTest runner

All runners implement:
- Loading of tests/sdk-conformance/cases.json
- Execution of test cases with language-native method invocations
- Comparison of results against expected values with numeric tolerances
- Emission of machine-readable conformance-report.json
- Non-zero exit on failures/errors for CI gating

Acceptance criteria:
- PASS: All 10 SDKs have language-specific runners
- PASS: Runners consume shared cases.json
- PASS: Runners emit JSON reports matching schema
- PASS: Runners exit non-zero on failure
- WARN: README integration pending SDK repo creation
- WARN: Stub implementations return placeholder results

References:
- Plan line 3547: "Every SDK has a pdftract-sdk-conformance test runner"
- Plan line 3589: "Conformance suite results published as Argo artifact"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-5omc
2026-05-18 01:32:24 -04:00
jedarden
5e66846288 docs(pdftract-147a): author SDK contract specification
Add comprehensive SDK contract specification at docs/notes/sdk-contract.md.
This document serves as the constitutional specification for all pdftract
SDK implementations across all languages.

The contract defines:
- Method surface (9 methods mirroring CLI/MCP tools)
- Error mapping (CLI exit codes → native exceptions)
- Versioning compatibility rules (MAJOR lock, MINOR flexibility)
- Option-naming conventions (CLI flag → language-native case)
- Native type-mapping requirements (Document, Page, Span, Block, Match, Fingerprint, Classification)
- Async conventions per language
- Conformance enforcement (100% pass required)
- Change policy (ADR required for contract changes)

Verification note: notes/pdftract-147a.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:13:55 -04:00
jedarden
ef9c03095d Add SDK architecture notes covering top 10 languages
Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples
for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI
options, and implementation sequencing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:51:25 -04:00
jedarden
c2870e6640 Add research docs and SDK invocation notes
Four research documents covering PDF spec fundamentals, font types and
encoding, glyph Unicode recovery, and tagged PDF structure/reading order.
SDK invocation notes with subprocess and HTTP examples for Python, Node.js,
Go, Ruby, Java, Rust, and Bash.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:33:34 -04:00
jedarden
4ae798c8b1 Initial repo scaffold with README and docs structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:26:16 -04:00