docs(pdftract-32y9): finalize SDK architecture note with workspace layout, cross-compile matrix, and KU-12 alignment

- Add workspace layout section documenting pdftract-core as the only direct dependency,
  with pdftract-cli, pdftract-py, and pdftract-inspector-ui as siblings
- Update binary distribution table with correct target triples (musl not gnu for Linux)
- Add KU-12 cross-platform test limitation section with verbatim wording from plan:
  "Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release"
- Add Argo CI templates section (pdftract-cargo-build, pdftract-maturin-build)
- Add feature flag composition section with tiers, dependencies, and binary size budgets
- Add cross-references to sdk-invocation.md, sdk-contract.md, ocr-language-packs.md
- Fix clippy warnings in build.rs files (expect_fun_call, get_first, manual_strip, unused imports)

Closes: pdftract-32y9

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
jedarden 2026-05-24 06:38:23 -04:00
parent 7a70bb82b8
commit 1791bb6d80
3 changed files with 113 additions and 15 deletions

View file

@ -1,4 +1,3 @@
use std::env;
use std::process::Command;
fn main() {

View file

@ -297,8 +297,8 @@ pub static AGL_MULTI: phf::Map<&'static str, &[char]> = {};
fn decode_json_unicode(s: &str) -> String {
// The JSON has "\\uXXXX" which Rust reads as "\uXXXX"
// We need to extract just the hex part
if s.starts_with("\\u") {
s[2..].to_string()
if let Some(suffix) = s.strip_prefix("\\u") {
suffix.to_string()
} else {
s.to_string()
}
@ -370,7 +370,7 @@ fn generate_font_fingerprints(out_dir: &Path, fingerprints_path: &Path) {
for entry in entries {
let arr = entry.as_array().expect("entry must be an array");
let gid = arr
.get(0)
.first()
.and_then(|v| v.as_u64())
.expect("gid must be a number") as u16;
let codepoint = arr
@ -496,16 +496,15 @@ pub fn cid_to_unicode(cid: u32) -> Option<&'static [char]> {{
collection = module_name.to_uppercase(),
json_name = json_name,
);
fs::write(&out_path, rust_code).expect(&format!("Failed to write {}", out_path.display()));
fs::write(&out_path, rust_code).unwrap_or_else(|_| panic!("Failed to write {}", out_path.display()));
return;
}
let json_content =
fs::read_to_string(&json_path).expect(&format!("Failed to read {}", json_path.display()));
fs::read_to_string(&json_path).unwrap_or_else(|_| panic!("Failed to read {}", json_path.display()));
let data: serde_json::Value = serde_json::from_str(&json_content)
.expect(&format!("Failed to parse {}", json_path.display()));
.unwrap_or_else(|_| panic!("Failed to parse {}", json_path.display()));
// Build phf map
let mut map_builder = phf_codegen::Map::new();
@ -515,7 +514,7 @@ pub fn cid_to_unicode(cid: u32) -> Option<&'static [char]> {{
for (cid_str, unicode_value) in mappings {
let cid: u32 = cid_str
.parse()
.expect(&format!("Invalid CID key: {}", cid_str));
.unwrap_or_else(|_| panic!("Invalid CID key: {}", cid_str));
// Parse the Unicode value
if let Some(unicode_str) = unicode_value.as_str() {
@ -573,7 +572,7 @@ pub fn cid_to_unicode(cid: u32) -> Option<&'static [char]> {{
map = map_builder.build(),
);
fs::write(&out_path, rust_code).expect(&format!("Failed to write {}", out_path.display()));
fs::write(&out_path, rust_code).unwrap_or_else(|_| panic!("Failed to write {}", out_path.display()));
}
/// Parse a Unicode value from JSON to a Vec<char>.
@ -631,7 +630,7 @@ fn parse_unicode_value(s: &str) -> Vec<char> {
/// "frequency_rank": 1
/// }
/// ```
fn generate_shape_db(out_dir: &Path, shapes_path: &Path) {
fn generate_shape_db(out_dir: &Path, _shapes_path: &Path) {
// Resolve shapes_path relative to the workspace root
// build.rs runs from the crate directory, but the build/ dir is at workspace root
// We can find the workspace root by going up from the crate directory
@ -701,7 +700,7 @@ const _: () = assert!(SHAPE_TABLE.len() == FREQ_TABLE.len());
}
// Sort by pHash ascending
sorted_entries.sort_by(|a, b| a.0.cmp(&b.0));
sorted_entries.sort_by_key(|a| a.0);
// Check for duplicate pHash entries
for i in 1..sorted_entries.len() {

View file

@ -7,6 +7,28 @@ JavaScript, Go, Ruby, Java, Rust, and Bash. Gaps: TypeScript, C#, C++, PHP, and
---
## Workspace layout
The workspace is organized so that `pdftract-core` is the only crate that other consumers depend on directly. The CLI, Python bindings, and inspector UI are siblings that compose `pdftract-core` behind their respective surfaces.
```
pdftract/
├── Cargo.toml (workspace root)
├── crates/
│ ├── pdftract-core/ (library — only direct dependency for downstream consumers)
│ ├── pdftract-cli/ (binary)
│ ├── pdftract-py/ (PyO3 bindings, optional feature)
│ └── pdftract-inspector-ui/ (HTML/CSS/JS bundled via include_bytes!, Phase 7.9)
└── docs/
├── plan/plan.md
├── research/ (per-feature deep dives)
└── notes/ (this file, sdk-invocation.md, sdk-contract.md, ocr-language-packs.md)
```
See `docs/plan/plan.md` lines 141268 for the full file and module layout specification.
---
## Common infrastructure (required before any SDK ships)
### Binary distribution
@ -15,16 +37,36 @@ Every SDK approach — subprocess or native — depends on platform binaries pub
| Target triple | Platform |
|---|---|
| `x86_64-unknown-linux-gnu` | Linux x86_64 |
| `aarch64-unknown-linux-gnu` | Linux ARM64 |
| `x86_64-unknown-linux-musl` | Linux x86_64 (production binary) |
| `aarch64-unknown-linux-musl` | Linux ARM64 |
| `x86_64-apple-darwin` | macOS Intel |
| `aarch64-apple-darwin` | macOS Apple Silicon |
| `x86_64-pc-windows-msvc` | Windows x86_64 |
| `x86_64-pc-windows-gnu` | Windows x86_64 |
The CI workflow must cross-compile for all five targets and attach the binaries to a versioned
GitHub Release tag on every release. SDKs pin to a binary version and download the appropriate
artifact at install time.
### Cross-platform test limitation (KU-12)
Per ADR-009, `iad-ci` is Linux-only. **Linux is fully CI-tested; macOS and Windows are build-tested and manually smoke-tested per release.** macOS and Windows binaries are *built* via `cross` on Linux but never *executed* in CI. This is acknowledged as Known Unknown KU-12 with the following mitigation:
- A manual smoke-test runbook in `docs/operations/manual-platform-smoke.md` is executed by the release lead before each milestone tag against at least one physical macOS machine and one Windows VM
- User bug reports for platform-specific issues acknowledged within 48 hours and addressed in the next patch release
See `docs/plan/plan.md` lines 34313436 and lines 608609 for the full KU-12 specification.
### Argo CI templates
Binary and wheel builds are orchestrated by Argo WorkflowTemplates on the `iad-ci` Rackspace Spot cluster:
- `pdftract-cargo-build` — builds the Rust binary for all five target triples using `cross` (Docker-based cross-compilation)
- `pdftract-maturin-build` — builds the PyO3 wheel for all five target triples (uses `ghcr.io/rust-cross/manylinux` for Linux, `osxcross` for macOS, `cross` for Windows)
GitHub Actions is **FORBIDDEN** per ADR-009. All CI runs on `iad-ci`; secrets live in OpenBao and reach workflows via ESO-synced Kubernetes Secrets.
See `docs/plan/plan.md` ADR-009 (lines 495502) and Phase 0.2 (lines 10151029) for the full CI specification.
### Release format
```
@ -35,6 +77,47 @@ Semantic versioning is required before any package is published to a package reg
---
## Feature flag composition
Feature flags control the binary footprint. The default build (`cargo build`) includes only the core extraction path. Heavy optional capabilities are behind named features.
### Feature tiers
| Tier | Features | Binary size (stripped) | Use case |
|---|---|---|---|
| **slim** | `["cli", "decrypt"]` | < 3 MB | Minimal CLI without Markdown |
| **default** | `["cli", "decrypt", "markdown"]` | < 4 MB | Standard CLI with Markdown output |
| **serve** | `default + ["serve"]` | < 12 MB | HTTP server mode |
| **ocr** | `default + ["ocr"]` | < 12 MB | OCR with Tesseract |
| **full** | `default + ["serve", "ocr", "mcp", "inspect", "grep", "profiles", "cache", "receipts", "remote"]` | < 14 MB | All features except `full-render` |
### Feature dependencies
Some features implicitly enable others:
- `serve` → enables `cache` (the HTTP server is the primary cache consumer)
- `mcp` → depends on `serve` (both transports share the HTTP infrastructure)
- `inspect` → depends on `serve` (bundles a ~80 KB static HTML/CSS/JS frontend via `include_bytes!`)
- `grep` → requires `regex` crate
- `profiles` → requires `regex` crate
### Binary size budgets
Per the Primary Objectives (Weight Targets):
| Metric | Target |
|---|---|
| Binary size, default features (no OCR, no serve) | < 4 MB stripped |
| Binary size, `--features ocr,serve` | < 12 MB stripped |
| Binary size, `--features full` (everything except `full-render`) | < 14 MB stripped |
| Docker image, CLI only | < 20 MB (distroless base) |
| Docker image, with OCR (`tesseract-ocr` system pkg) | < 120 MB |
| Docker image, `pdftract:full` | < 140 MB |
See `docs/plan/plan.md` lines 4662 for the full weight targets specification.
---
## Two SDK tracks
### Track A — Subprocess / HTTP wrappers
@ -577,3 +660,20 @@ Separate artifact ID (`pdftract-kotlin`) so Java users don't pull in Kotlin stdl
| 5 | C++ | 12 days | `popen` + libcurl; no package manager standard, distribute as vcpkg port |
All five are blocked on the GitHub Releases binary distribution infrastructure being in place first.
---
## Cross-references
Related documentation:
- **[`sdk-invocation.md`](sdk-invocation.md)** — Subprocess and HTTP invocation patterns for all supported languages
- **[`sdk-contract.md`](sdk-contract.md)** — The constitutional SDK specification (method surface, error mapping, versioning, conformance)
- **[`ocr-language-packs.md`](ocr-language-packs.md)** — Tesseract language pack distribution and installation
- **[`docs/plan/plan.md`](../plan/plan.md)** — The source of truth for all architectural decisions (workspace layout, cross-compile matrix, ADR-009 CI policy, KU-12 platform testing)
See also:
- **Phase 6.3** (PyO3 bindings) — Python wheel build matrix via `pdftract-maturin-build`
- **Phase 7.9** (Inspector UI) — Web debug viewer bundled via `include_bytes!`
- **ADR-009** (Argo Workflows on iad-ci) — CI/CD architecture and cross-compilation strategy