Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that
derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via
the schemars crate.
Changes:
- Add stable key sorting (sort_keys_recursive) for byte-identical output
- Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json
- Set title to "pdftract Output v1.0"
- Add cargo alias `gen-schema` for convenient invocation
- Emit schema to docs/schema/v1.0/pdftract.schema.json
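A minimal sketch of the stable key sorting idea, not the code in gen_schema.rs. The `Json` enum is a hypothetical std-only stand-in for whatever JSON value type the binary actually uses; only the name `sort_keys_recursive` comes from this commit. Objects here keep insertion order, which is the situation where explicit sorting is needed to get byte-identical output.

```rust
/// Hypothetical stand-in for a JSON value (std-only, for illustration).
/// Objects are stored as insertion-ordered key/value pairs.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Sort object keys at every nesting level.
/// Array element order is left alone because it is significant in JSON.
fn sort_keys_recursive(value: &mut Json) {
    match value {
        Json::Object(entries) => {
            // Byte-wise key comparison gives a locale-independent order.
            entries.sort_by(|a, b| a.0.cmp(&b.0));
            for (_, child) in entries.iter_mut() {
                sort_keys_recursive(child);
            }
        }
        Json::Array(items) => {
            for item in items.iter_mut() {
                sort_keys_recursive(item);
            }
        }
        Json::Str(_) => {}
    }
}
```

Two values that differ only in key insertion order compare equal after sorting, so serializing either one yields the same bytes.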
The schema is generated from the Rust types via their schemars derives, so
rerunning the generator after any type change brings the JSON schema back in
sync with the source types.
Acceptance criteria:
- cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json
- Generated schema validates against JSON Schema Draft 2020-12
- Schema $id is the stable URL
- Title is "pdftract Output v1.0"
- Stable ordering: regenerating twice produces byte-identical output
- All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.)
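The byte-identical criterion above could be checked along these lines. This is a sketch under assumptions: `first_difference` and `check_schema_unchanged` are hypothetical helper names that do not appear in this commit, and the caller is assumed to supply the freshly regenerated schema text.

```rust
use std::fs;
use std::path::Path;

/// Return the offset of the first differing byte, or None if the
/// two buffers are identical (same length and same contents).
fn first_difference(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    (0..common)
        .find(|&i| a[i] != b[i])
        // Same prefix but different lengths: they differ where the shorter ends.
        .or(if a.len() != b.len() { Some(common) } else { None })
}

/// Compare the schema file on disk with freshly generated text.
/// Returns Err with a short description when they are not byte-identical.
fn check_schema_unchanged(committed: &Path, regenerated: &str) -> Result<(), String> {
    let on_disk = fs::read(committed)
        .map_err(|e| format!("cannot read {}: {e}", committed.display()))?;
    match first_difference(&on_disk, regenerated.as_bytes()) {
        None => Ok(()),
        Some(offset) => Err(format!(
            "{} differs from regenerated output at byte {offset}",
            committed.display()
        )),
    }
}
```

A check like this can run the generator twice, or compare one run against the committed docs/schema/v1.0/pdftract.schema.json, and fail when any byte differs.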
Note: page_type and confidence_source enums are not yet implemented in the
Rust types (marked as TODO in schema/mod.rs). These will be added by sibling
beads pdftract-1ob and pdftract-1f8we respectively.
Closes: pdftract-5nv9h
- Configure workspace with pdftract-core, pdftract-cli, pdftract-py members
- Add workspace.package metadata: version, edition, rust-version (1.78), license (MIT OR Apache-2.0)
- Add workspace.dependencies for shared external deps (anyhow, flate2, lzw, memchr, secrecy, serde, thiserror, tracing)
- Create .cargo/config.toml with CI and development build aliases
- All member crates reference workspace metadata via workspace = true
- pdftract-py configured as cdylib with pyo3 extension-module feature
Acceptance criteria:
- PASS: 3 workspace members listed by cargo metadata
- PASS: All crates use workspace metadata references
- WARN: cargo build fails due to code compilation errors (separate concern)
Refs: pdftract-279, plan lines 3343-3367
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>