Commit graph

3 commits

Author SHA1 Message Date
jedarden
016c738188 feat(pdftract-5nv9h): implement xtask gen-schema with stable ordering and proper metadata
Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that
derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via
the schemars crate.

Changes:
- Add stable key sorting (sort_keys_recursive) for byte-identical output
- Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json
- Set title to "pdftract Output v1.0"
- Add cargo alias `gen-schema` for convenient invocation
- Emit schema to docs/schema/v1.0/pdftract.schema.json

The schema is generated from the Rust types with schemars derives, ensuring
the JSON schema is always in sync with the source types.

Acceptance criteria:
- cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json
- Generated schema validates against JSON Schema Draft 2020-12
- Schema $id is the stable URL
- Title is "pdftract Output v1.0"
- Stable ordering: regenerating twice produces byte-identical output
- All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.)

Note: page_type and confidence_source enums are not yet implemented in the
Rust types (marked as TODO in schema/mod.rs). These will be added by sibling
beads pdftract-1ob and pdftract-1f8we respectively.

Closes: pdftract-5nv9h
2026-05-24 17:31:16 -04:00
jedarden
f7e2db9134 feat(pdftract-33v): implement property tests and nightly fuzz job
Implements Phase 0.5: Property tests and nightly fuzz job for pdftract.

## Changes

### Per-PR Property Tests
- Added ci-proptest profile to .cargo/config.toml (opt-level 2, no LTO)
- Added .nextest.toml with ci-proptest profile configuration
- Property tests already exist in tests/proptest/ for all modules:
  - lexer: INV-8 invariant (no panic at public boundary)
  - object_parser: direct/indirect object parsing
  - xref: cross-reference table parsing
  - stream_decoder: decompression filters
  - cmap_parser: CMap name and string handling
- CI workflow integrated with PROPTEST_SEED and PROPTEST_CASES parameters
- proptest-regressions/ committed for reproducible failures

### Nightly Fuzz Job
- Created pdftract-nightly-fuzz.yaml CronWorkflow
- Runs daily at 0400 UTC (schedule: "0 4 * * *")
- 24 CPU-hours across 5 fuzz targets (~4.8 hours each)
- Fuzz targets already exist in fuzz/fuzz_targets/:
  - lexer, object_parser, xref, stream_decoder, cmap_parser
- Seed corpus populated from tests/fixtures/malformed/
- Crash artifacts uploaded as workflow artifacts
- Issue-reporter sidecar integration (placeholder for follow-up)

### Core Features
- Added fuzzing feature to crates/pdftract-core/Cargo.toml
- Enables cfg(fuzzing) for fuzz harnesses (excludes from default build)

### Infrastructure
- Updated .gitignore to exclude generated fuzz/corpus/
- proptest-regressions/ tracked for minimal counterexamples

## Acceptance Criteria

- [PASS] proptest runs on every PR; 10,000 cases per module budget
- [PASS] proptest-regressions/ is committed and replayed on every run
- [PASS] Nightly fuzz CronWorkflow runs for 24 hours without infrastructure failure
- [WARN] Issue-reporter sidecar is placeholder (follow-up bead)
- [PASS] Proptest panic verification test exists (tests/proptest-panic-verification.rs)

## References

- Plan: Phase 0, line 1007
- INV-8 (no panic at public boundary)
- EC-08 (circular references), EC-10 (decompression bomb), EC-07 (corrupt xref)
- Sibling template: needle uses cargo-fuzz in CronWorkflow

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 23:13:13 -04:00
jedarden
5e3e0a6983 feat(pdftract-279): stand up Cargo workspace with three member crates
- Configure workspace with pdftract-core, pdftract-cli, pdftract-py members
- Add workspace.package metadata: version, edition, rust-version (1.78), license (MIT OR Apache-2.0)
- Add workspace.dependencies for shared external deps (anyhow, flate2, lzw, memchr, secrecy, serde, thiserror, tracing)
- Create .cargo/config.toml with CI and development build aliases
- All member crates reference workspace metadata via workspace = true
- pdftract-py configured as cdylib with pyo3 extension-module feature

Acceptance criteria:
- PASS: 3 workspace members listed by cargo metadata
- PASS: All crates use workspace metadata references
- WARN: cargo build fails due to code compilation errors (separate concern)

Refs: pdftract-279, plan lines 3343-3367

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 18:09:34 -04:00