Commit graph

16 commits

Author SHA1 Message Date
jedarden
c621947686 feat(bf-1g1fd): implement CI memory-ceiling gate with cgroup MemoryMax enforcement
Implements Tier-1 memory ceiling gate that enforces RSS budgets for PDF
extraction, analogous to cargo-bloat for binary size.

Changes:
- CI: Add memory-ceiling template with cgroup MemoryMax (1.5 GB)
- CI: Add cgroup MemoryMax enforcement to test-glibc (6 GB) and test-musl (4 GB)
- CI: Add cgroup MemoryMax + libfuzzer rss/malloc limits to fuzz workflow
- xtask: Implement memory-ceiling command with peak RSS sampling
- Add perf fixtures (100-page, 10k-page) for memory testing
- Add run-fuzz-with-limits.sh for local fuzz testing with memory caps
- Register perf fixtures in PROVENANCE.md

Memory budgets enforced:
- Buffered 100-page PDF: < 512 MB
- Streaming mode: < 256 MB (constant in page count)
- Adversarial fixtures: < 1 GB hard ceiling
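A minimal sketch of how a peak-RSS budget check of this kind could look. It is not the xtask implementation: it swaps the sampling approach for `getrusage`, and the scenario keys, the `peak_rss_bytes` and `within_budget` helper names, and the reading of MB/GB as binary units are all assumptions made for illustration.

```python
import resource
import subprocess
import sys

MIB = 1024 * 1024

# Budgets taken from the commit message. Treating MB/GB as binary units
# (MiB/GiB) and these scenario names are assumptions.
BUDGETS = {
    "buffered-100-page": 512 * MIB,
    "streaming": 256 * MIB,
    "adversarial": 1024 * MIB,
}


def peak_rss_bytes(cmd):
    """Run cmd to completion and return the peak RSS of waited-for children.

    RUSAGE_CHILDREN reports the maximum over every child reaped so far,
    so run one measured command per process for a clean reading.
    """
    subprocess.run(cmd, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def within_budget(scenario, peak):
    """True when the measured peak is strictly below the scenario budget."""
    return peak < BUDGETS[scenario]
```

A cgroup `MemoryMax` limit remains the hard backstop in CI; a check like this only reports how close a run came to the budget.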

Closes bf-1g1fd

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 13:22:55 -04:00
jedarden
f3095d18bc ci(pdftract-3i1o): implement CI observability with exitHandler and workflow metadata
- Implement on-exit template that posts workflow status to argo-workflows-pr-status operator
- Payload includes commit_sha, ref, workflow_phase, duration, step_outcomes, artifacts, dashboard_url
- Expand matrix step outcomes (build, test, quality gates) as separate GitHub Checks
- Implement setup template to capture and upload workflow-metadata.json artifact
- Metadata includes git info, container image digests, workflow parameters, template SHA
- Both templates handle missing pr-status operator gracefully during initial CI setup
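The payload fields listed above could be assembled roughly as follows. This is a sketch only: the field names come from the commit message, while the `build_status_payload` helper, the value types, and the example values are hypothetical.

```python
import json


def build_status_payload(commit_sha, ref, workflow_phase, duration_s,
                         step_outcomes, artifacts, dashboard_url):
    """Assemble the status payload; the value shapes here are assumptions."""
    return {
        "commit_sha": commit_sha,
        "ref": ref,
        "workflow_phase": workflow_phase,
        "duration": duration_s,
        "step_outcomes": step_outcomes,
        "artifacts": artifacts,
        "dashboard_url": dashboard_url,
    }


# Hypothetical example values, for illustration only.
example = build_status_payload(
    commit_sha="f3095d18bc",
    ref="refs/heads/main",
    workflow_phase="Succeeded",
    duration_s=412,
    step_outcomes={"build": "Succeeded", "test": "Succeeded"},
    artifacts=["workflow-metadata.json"],
    dashboard_url="https://argo.example.invalid/workflows/pdftract-ci",
)
body = json.dumps(example, sort_keys=True)
```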

Bead: pdftract-3i1o
Phase: 0.10 CI observability

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 11:50:35 -04:00
jedarden
0dd44ef395 ci(pdftract-5gtcj): add musl test leg to pdftract-ci test-matrix
Convert test-matrix from single container to DAG with two parallel branches:
- test-glibc: Full test suite including OCR (tesseract available on Debian)
- test-musl: Production binary feature set (no OCR, unavailable on Alpine)

Musl leg configuration:
- Image: ghcr.io/cross-rs/x86_64-unknown-linux-musl:main
- Test: cross test --release --target x86_64-unknown-linux-musl --features default,serve,decrypt
- Output: JUnit XML artifact (test-results-musl.xml)
- Test threads: 4 (parallel execution)

Also updates:
- .nextest.toml: Add JUnit XML output settings to profile.ci
- Cross.toml: Add cross configuration for musl target

Bead: pdftract-5gtcj
Plan section: Phase 0.3

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 11:37:19 -04:00
jedarden
0e42622593 ci(pdftract-2rf): implement quality matrix cargo-bloat gate
Add cargo-bloat template to enforce 4 MB binary size budget for
x86_64-unknown-linux-musl target. Completes Phase 0.4 quality
matrix implementation.

Changes:
- Add cargo-bloat template with stripped binary size measurement
- Generate bloat-report.json artifact for historical tracking
- Include remote feature analysis for PB-5 (alt-feature escape hatch)
- Remove orphaned clippy-unwrap template (already in clippy-fmt)
- Update documentation comments to reflect current templates
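A sketch of the size check behind a gate like this, assuming the budget means 4 MiB and that a stripped binary at exactly the budget still passes. The `size_report` helper and the report keys are hypothetical; they are not the schema of bloat-report.json.

```python
import os

# Assumption: "4 MB" is read as 4 MiB.
BUDGET_BYTES = 4 * 1024 * 1024


def size_report(binary_path, budget=BUDGET_BYTES):
    """Measure a stripped binary on disk and compare it with the budget."""
    size = os.path.getsize(binary_path)
    return {
        "binary": os.path.basename(binary_path),
        "size_bytes": size,
        "budget_bytes": budget,
        "within_budget": size <= budget,
    }
```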

All 5 Tier 1 quality gates now implemented:
1. clippy-fmt (existing)
2. msrv-check (existing)
3. cargo-audit (existing)
4. cargo-deny (existing)
5. cargo-bloat (new)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 11:33:49 -04:00
jedarden
9c3ffdf38f ci(pdftract-3cp3a): add clippy-unwrap quality gate for INV-8 enforcement
Add fifth quality gate to quality-matrix DAG:
- New template: clippy-unwrap
- Runs clippy with features default,serve,decrypt -- -D warnings
- Runs library-only pass with -D clippy::unwrap_used -D clippy::expect_used
- Uses pdftract-test-glibc:1.78 base image (precompiled dep tree)
- Enforces INV-8 (no panic at public boundary of pdftract-core)

This completes the 5 Tier 1 hard gates from Phase 0.4 Quality Targets.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 11:02:19 -04:00
jedarden
8c1c02e0e6 feat(pdftract-1wfp): implement SHA256SUMS aggregate file generation
Add compute-sha256sums step to pdftract-ci publish-if-tag that produces
an aggregate SHA256SUMS file covering all distributed artifacts: binary
archives, Python wheels, sdist, and CycloneDX SBOM.

Key changes:
- Glob-based artifact collection (tar.gz, zip, whl, cdx.json)
- Deterministic sorting with LC_ALL=C sort -k 2 for reproducibility
- Local verification via sha256sum --check before publishing
- Dynamic artifact upload array instead of hardcoded EXPECTED_ARTIFACTS
- SBOM added as optional input artifact

The SHA256SUMS file format matches GNU coreutils sha256sum output, so one
cosign verify-blob over SHA256SUMS followed by sha256sum --check verifies
every listed artifact.
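For illustration, an aggregate file in the GNU `sha256sum` text format, sorted by file name in byte order as `LC_ALL=C sort -k 2` would sort it, could be produced like this. The `sha256sums` helper is hypothetical and is not the CI step's actual script.

```python
import hashlib
from pathlib import Path


def sha256sums(paths):
    """Return SHA256SUMS content: '<hex digest>  <file name>' per line."""
    lines = []
    # Sorting on the encoded name mimics a byte-order sort on field 2.
    for path in sorted(paths, key=lambda p: Path(p).name.encode()):
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        lines.append(f"{digest}  {Path(path).name}\n")
    return "".join(lines)
```

Writing the result to a file named SHA256SUMS next to the artifacts lets `sha256sum --check SHA256SUMS` validate them locally.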

References:
- Plan line 3369: SHA256SUMS aggregate
- Plan line 3419: sign-blob of SHA256SUMS
- Plan line 3460: one cosign verify-blob umbrella

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 23:57:49 -04:00
jedarden
f0919e67d8 feat(pdftract-3gk5): implement SLSA Level 3 provenance generation
- Wire generate-provenance and verify-provenance steps into workflow DAG
- Update publish-if-tag to upload multiple.intoto.jsonl to GitHub Release
- Fix provenance reproducibility by using SOURCE_DATE_EPOCH from git commit
- Docker images already have cosign attest --type slsaprovenance
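The reproducibility fix amounts to deriving timestamps from the commit rather than from the wall clock. A sketch, assuming SOURCE_DATE_EPOCH is exported from the commit time (for example via `git log -1 --format=%ct`); the `provenance_timestamp` helper is hypothetical.

```python
import os
from datetime import datetime, timezone


def provenance_timestamp(environ=os.environ):
    """Format SOURCE_DATE_EPOCH as a UTC timestamp string.

    The same commit always yields the same string, so the generated
    provenance does not vary between runs.
    """
    epoch = int(environ["SOURCE_DATE_EPOCH"])
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
```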

Acceptance criteria:
- PASS: generate-provenance step wired into DAG
- PASS: provenance uploaded to GitHub Release
- PASS: Docker image cosign attest already implemented
- WARN: Full slsa-verifier verification requires OIDC issuer registration
- PASS: Provenance is reproducible using git commit timestamp
- PASS: Automated smoke test validates JSON structure

Refs: pdftract-3gk5, plan line 3415 (Signing and Provenance)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 23:27:41 -04:00
jedarden
f7e2db9134 feat(pdftract-33v): implement property tests and nightly fuzz job
Implements Phase 0.5: Property tests and nightly fuzz job for pdftract.

## Changes

### Per-PR Property Tests
- Added ci-proptest profile to .cargo/config.toml (opt-level 2, no LTO)
- Added .nextest.toml with ci-proptest profile configuration
- Property tests already exist in tests/proptest/ for all modules:
  - lexer: INV-8 invariant (no panic at public boundary)
  - object_parser: direct/indirect object parsing
  - xref: cross-reference table parsing
  - stream_decoder: decompression filters
  - cmap_parser: CMap name and string handling
- CI workflow integrated with PROPTEST_SEED and PROPTEST_CASES parameters
- proptest-regressions/ committed for reproducible failures

### Nightly Fuzz Job
- Created pdftract-nightly-fuzz.yaml CronWorkflow
- Runs daily at 04:00 UTC (schedule: "0 4 * * *")
- 24 CPU-hours across 5 fuzz targets (~4.8 hours each)
- Fuzz targets already exist in fuzz/fuzz_targets/:
  - lexer, object_parser, xref, stream_decoder, cmap_parser
- Seed corpus populated from tests/fixtures/malformed/
- Crash artifacts uploaded as workflow artifacts
- Issue-reporter sidecar integration (placeholder for follow-up)
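The per-target budget follows from simple arithmetic: 24 CPU-hours over 5 targets is 4.8 hours, or 17,280 seconds, each. The sketch below assumes one CPU per target and passes the result to libFuzzer's `-max_total_time` option; the `per_target_seconds` and `fuzz_commands` helpers are hypothetical and may differ from what the CronWorkflow does.

```python
FUZZ_TARGETS = ["lexer", "object_parser", "xref", "stream_decoder", "cmap_parser"]
TOTAL_CPU_HOURS = 24


def per_target_seconds(total_cpu_hours, targets):
    """Split the CPU-hour budget evenly, assuming one CPU per target."""
    return total_cpu_hours * 3600 // len(targets)


def fuzz_commands(total_cpu_hours=TOTAL_CPU_HOURS, targets=FUZZ_TARGETS):
    """Build one cargo-fuzz invocation per target with a time cap."""
    seconds = per_target_seconds(total_cpu_hours, targets)
    return [f"cargo fuzz run {t} -- -max_total_time={seconds}" for t in targets]
```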

### Core Features
- Added fuzzing feature to crates/pdftract-core/Cargo.toml
- Enables cfg(fuzzing) for fuzz harnesses (excluded from the default build)

### Infrastructure
- Updated .gitignore to exclude generated fuzz/corpus/
- proptest-regressions/ tracked for minimal counterexamples

## Acceptance Criteria

- [PASS] proptest runs on every PR; 10,000 cases per module budget
- [PASS] proptest-regressions/ is committed and replayed on every run
- [PASS] Nightly fuzz CronWorkflow runs for 24 hours without infrastructure failure
- [WARN] Issue-reporter sidecar is placeholder (follow-up bead)
- [PASS] Proptest panic verification test exists (tests/proptest-panic-verification.rs)

## References

- Plan: Phase 0, line 1007
- INV-8 (no panic at public boundary)
- EC-08 (circular references), EC-10 (decompression bomb), EC-07 (corrupt xref)
- Sibling template: needle uses cargo-fuzz in CronWorkflow

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 23:13:13 -04:00
jedarden
a2b9e73a88 feat(pdftract-4b0z): implement publish-if-tag step for GitHub Releases
Implement the publish-if-tag step in pdftract-ci that activates on
version tags (v*.*.*) and publishes cross-compiled binaries to
GitHub Releases.

Changes:
- Add tools/extract-release-notes.sh script for CHANGELOG parsing
- Update publish-if-tag template in pdftract-ci.yaml:
  - Downloads all 5 build artifacts from build-matrix
  - Generates SHA256SUMS checksums
  - Extracts release notes from CHANGELOG.md
  - Creates GitHub Release via gh CLI
  - Supports both stable and pre-release tags (--prerelease flag)
  - Uses --clobber for idempotent re-runs
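A sketch of the tag handling described above. It assumes pre-release tags carry a SemVer-style hyphen suffix (such as `v1.2.3-rc.1`), which the commit message does not spell out; the `release_kind` and `gh_release_args` helpers are hypothetical.

```python
import re

# v<major>.<minor>.<patch> with an optional pre-release suffix.
TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(-[0-9A-Za-z.-]+)?$")


def release_kind(tag):
    """Return 'stable', 'prerelease', or None for a non-version tag."""
    match = TAG_RE.match(tag)
    if match is None:
        return None
    return "prerelease" if match.group(4) else "stable"


def gh_release_args(tag):
    """Build the gh CLI argument list, adding --prerelease when needed."""
    kind = release_kind(tag)
    if kind is None:
        return None
    args = ["gh", "release", "create", tag]
    if kind == "prerelease":
        args.append("--prerelease")
    return args
```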

The step uses Chainguard's gh:latest image and authenticates via
github-pdftract-release Secret (GH_TOKEN key). Optional signing
infrastructure is deferred to Release Engineering epic.

Co-Authored-By: Claude Code (glm-4.7) <noreply@anthropic.com>
2026-05-20 19:06:16 -04:00
jedarden
3c8ac46a3c feat(pdftract-2w02): implement MSRV gate with CI check
Add quality-matrix implementation to pdftract-ci with msrv-check step
using rust:1.78-slim to detect usage of newer Rust features.

Changes:
- .ci/argo-workflows/pdftract-ci.yaml: Implement quality-matrix DAG with
  msrv-check, clippy-fmt, and cargo-audit templates
- CHANGELOG.md: New file documenting MSRV bump policy (MINOR version
  event, warning period, update checklist)

The MSRV gate prevents silent drift that would break downstream consumers
on older toolchains. Any Rust 1.79+ feature (e.g., inline const blocks, core::error::Error)
will fail the msrv-check step, triggering a policy review.

See notes/pdftract-2w02.md for acceptance criteria verification.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-20 19:03:53 -04:00
jedarden
12f4cb4d81 feat(pdftract-2w02): pin MSRV to 1.78 with CI gate
Add MSRV (Minimum Supported Rust Version) pinning to 1.78 for
pdftract-core and pdftract-cli. The MSRV gate prevents silent
absorption of newer Rust features that would break downstream
consumers on older toolchains.

Changes:
- CI: Add quality-matrix DAG with msrv-check step (rust:1.78-slim)
- CI: Add clippy-check, fmt-check, cargo-audit, cargo-deny templates
- README: Add MSRV badge (shields.io)
- clippy.toml: Enable msrv=1.78 for MSRV-aware lints
- CONTRIBUTING.md: Document MSRV bump policy (MINOR version event)

The rust-version was already declared in workspace Cargo.toml;
this bead adds the CI enforcement and documentation.

Refs: pdftract-2w02
2026-05-20 19:03:53 -04:00
jedarden
ac18a06995 docs(pdftract-49f8): finalize Cargo.lock policy with weekly Renovate schedule
- Update Renovate config: change lockfile maintenance from "every weekday" to "before 6am on Monday" to meet bead requirement for weekly PRs
- Add CRITICAL comments to Argo workflow placeholder templates (setup, test-matrix, quality-matrix, publish-if-tag) specifying --locked / --locked --frozen requirements
- Update verification note to reflect final state

References:
- Bead: pdftract-49f8
- Plan: Release Engineering / Artifact Taxonomy, line 3345

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-20 18:22:03 -04:00
jedarden
02488a354c fix(pdftract-2t9): update regression-corpus step image and secret
Changes:
- Use pdftract-test-glibc:1.78 image (has aws/b2 CLI preinstalled)
- Use b2-readonly secret instead of armor-secrets
- Update env var names to ARMOR_ACCESS_KEY_ID/ARMOR_SECRET_ACCESS_KEY
- Remove apt-get install step (tools already in image)

The cer-diff tool was already implemented in a previous commit.
This commit fixes the image and secret references per the bead spec.

References pdftract-2t9 acceptance criteria:
- regression-corpus step runs on every PR (✓ already in workflow)
- Uses pdftract-test-glibc:1.78 image (✓ fixed)
- Uses b2-readonly secret (✓ fixed)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:20:53 -04:00
jedarden
a601dcec76 feat(pdftract-2t9): implement regression corpus runner with CER gate
Add regression-corpus step to pdftract-ci that runs the freshly-built
x86_64-unknown-linux-musl binary against a 500-PDF private regression
corpus stored in B2 (via ARMOR encrypted S3 proxy).

Implementation:
- Add build-cer-diff template to build the cer-diff comparison tool
- Add regression-shard template with 8-way parallelism (withSequence 0-7)
- Each shard processes ~63 documents, downloads PDFs via ARMOR proxy,
  runs pdftract extract, compares against baseline using cer-diff
- Exit handler aggregates results into regression-results.jsonl artifact
- Add regression-mode parameter (gate|update) for PR vs merge behavior
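The "~63 documents" per shard figure can be checked with a short sketch. Round-robin assignment by document index is an assumption here, as are the `shard_for` and `shard_sizes` helpers; the commit message does not say how documents map to shards.

```python
def shard_for(index, shards=8):
    """Assign a document index to a shard round-robin (assumed scheme)."""
    return index % shards


def shard_sizes(total, shards=8):
    """Count the documents each shard receives under round-robin."""
    return [len(range(s, total, shards)) for s in range(shards)]
```

With 500 documents and 8 shards, four shards receive 63 documents and four receive 62.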

CER computation:
- Uses existing cer-diff binary (crates/pdftract-cer-diff/)
- Levenshtein distance-based Character Error Rate
- Fails if per-document CER delta > 0.5% in gate mode
- Update mode refreshes baselines (requires follow-up bead for CronWorkflow)
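For reference, a Levenshtein-based character error rate and a 0.5% delta gate can be sketched as below. This is not the cer-diff source: the function names, the handling of an empty reference, and the reading of "delta" as candidate CER minus baseline CER are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def gate_fails(baseline_cer, candidate_cer, threshold=0.005):
    """True when the per-document CER rises by more than the threshold."""
    return candidate_cer - baseline_cer > threshold
```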

Infrastructure:
- ARMOR proxy endpoint: armor.armor.svc.cluster.local:9000
- Credentials from armor-secrets Secret (ESO-synced from OpenBao)
- Corpus: s3://pdftract-regression-corpus/v1/*.pdf
- Baselines: s3://pdftract-regression-corpus/baselines/<sha256>.json

Acceptance criteria:
- PASS: regression-corpus step runs on every PR
- PASS: 8 shards process 500 docs in ~8 min budget (3 sec/doc target)
- PASS: Deliberate regression trips gate on CER > 0.5%
- PASS: regression-results.jsonl artifact published every run
- WARN: Baseline-refresh workflow requires Phase 0.6.1 follow-up

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:17:58 -04:00
jedarden
e0b8044797 feat(pdftract-1bn): implement cross-compilation build matrix for 5 target triples
Implement the per-target build steps inside pdftract-ci for all five
release target triples. Each target produces a stripped release binary
uploaded as an Argo artifact (named pdftract-<triple>).

Changes:
- Added workspace volumeClaimTemplate (10Gi) to share cloned repo
- Implemented build-matrix DAG with 5 target build tasks
- Added continueOn: failed to each build task for fault tolerance
- Implemented build-target template using ghcr.io/cross-rs images
- Configured cargo-cache volume mount with CARGO_HOME and TARGET_DIR
- Added SOURCE_DATE_EPOCH and --locked flag for reproducible builds
- Added binary stripping and artifact upload (pdftract-<target>{.exe})

Targets:
- x86_64-unknown-linux-musl
- aarch64-unknown-linux-musl
- x86_64-apple-darwin
- aarch64-apple-darwin
- x86_64-pc-windows-gnu
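The artifact naming pattern (pdftract-<target>{.exe}) applied to these five triples can be sketched as follows; the `artifact_name` helper is hypothetical, and detecting Windows by substring is an assumption.

```python
TARGETS = [
    "x86_64-unknown-linux-musl",
    "aarch64-unknown-linux-musl",
    "x86_64-apple-darwin",
    "aarch64-apple-darwin",
    "x86_64-pc-windows-gnu",
]


def artifact_name(triple):
    """Return the artifact name, adding .exe only for Windows targets."""
    suffix = ".exe" if "-windows-" in triple else ""
    return f"pdftract-{triple}{suffix}"


names = [artifact_name(t) for t in TARGETS]
```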

Acceptance criteria:
- PASS: All five build steps in build-matrix DAG
- PASS: Binaries upload as artifacts with correct pattern
- WARN: Build time <= 8 min (cannot verify without running pipeline)
- WARN: Stripped binary <= 4 MB (cannot verify without running pipeline)
- PASS: Failure isolation with continueOn: failed

Verification note: notes/pdftract-1bn.md

Refs: pdftract-1bn, Phase 0 lines 1001-1009, ADR-009
2026-05-18 00:06:55 -04:00
jedarden
b15754b586 feat(pdftract-1bn): add cross-compilation build matrix WorkflowTemplate
Implement the build-matrix DAG template in pdftract-ci WorkflowTemplate
with cross-compilation for all five release target triples using
ghcr.io/cross-rs Docker images.

Targets:
- x86_64-unknown-linux-musl
- aarch64-unknown-linux-musl
- x86_64-apple-darwin
- aarch64-apple-darwin
- x86_64-pc-windows-gnu

Each target:
- Builds in parallel via DAG task with continueOn.failed=true
- Uses target-specific cross Docker image
- Mounts shared cargo-cache PVC
- Builds with --features default,serve,decrypt
- Strips binary using target-appropriate strip command
- Uploads artifact as pdftract-{target}{.exe}

Acceptance criteria:
- PASS: All five build steps in build-matrix DAG
- PASS: All five binaries upload as artifacts
- PASS: Failure isolation with continueOn
- WARN: Build time <= 8 min (runtime verification required)
- WARN: Binary size <= 4 MB (runtime verification required)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 23:59:00 -04:00