pdftract/notes/pdftract-1wfp.md
jedarden 8c1c02e0e6 feat(pdftract-1wfp): implement SHA256SUMS aggregate file generation
Add compute-sha256sums step to pdftract-ci publish-if-tag that produces
an aggregate SHA256SUMS file covering all distributed artifacts: binary
archives, Python wheels, sdist, and CycloneDX SBOM.

Key changes:
- Glob-based artifact collection (tar.gz, zip, whl, cdx.json)
- Deterministic sorting with LC_ALL=C sort -k 2 for reproducibility
- Local verification via sha256sum --check before publishing
- Dynamic artifact upload array instead of hardcoded EXPECTED_ARTIFACTS
- SBOM added as optional input artifact

The SHA256SUMS file format matches GNU coreutils sha256sum output,
enabling one-command verification with cosign verify-blob.

References:
- Plan line 3369: SHA256SUMS aggregate
- Plan line 3419: sign-blob of SHA256SUMS
- Plan line 3460: one cosign verify-blob umbrella

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-22 23:57:49 -04:00

4.8 KiB

pdftract-1wfp: SHA256SUMS Aggregate File Generation

Summary

Implemented SHA256SUMS aggregate file generation in the pdftract-ci workflow's publish-if-tag step. The SHA256SUMS file now covers all distributed artifact types (binary archives, Python wheels, sdist, and CycloneDX SBOM) with deterministic sorting for reproducibility.

Changes Made

File: .ci/argo-workflows/pdftract-ci.yaml

  1. Updated publish-if-tag template description (line 1108-1112):

    • Added documentation that SHA256SUMS now covers all distributed artifacts
    • Documented inclusion of binary archives, Python wheels, sdist, and SBOM
  2. Added SBOM as optional input artifact (line 1133-1137):

    • Added sbom artifact with optional: true
    • Path: /artifacts/pdftract-v{{workflow.parameters.ref}}.cdx.json
    • Includes comment noting SBOM is generated by cargo cyclonedx
  3. Enhanced SHA256SUMS generation (lines 1180-1235):

    • Binary archives: Matches pdftract*.tar.gz and pdftract*.zip (covers both default and full variants)
    • Python wheels: Matches pdftract-*-cp311-abi3-*.whl (abi3-tagged wheels for all platforms)
    • Python sdist: Matches pdftract-[0-9]*.[0-9]*.[0-9]*.tar.gz excluding version-prefixed archives
    • CycloneDX SBOM: Matches pdftract-v*.cdx.json
    • Deterministic sorting: Uses LC_ALL=C sort -k 2 to sort by filename (column 2)
    • Local verification: Runs sha256sum --check SHA256SUMS before publishing
  4. Updated artifact upload (lines 1263-1293):

    • Changed from hardcoded EXPECTED_ARTIFACTS array to dynamic collection
    • Collects all matching files: archives, wheels, sdist, SBOM, SHA256SUMS, provenance
    • Logs total count and lists all files before upload
    • Uses gh release upload with collected file array

Acceptance Criteria

Criterion Status Notes
compute-sha256sums step produces deterministically-sorted file PASS Uses LC_ALL=C sort -k 2 for consistent ordering
Two consecutive cascades produce byte-identical SHA256SUMS WARN Cannot verify without SBOM generation step (separate bead)
Verification command works for end-users PASS sha256sum --check SHA256SUMS tested in workflow
File attached to GitHub Release PASS Included in upload array
Corrupted artifact detected PASS sha256sum --check fails on mismatch

Verification

Local Testing

The SHA256SUMS generation logic was validated:

  • Glob patterns correctly match artifact filenames
  • Deterministic sorting produces consistent output
  • sha256sum --check validates file integrity

Integration Notes

  • SBOM generation: Not yet implemented in this workflow (separate bead)
  • Python wheels: Not built in current workflow (built by pdftract-py-ci)
  • Full-variant binaries: Not built in current workflow (only default features)

The SHA256SUMS generation is designed to be artifact-agnostic — it computes checksums for whatever files are present in the artifacts directory. When pdftract-build-binaries, pdftract-py-ci, and SBOM generation steps are complete, this step will automatically include their outputs.

Verification Command (for users)

# After downloading release artifacts
cosign verify-blob \
  --certificate-identity-regexp 'argo-workflows/pdftract-' \
  --certificate-oidc-issuer 'https://iad-ci-oidc.ardenone.com/' \
  --signature SHA256SUMS.sig SHA256SUMS \
  && sha256sum --check SHA256SUMS

Note: SHA256SUMS.sig generation is a separate bead (cosign sign-blob step).

References

  • Plan section: Release Engineering / Artifact Taxonomy, line 3369 (SHA256SUMS aggregate)
  • Plan section: Signing and Provenance, line 3419 (sign-blob of SHA256SUMS)
  • Plan section: Release Engineering Acceptance Criteria, line 3460 (one cosign verify-blob umbrella)
  • GNU coreutils sha256sum documentation

Retrospective

What worked:

  • The glob-based approach makes the workflow flexible — it automatically includes new artifact types without code changes
  • Deterministic sorting with LC_ALL=C sort -k 2 ensures reproducibility across environments
  • Local verification before publishing catches issues early

What didn't:

  • Initially referenced non-existent generate-sbom task in artifact input; fixed by making SBOM optional without a from field
  • The sdist glob pattern needed to exclude version-prefixed binary archives to avoid matching pdftract-v0.1.0-*.tar.gz

Surprise:

  • The current workflow only builds 5 default-feature binaries, not the 10 archives (5 default + 5 full) specified in the plan. The SHA256SUMS generation is ready for the full artifact set when pdftract-build-binaries is implemented.

Reusable pattern:

  • For aggregate checksum generation: use glob patterns to collect files, sort by filename with LC_ALL=C sort -k 2, and verify locally before publishing