# Contributing to pdftract

Thank you for your interest in contributing to pdftract! This document covers the essential workflows for contributors.

## Licensing and Sign-off

pdftract is dual-licensed under **MIT OR Apache-2.0**. You may choose either license for your use.

### Developer Certificate of Origin (DCO)

This project requires a **Developer Certificate of Origin (DCO)** sign-off on all commits. This certifies that you wrote the code or have the right to pass it on as open-source.

**To sign your commits, use `git commit --signoff` (or `git commit -s`):**

```bash
git commit -s -m "feat: add some feature"
# The "Signed-off-by" trailer is added automatically
```

**No CLA is required.** The DCO is sufficient for this permissive-license project.

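To illustrate what the sign-off adds to a commit, the sketch below checks a commit message for a `Signed-off-by:` trailer. The helper name `has_signoff` and the "Name <email>" shape test are assumptions made for this example; it does not describe how the project's CI validates the DCO.

```rust
/// Returns true if any line of the commit message is a `Signed-off-by:`
/// trailer followed by something shaped like "Name <email>".
/// Hypothetical helper for illustration only.
fn has_signoff(message: &str) -> bool {
    message.lines().any(|line| {
        line.strip_prefix("Signed-off-by:")
            .map(|rest| {
                let rest = rest.trim();
                // Require an address in angle brackets at the end of the trailer.
                rest.contains('<') && rest.ends_with('>')
            })
            .unwrap_or(false)
    })
}

fn main() {
    let signed = "feat: add some feature\n\nSigned-off-by: Jane Doe <jane@example.com>\n";
    let unsigned = "feat: add some feature\n";
    println!("signed commit passes: {}", has_signoff(signed));
    println!("unsigned commit passes: {}", has_signoff(unsigned));
}
```

A message produced by `git commit -s` would pass this check, while one committed without `-s` would not.
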
### Apache NOTICE File

The Apache-2.0 license includes a NOTICE file requirement, but pdftract does not ship a NOTICE file in the source distribution. This is intentional: the project maintains no contributor list outside of git history, and there are no third-party attribution notices required.

**Downstream redistributors MAY add a NOTICE file** when distributing pdftract as part of their own product. If you choose to add one, it should include:
- Attribution to the pdftract project
- A link to the original source repository
- Any modifications you made (if distributing a modified version)

The absence of a NOTICE file in the upstream distribution does not violate the Apache-2.0 license; the NOTICE requirement applies only when there is something to notice.

## Code of Conduct

This project adopts the [Contributor Covenant v2.1](CODE_OF_CONDUCT.md). All contributors are expected to uphold this code of conduct.

## Reporting Security Issues

If you discover a security vulnerability, please do **NOT** open a public issue or pull request. Instead, report it privately:

1. **Email (preferred):** [security@jedarden.com](mailto:security@jedarden.com)
   - PGP-encrypted emails are strongly encouraged
   - PGP key: [`docs/security/pgp-public-key.asc`](docs/security/pgp-public-key.asc)

2. **GitHub Private Vulnerability Reporting:**
   - Use the [Security tab](https://github.com/jedarden/pdftract/security/advisories)

See [`SECURITY.md`](SECURITY.md) for our full disclosure policy, including:
- Supported versions and security fix timeline
- 90-day disclosure window
- CVE assignment process
- Safe harbor for good-faith researchers

## Development Setup

### Prerequisites

- **Rust 1.78 or later**: see [Minimum Supported Rust Version (MSRV)](#minimum-supported-rust-version-msrv) below
- **Git**: for cloning and committing

### OCR Feature Dependencies (Optional)

If you're developing OCR-related features (Phase 5), you'll need additional dependencies:

**Linux (Debian/Ubuntu):**
```bash
sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr-eng
```

**macOS:**
```bash
brew install tesseract leptonica
```

**Windows:**
- Install Tesseract from the official installers: https://github.com/UB-Mannheim/tesseract/wiki

### Building

```bash
# Clone your fork
git clone https://github.com/YOUR_USERNAME/pdftract.git
cd pdftract

# Build the workspace
cargo build --workspace --locked

# Build release binaries
cargo build --release --workspace
```

### Testing

```bash
# Run all tests
cargo test --workspace --features default

# Run tests with output
cargo test --workspace --features default -- --nocapture

# Run a specific test
cargo test --workspace --features default test_name
```

## Minimum Supported Rust Version (MSRV)

The **Minimum Supported Rust Version (MSRV)** for pdftract is **1.78**. This is the oldest Rust version that can successfully build the project. The MSRV is declared in `Cargo.toml` via the `rust-version` field and enforced in CI.

### MSRV Policy

- **MSRV is 1.78** for the public crates (`pdftract-core`, `pdftract-cli`)
- **Bumping MSRV is a MINOR version event**: it requires at least one release of warning in the changelog
- **Never bump MSRV in a PATCH release**: this breaks downstream consumers without notice
- **CI enforces MSRV**: the `msrv-check` step builds with `rust:1.78-slim` and fails if newer Rust features are used

### When bumping MSRV

If you need to use a Rust feature newer than 1.78:

1. **Open an issue or ADR** documenting the required feature and why it's necessary
2. **Update all locations**:
   - Root `Cargo.toml`: `[workspace.package] rust-version`
   - CI workflow: `rust:` image tag in the `msrv-check` step
   - README: MSRV badge
   - `clippy.toml`: `msrv` setting
3. **Add a CHANGELOG entry** announcing the bump with at least one release of warning
4. **Wait for the next MINOR release**: never include the bump in a PATCH

### Code review guidelines

- **New dependencies** whose declared MSRV exceeds 1.78 are rejected at code-review time
- The `msrv-check` CI step catches most MSRV violations automatically
- Reviewers should verify that new code doesn't use features stabilized after 1.78 (e.g., inline `const { ... }` blocks from 1.79, `std::sync::LazyLock` from 1.80, `core::error::Error` from 1.81). `let-else` (stable since 1.65) and async fn in traits (stable since 1.75) predate the MSRV and are fine to use

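As an example of staying within the MSRV, the sketch below initialises a static lazily with `std::sync::OnceLock` (stable since Rust 1.70) instead of `std::sync::LazyLock` (stable since Rust 1.80, so above 1.78). The function name `known_filters` and the table contents are made up for illustration and are not taken from the pdftract codebase.

```rust
use std::sync::OnceLock;

/// Lazily initialised lookup table that builds on Rust 1.78.
/// `LazyLock` would be shorter, but it was only stabilised in 1.80.
/// Hypothetical example data, not pdftract's real table.
fn known_filters() -> &'static Vec<&'static str> {
    static FILTERS: OnceLock<Vec<&'static str>> = OnceLock::new();
    // The closure runs at most once; later calls return the same allocation.
    FILTERS.get_or_init(|| vec!["FlateDecode", "LZWDecode", "DCTDecode"])
}

fn main() {
    println!("{} filters registered", known_filters().len());
}
```

The accessor-function pattern keeps the static private and gives callers the same ergonomics as a lazily initialised global.
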
## Lockfile Policy

pdftract uses a workspace-level `Cargo.lock` file that is **checked into version control**. This is intentional: release reproducibility requires that every build from the same commit produces byte-identical artifacts. All CI steps run with `--locked --frozen` to enforce this.

### Updating Dependencies

When adding or updating dependencies:

1. **Targeted updates (preferred):** Update a specific crate and its dependencies:
   ```bash
   cargo update -p crate-name
   ```

2. **Full updates:** Only during release preparation:
   ```bash
   cargo update
   ```

3. **Commit the lockfile:** Always commit `Cargo.lock` alongside any `Cargo.toml` changes, using a Conventional Commits type from the list in this document:
   ```bash
   git add Cargo.toml Cargo.lock
   git commit -m "chore(deps): upgrade crate-name to X.Y.Z"
   ```

### CI Enforcement

- The `pdftract-ci` Argo workflow runs `cargo check --locked --frozen` as the first step.
- A PR whose `Cargo.toml` changes require a lockfile update (for example, a new or bumped dependency) but that leaves `Cargo.lock` unchanged will fail CI.
- Two consecutive builds of `pdftract-build-binaries` against the same tag must produce identical binaries (verified by SHA256 comparison).

### Why Library Crates Have Cargo.lock

The long-standing Rust ecosystem convention has been that library crates do not check in `Cargo.lock`, so that downstream consumers resolve their own dependency versions. pdftract departs from this convention because:

- **Release reproducibility** is paramount for SLSA Level 3 provenance.
- The workspace produces both libraries (`pdftract-core`) and binaries (`pdftract-cli`, `pdftract-py`).
- A single workspace-level `Cargo.lock` applies to all members.
- Downstream consumers are unaffected: Cargo ignores a dependency's `Cargo.lock` and resolves versions against the consumer's own lockfile.

## Local Validation Before Opening a PR

Before submitting a pull request, please run the following commands locally to ensure your changes pass all quality gates:

```bash
# 1. Run all tests (must be all green)
cargo test --workspace --features default

# 2. Lint with clippy (no warnings allowed)
cargo clippy --all-targets --features default -- -D warnings

# 3. Check binary size (must be within budget; target <= 4 MB stripped)
cargo bloat --release --features default

# 4. Check for security advisories (no medium+ issues)
cargo audit

# 5. Check license compliance (no rejected licenses)
cargo deny check licenses

# 6. Check code formatting
cargo fmt --check
```

**Why these checks?** These exact commands are run in the `pdftract-ci` Argo workflow. A green local run predicts a green CI run, reducing review iteration cycles.

### Binary Size Budget

The release binary must be <= 4 MB when stripped. `cargo bloat` helps identify the functions that contribute most to binary size. If your PR adds significant code:
- Run `cargo bloat --release --features default -p pdftract-cli` (add `--crates` for a per-crate breakdown; `--crates` is a flag and takes no value)
- Check the top functions in the output
- Consider whether large dependencies can be made optional or feature-gated

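The budget check itself can be sketched in a few lines. This is a minimal example under stated assumptions: "4 MB" is read as 4 × 1024 × 1024 bytes, the default path `target/release/pdftract` is a guess at the artifact name, and the helpers `within_budget` and `check_binary` are hypothetical rather than part of the project's tooling.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Size budget from the text, interpreted as 4 MiB (assumption).
const BUDGET_BYTES: u64 = 4 * 1024 * 1024;

/// True when a size in bytes fits the budget (inclusive).
fn within_budget(size_bytes: u64) -> bool {
    size_bytes <= BUDGET_BYTES
}

/// Reads the on-disk size of `path` and compares it with the budget.
/// The file is expected to be stripped already; this does not strip it.
fn check_binary(path: &Path) -> io::Result<bool> {
    Ok(within_budget(fs::metadata(path)?.len()))
}

fn main() {
    // Hypothetical default path; pass the real artifact path as the first argument.
    let path = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "target/release/pdftract".to_string());
    match check_binary(Path::new(&path)) {
        Ok(true) => println!("{path}: within budget"),
        Ok(false) => println!("{path}: over budget"),
        Err(err) => println!("{path}: cannot read size ({err})"),
    }
}
```

A check like this only reports the total; `cargo bloat` is still what tells you where the bytes come from.
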
## CI on Forks: The Argo-CI Caveat

> **IMPORTANT:** Because CI runs on the private `iad-ci` cluster, external contributors cannot trigger CI from their fork.

### How It Works

1. **Fork and open a pull request** against `jedarden/pdftract:main`
2. **A maintainer will trigger the `pdftract-ci` Argo workflow** against your branch
3. **Results are posted as a PR comment** once the workflow completes

### Expected Response Time

- Maintainer-triggered CI: **within 48 hours**
- You'll receive a comment on your PR with the full CI log

### Why This Model?

The `iad-ci` cluster is a private Rackspace Spot cluster accessed via kubectl-proxy over Tailscale. External forks do not have credentials to access this cluster, so they cannot self-serve CI runs. This is unusual, but it allows us to run CI on infrastructure we control without exposing cluster credentials publicly.

### Local Validation is Critical

Since you cannot trigger CI yourself, **please run the full local validation checklist** before opening your PR. This minimizes back-and-forth cycles when the maintainer-triggered CI fails.

## Pull Request Template

All pull requests must follow the [PR template](.github/PULL_REQUEST_TEMPLATE.md). The template requires:

- **Linked issue or RFC**: every PR should reference an issue or design document
- **Scope statement**: which Phase / which Acceptance Scenario does this address?
- **Test plan**: how did you verify this works?
- **Manual-test evidence**: screenshots, terminal output, or example runs
- **Performance impact**: if hot-path code was touched, include benchmark results

## Commit Message Style

This project uses **Conventional Commits** for commit messages. Release notes are auto-generated from commit history using `git-cliff`.

### Format

```
<type>(<scope>): <short summary>

[optional body]

[optional footer]
```

### Types

- `feat:` a new feature
- `fix:` a bug fix
- `perf:` a performance improvement
- `docs:` documentation changes
- `chore:` maintenance tasks (updates, refactoring, tooling)
- `test:` test changes
- `BREAKING CHANGE:` not a type but a footer token; put it in the commit footer (or append `!` after the type/scope) to mark a breaking change

### Examples

```
feat(ocr): add Tesseract integration for phase 5
fix(font): handle missing /Widths in Type 3 fonts
perf(extract): cache page tree parsing results
docs(contributing): add Argo-CI caveat section
chore(deps): upgrade lodepng to 0.9.0
```

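The header line of the format above is simple enough to validate mechanically. The following sketch parses `<type>(<scope>): <short summary>` and accepts only the types listed in this document; the `Header` struct and `parse_header` function are hypothetical names for illustration, and the sketch is not how `git-cliff` parses commits.

```rust
/// Parsed first line of a commit message (illustrative type).
#[derive(Debug, PartialEq)]
struct Header<'a> {
    kind: &'a str,
    scope: Option<&'a str>,
    summary: &'a str,
}

/// Commit types listed in this document.
const TYPES: [&str; 6] = ["feat", "fix", "perf", "docs", "chore", "test"];

/// Parses `<type>(<scope>): <summary>`; the scope is optional.
/// Returns None for unknown types, empty scopes, or empty summaries.
fn parse_header(line: &str) -> Option<Header<'_>> {
    let (prefix, summary) = line.split_once(": ")?;
    let (kind, scope) = match prefix.split_once('(') {
        // A scope must be closed by ')' directly before the colon.
        Some((kind, rest)) => (kind, Some(rest.strip_suffix(')')?)),
        None => (prefix, None),
    };
    if !TYPES.contains(&kind) || summary.trim().is_empty() {
        return None;
    }
    if scope == Some("") {
        return None;
    }
    Some(Header { kind, scope, summary })
}

fn main() {
    for line in ["feat(ocr): add Tesseract integration for phase 5", "update stuff"] {
        println!("{line:?} -> {:?}", parse_header(line));
    }
}
```

Breaking-change markers are deliberately left out of this sketch to keep it short.
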
## Issue Triage

We use issue templates to ensure all necessary information is provided upfront. When opening an issue, please use the appropriate template:

- **Bug report**: must include `pdftract doctor` output
- **Feature request**: describe the use case and proposed API
- **Performance regression**: include before/after benchmarks
- **Security advisory**: redirects to private disclosure (see [Reporting Security Issues](#reporting-security-issues))

See [`.github/ISSUE_TEMPLATE/`](.github/ISSUE_TEMPLATE/) for the full list.

## Security Policy: NEVER-Log Secrets

**Critical:** pdftract enforces a strict **NEVER-log secrets** policy to prevent credential leakage in logs, crash dumps, and SIEM systems.

### Forbidden Patterns

The following content MUST NEVER appear in logs at any level (trace, debug, info, warn, error):

1. **Credential values:**
   - Passwords, API keys, bearer tokens, session IDs
   - `SecretString` inner values (use `secrecy::SecretString` for all credentials)
   - Auth tokens for MCP, HTTP sources, or any external service

2. **PDF bytes and extracted text:**
   - Raw PDF stream data (compressed or uncompressed)
   - Extracted text content (may contain sensitive documents)
   - Image data (embedded images may contain sensitive information)

3. **HTTP headers:**
   - `Authorization`, `Cookie`, `Proxy-Authorization` header values
   - Use `redact_headers_for_log()` for any request logging

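To make the header rule concrete, here is a minimal sketch of header redaction using only the standard library. The function `redact_headers` and its signature are assumptions for this example; it is a stand-in and not the project's actual `redact_headers_for_log()`, whose signature is not shown in this document.

```rust
/// Header names whose values must never reach the logs (from the list above).
const SENSITIVE: [&str; 3] = ["authorization", "cookie", "proxy-authorization"];

/// Returns a copy of the headers that is safe to log: names are kept,
/// sensitive values are replaced. Hypothetical helper for illustration.
fn redact_headers(headers: &[(&str, &str)]) -> Vec<(String, String)> {
    headers
        .iter()
        .map(|(name, value)| {
            // Header names are case-insensitive, so compare accordingly.
            let sensitive = SENSITIVE.iter().any(|s| name.eq_ignore_ascii_case(s));
            let shown = if sensitive { "[REDACTED]" } else { *value };
            (name.to_string(), shown.to_string())
        })
        .collect()
}

fn main() {
    let headers = [("Authorization", "Bearer abc123"), ("Accept", "application/pdf")];
    for (name, value) in redact_headers(&headers) {
        println!("{name}: {value}");
    }
}
```

Keeping the header name while dropping the value preserves enough context to debug a request without exposing the credential.
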
### Safe Patterns

These are acceptable to log:

- **Metadata only:** File paths, URLs without query params, content hashes
- **Diagnostic codes:** `TH-03`, `STRUCT_MISSING_KEY` (not the full message text)
- **Metrics:** Request duration, byte counts, error codes
- **Sanitized data:** Strings with known sensitive patterns removed (document the sanitization)

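"URLs without query params" can be sketched as a small helper that cuts a URL at the first `?` or `#`. The name `url_for_log` is hypothetical, and the sketch works on plain strings rather than a URL parser; in particular it does not remove credentials embedded in the authority part (`user:pass@host`), which would need separate handling.

```rust
/// Returns the URL up to, but not including, the first '?' or '#'.
/// Hypothetical helper for illustration; no URL parsing is performed.
fn url_for_log(url: &str) -> &str {
    let end = url
        .find(|c: char| c == '?' || c == '#')
        .unwrap_or(url.len());
    &url[..end]
}

fn main() {
    let url = "https://example.com/files/report.pdf?token=abc123";
    println!("fetching {}", url_for_log(url));
}
```
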
### Implementation Requirements

1. **Use `secrecy::SecretString`** for all credential values:
   ```rust
   use secrecy::SecretString;
   let password = SecretString::new("value".into());
   // Debug output shows "[REDACTED]" instead of the value; there is no Display impl
   ```

2. **Never log request bodies** that might contain user data. Log only:
   - Request method and path
   - Response status
   - Header names with redacted values

3. **CI gate enforcement:** A grep-based script scans every PR for forbidden patterns and fails on:
   - `log::info!` / `tracing::info!` / `println!` / `eprintln!` with variables named:
     - `password`, `token`, `credential`, `secret`, `api_key`, `auth_header`
   - Any log of `body`, `content`, `text`, `data` variables (requires reviewer judgment)

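The CI gate described in item 3 can be approximated as follows. This is a rough sketch of the idea in Rust, not the project's actual script (which is described as grep-based): the function `flag_lines` is a hypothetical name, and plain substring matching will produce both false positives and false negatives.

```rust
/// Log macros named in the policy.
const LOG_MACROS: [&str; 4] = ["log::info!", "tracing::info!", "println!", "eprintln!"];

/// Variable names that must not appear in a logging call.
const FORBIDDEN_NAMES: [&str; 6] =
    ["password", "token", "credential", "secret", "api_key", "auth_header"];

/// Returns the 1-based numbers of lines that contain both a log macro
/// and a forbidden name. Crude substring matching, for illustration only.
fn flag_lines(source: &str) -> Vec<usize> {
    source
        .lines()
        .enumerate()
        .filter(|(_, line)| {
            LOG_MACROS.iter().any(|m| line.contains(*m))
                && FORBIDDEN_NAMES.iter().any(|n| line.contains(*n))
        })
        .map(|(index, _)| index + 1)
        .collect()
}

fn main() {
    let source = "let n = 1;\nprintln!(\"{}\", password);\n";
    for line_no in flag_lines(source) {
        println!("line {line_no}: possible secret in log call");
    }
}
```

A line-based scan like this misses multi-line macro invocations, which is one reason reviewer judgment is still required.
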
### Verification

A fuzz test (`tests/log_secret_fuzz.rs`) runs with 10,000 random inputs and verifies that:
- No credential value appears in any captured log output
- SecretString values always render as `[REDACTED]`
- Authorization headers are redacted in request logs

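The property being verified can be shown with a tiny standard-library stand-in for a secret wrapper. The `Redacted` type below is hypothetical and exists only for this illustration; the project itself uses `secrecy::SecretString`, and this sketch does not reproduce that crate's API or its zeroize-on-drop behaviour.

```rust
use std::fmt;

/// Hypothetical std-only wrapper whose Debug output never shows the value.
struct Redacted(String);

impl Redacted {
    /// Explicit accessor, so every read of the secret is visible in code review.
    fn expose(&self) -> &str {
        &self.0
    }
}

impl fmt::Debug for Redacted {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Ignore the inner value entirely when formatting.
        f.write_str("[REDACTED]")
    }
}

fn main() {
    let password = Redacted("hunter2".to_string());
    println!("password = {:?}", password);
    println!("length = {}", password.expose().len());
}
```

Because the wrapper implements `Debug` but not `Display`, an accidental `{}` in a log call fails to compile instead of leaking the value.
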
### See Also

- [SECURITY.md](SECURITY.md): vulnerability reporting policy
- [Phase 6 audit logging policy](docs/plan/plan.md): full audit log design

## Getting Help

- **Documentation:** Check [`docs/`](docs/) for design docs and ADRs
- **Issues:** Search existing issues before opening a new one
- **Discussions:** Use GitHub Discussions for questions and RFCs
- **Security:** See [SECURITY.md](SECURITY.md) for vulnerability reporting

Thank you for contributing to pdftract!