docs(contributing): add Argo-CI caveat, DCO sign-off, and contributor templates
- Restructured CONTRIBUTING.md with all nine required sections:
  - Project licensing (MIT OR Apache-2.0, DCO sign-off required)
  - Code of conduct (Contributor Covenant v2.1)
  - Security reporting (link to SECURITY.md)
  - Development setup (with OCR dependencies)
  - Local validation checklist (6 commands matching pdftract-ci)
  - CI on forks caveat (maintainer-triggered, 48-hour response)
  - PR template requirements
  - Commit message style (Conventional Commits)
  - Issue triage
- Created CODE_OF_CONDUCT.md (Contributor Covenant v2.1)
- Created .github/PULL_REQUEST_TEMPLATE.md with required fields:
  - Linked issue or RFC
  - Scope statement (Phase / Acceptance Scenario)
  - Test plan
  - Manual-test evidence
  - Performance impact
- Created issue templates:
  - bug_report.md (with pdftract doctor output requirement)
  - feature_request.md (with use case and proposed solution)
  - performance_regression.md (with baseline vs current)
- Updated README.md with Contributing section linking to CONTRIBUTING.md
- Added footer links to CONTRIBUTING.md in all templates

Closes: pdftract-i9rk
Verification: notes/pdftract-i9rk.md
Signed-off-by: jedarden <github@jedarden.com>
parent db7fcf0097
commit 97fecb7b4b
8 changed files with 668 additions and 35 deletions
66
.github/ISSUE_TEMPLATE/bug_report.md
vendored
Normal file
@@ -0,0 +1,66 @@
+---
+name: Bug report
+about: Report a problem with pdftract
+title: '[BUG] '
+labels: bug
+assignees: ''
+---
+
+## Bug Description
+
+A clear and concise description of what the bug is.
+
+## PDF File That Triggered the Bug
+
+**IMPORTANT:** Please attach the PDF file that causes the bug. If the file is confidential, please sanitize it first or describe the issue in detail.
+
+- **File:** (attach PDF or describe the issue)
+- **File size:** (if applicable)
+- **PDF generator:** (e.g., Acrobat, Word, Ghostscript)
+
+## `pdftract doctor` Output
+
+**REQUIRED:** Run `pdftract doctor` and paste the output here.
+
+```text
+(paste output here)
+```
+
+## Steps to Reproduce
+
+1. Run this command: `...`
+2. With this PDF file: `...`
+3. See this error: `...`
+
+## Expected Behavior
+
+What should have happened?
+
+## Actual Behavior
+
+What actually happened? Include error messages, stack traces, or incorrect output.
+
+## Environment
+
+- **OS:** (e.g., Ubuntu 22.04, macOS 14, Windows 11)
+- **pdftract version:** (run `pdftract --version`)
+- **Installation method:** (e.g., cargo install, brew, compiled from source)
+- **Rust version:** (run `rustc --version`)
+
+## Additional Context
+
+Add any other context about the problem here:
+
+- Logs (attach or paste)
+- Screenshots (if applicable)
+- Related issues or PRs
+- Workarounds you've found
+
+---
+
+**Note:** For help with development or contributing to pdftract, see [`CONTRIBUTING.md`](../../CONTRIBUTING.md).
+
+- Logs (attach or paste)
+- Screenshots (if applicable)
+- Related issues or PRs
+- Workarounds you've found
48
.github/ISSUE_TEMPLATE/feature_request.md
vendored
Normal file
@@ -0,0 +1,48 @@
+---
+name: Feature request
+about: Suggest an enhancement or new feature for pdftract
+title: '[FEATURE] '
+labels: enhancement
+assignees: ''
+---
+
+## Feature Description
+
+A clear and concise description of the feature you'd like to see added.
+
+## Use Case
+
+Describe the specific problem this feature would solve. Who would benefit from this feature?
+
+**Example:**
+"As a user working with scientific papers, I need to extract tables as structured data so that I can analyze experimental results without manual transcription."
+
+## Proposed Solution
+
+How do you envision this feature working?
+
+- **API:** What would the API look like?
+- **CLI:** What flags or commands would be added?
+- **Output format:** JSON, Markdown, CSV, etc.?
+
+## Alternatives Considered
+
+Describe any alternative solutions or workarounds you've considered. Why aren't they sufficient?
+
+## Additional Context
+
+Add any other context about the feature request here:
+
+- Links to related issues or PRs
+- References to similar features in other tools
+- Example PDF files that demonstrate the need
+- Draft API designs or pseudocode
+
+---
+
+**Note:** For help with development or contributing to pdftract, see [`CONTRIBUTING.md`](../../CONTRIBUTING.md).
+
+- Links to related issues or PRs
+- References to similar features in other tools
+- Example PDF files that demonstrate the need
+- Draft API designs or pseudocode
80
.github/ISSUE_TEMPLATE/performance_regression.md
vendored
Normal file
@@ -0,0 +1,80 @@
+---
+name: Performance regression
+about: Report a slowdown or performance issue
+title: '[PERF] '
+labels: performance
+assignees: ''
+---
+
+## Performance Issue Description
+
+A clear and concise description of the performance problem.
+
+## Baseline vs Current Performance
+
+**BEFORE (working well):**
+- Version: (e.g., 0.5.0)
+- Processing time: (e.g., 2.5 seconds for a 100-page PDF)
+- Memory usage: (e.g., 150 MB peak)
+
+**AFTER (regression):**
+- Version: (e.g., 0.6.0)
+- Processing time: (e.g., 8 seconds for the same PDF)
+- Memory usage: (e.g., 600 MB peak)
+
+## Test Case
+
+Please provide:
+1. **PDF file** (attach or link to a representative file)
+2. **Command used:**
+   ```bash
+   pdftract <command> <file>
+   ```
+3. **Benchmark results** (before and after):
+   ```bash
+   # Use `hyperfine` or similar for accurate measurements
+   hyperfine 'pdftract old_version' 'pdftract new_version'
+   ```
+
+## Profiling Data (Optional but Helpful)
+
+If available, attach profiling output:
+```bash
+# Flamegraph (Linux)
+cargo install flamegraph
+cargo flamegraph --bin pdftract -- <args>
+
+# Instruments (macOS)
+instruments -t "Time Profiler" cargo run --release -- <args>
+
+# perf (Linux)
+perf record -g cargo run --release -- <args>
+perf report
+```
+
+## Environment
+
+- **OS:** (e.g., Ubuntu 22.04, macOS 14, Windows 11)
+- **Hardware:** (CPU, RAM - relevant for performance issues)
+- **pdftract version:** (run `pdftract --version`)
+- **Rust version:** (run `rustc --version`)
+
+## Suspected Cause
+
+If you have a hypothesis about what's causing the regression (e.g., a specific commit, a new dependency), please describe it here.
+
+## Additional Context
+
+Add any other context about the performance issue:
+
+- Logs or traces
+- Related issues or PRs
+- Workarounds (e.g., using an older version)
+
+---
+
+**Note:** For help with development or contributing to pdftract, see [`CONTRIBUTING.md`](../../CONTRIBUTING.md).
+
+- Logs or traces
+- Related issues or PRs
+- Workarounds (e.g., using an older version)
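The performance regression template asks reporters for a baseline figure and a current figure. As a rough sketch of how those two numbers could be turned into a slowdown factor, the snippet below uses the template's own example values (2.5 s before, 8 s after; 150 MB before, 600 MB after). The `slowdown` helper is a hypothetical name introduced for illustration and is not part of pdftract.

```bash
# Hypothetical helper: print how many times larger the current value is.
# $1: baseline value, $2: current value (same unit; baseline must be > 0)
slowdown() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before <= 0) exit 1          # refuse to divide by zero
    printf "%.1f\n", after / before
  }'
}

slowdown 2.5 8      # example processing times from the template
slowdown 150 600    # example peak-memory values from the template
```

Quoting a single factor alongside the raw numbers makes it easier to compare reports measured on different hardware.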
70
.github/PULL_REQUEST_TEMPLATE.md
vendored
Normal file
@@ -0,0 +1,70 @@
+# Pull Request
+
+## Linked Issue or RFC
+
+Closes #(issue number)
+
+## Scope Statement
+
+Which Phase / which Acceptance Scenario does this PR address?
+
+- **Phase:** (e.g., Phase 2 - Font Encoding)
+- **Acceptance Scenario:** (e.g., AS-2.3 - Embedded CMap with predefined CID->Unicode mapping)
+
+## Summary
+
+Brief description of what this PR does and why it's necessary.
+
+## Changes Made
+
+- List the main changes here
+- Include file paths and key functions modified
+- Note any breaking changes
+
+## Test Plan
+
+How did you verify this works?
+
+- [ ] Unit tests pass (`cargo test --workspace --features default`)
+- [ ] Integration tests pass
+- [ ] Manual testing completed
+
+### Test Evidence
+
+Attach or paste:
+- Terminal output from test runs
+- Screenshots (for UI changes)
+- Example PDF files processed (before/after)
+
+## Performance Impact
+
+If this PR touches hot-path code (parsing, text extraction, encoding resolution):
+
+- [ ] No performance impact (CI changes, documentation, etc.)
+- [ ] Performance improvement (include benchmarks)
+- [ ] Performance regression (include justification)
+
+### Benchmark Results (if applicable)
+
+```text
+(paste `cargo bench` output here)
+```
+
+## Checklist
+
+- [ ] My code follows the style guidelines of this project (`cargo fmt`)
+- [ ] I have performed a self-review of my code
+- [ ] I have commented my code where necessary, particularly in hard-to-understand areas
+- [ ] I have made corresponding changes to the documentation
+- [ ] My changes generate no new warnings (`cargo clippy`)
+- [ ] I have added tests that prove my fix is effective or that my feature works
+- [ ] New and existing tests pass locally with `cargo test --workspace --features default`
+- [ ] I have signed off my commits (`git commit -s`) per the DCO
+
+## Additional Notes
+
+Any additional context, screenshots, or considerations for the reviewer.
+
+---
+
+**Note:** See [`CONTRIBUTING.md`](../../CONTRIBUTING.md) for development setup, local validation checklist, and commit message guidelines.
133
CODE_OF_CONDUCT.md
Normal file
@@ -0,0 +1,133 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+We as members, contributors, and leaders pledge to make participation in our
+community a harassment-free experience for everyone, regardless of age, body
+size, visible or invisible disability, ethnicity, sex characteristics, gender
+identity and expression, level of experience, education, socio-economic status,
+nationality, personal appearance, race, caste, color, religion, or sexual
+identity and orientation.
+
+We pledge to act and interact in ways that contribute to an open, welcoming,
+diverse, inclusive, and healthy community.
+
+## Our Standards
+
+Examples of behavior that contributes to a positive environment for our
+community include:
+
+* Demonstrating empathy and kindness toward other people
+* Being respectful of differing opinions, viewpoints, and experiences
+* Giving and gracefully accepting constructive feedback
+* Accepting responsibility and apologizing to those affected by our mistakes,
+  and learning from the experience
+* Focusing on what is best not just for us as individuals, but for the overall
+  community
+
+Examples of unacceptable behavior include:
+
+* The use of sexualized language or imagery, and sexual attention or advances of
+  any kind
+* Trolling, insulting or derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or email address,
+  without their explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+  professional setting
+
+## Enforcement Responsibilities
+
+Community leaders are responsible for clarifying and enforcing our standards of
+acceptable behavior and will take appropriate and fair corrective action in
+response to any behavior that they deem inappropriate, threatening, offensive,
+or harmful.
+
+Community leaders have the right and responsibility to remove, edit, or reject
+comments, commits, code, wiki edits, issues, and other contributions that are
+not aligned to this Code of Conduct, and will communicate reasons for moderation
+decisions when appropriate.
+
+## Scope
+
+This Code of Conduct applies within all community spaces, and also applies when
+an individual is officially representing the community in public spaces.
+Examples of representing our community include using an official e-mail address,
+posting via an official social media account, or acting as an appointed
+representative at an online or offline event.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported to the community leaders responsible for enforcement at
+[security@jedarden.com](mailto:security@jedarden.com).
+
+All complaints will be reviewed and investigated promptly and fairly.
+
+All community leaders are obligated to respect the privacy and security of the
+reporter of any incident.
+
+## Enforcement Guidelines
+
+Community leaders will follow these Community Impact Guidelines in determining
+the consequences for any action they deem in violation of this Code of Conduct:
+
+### 1. Correction
+
+**Community Impact**: Use of inappropriate language or other behavior deemed
+unprofessional or unwelcome in the community.
+
+**Consequence**: A private, written warning from community leaders, providing
+clarity around the nature of the violation and an explanation of why the
+behavior was inappropriate. A public apology may be requested.
+
+### 2. Warning
+
+**Community Impact**: A violation through a single incident or series of
+actions.
+
+**Consequence**: A warning with consequences for continued behavior. No
+interaction with the people involved, including unsolicited interaction with
+those enforcing the Code of Conduct, for a specified period of time. This
+includes avoiding interactions in community spaces as well as external channels
+like social media. Violating these terms may lead to a temporary or permanent
+ban.
+
+### 3. Temporary Ban
+
+**Community Impact**: A serious violation of community standards, including
+sustained inappropriate behavior.
+
+**Consequence**: A temporary ban from any sort of interaction or public
+communication with the community for a specified period of time. No public or
+private interaction with the people involved, including unsolicited interaction
+with those enforcing the Code of Conduct, is allowed during this period.
+Violating these terms may lead to a permanent ban.
+
+### 4. Permanent Ban
+
+**Community Impact**: Demonstrating a pattern of violation of community
+standards, including sustained inappropriate behavior, harassment of an
+individual, or aggression toward or disparagement of classes of individuals.
+
+**Consequence**: A permanent ban from any sort of public interaction within the
+community.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+version 2.1, available at
+[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
+
+Community Impact Guidelines were inspired by
+[Mozilla's code of conduct enforcement ladder][mozilla coc].
+
+For answers to common questions about this code of conduct, see the FAQ at
+[https://www.contributor-covenant.org/faq][faq]. Translations are available at
+[https://www.contributor-covenant.org/translations][translations].
+
+[homepage]: https://www.contributor-covenant.org
+[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
+[mozilla coc]: https://github.com/mozilla/diversity
+[faq]: https://www.contributor-covenant.org/faq
+[translations]: https://www.contributor-covenant.org/translations
231
CONTRIBUTING.md
@@ -2,6 +2,106 @@
 
 Thank you for your interest in contributing to pdftract! This document covers the essential workflows for contributors.
 
+## Licensing and Sign-off
+
+pdftract is dual-licensed under **MIT OR Apache-2.0**. You may choose either license for your use.
+
+### Developer Certificate of Origin (DCO)
+
+This project requires a **Developer Certificate of Origin (DCO)** sign-off on all commits. This certifies that you wrote the code or have the right to pass it on as open-source.
+
+**To sign your commits, use `git commit --signoff` (or `git commit -s`):**
+
+```bash
+git commit -s -m "feat: add some feature"
+# The "Signed-off-by" trailer is added automatically
+```
+
+**No CLA is required.** The DCO is sufficient for this permissive-license project.
+
+### Apache NOTICE File
+
+The Apache-2.0 license includes a NOTICE file requirement, but pdftract does not ship a NOTICE file in the source distribution. This is intentional: the project maintains no contributor list outside of git history, and there are no third-party attribution notices required.
+
+**Downstream redistributors MAY add a NOTICE file** when distributing pdftract as part of their own product. If you choose to add one, it should include:
+- Attribution to the pdftract project
+- A link to the original source repository
+- Any modifications you made (if distributing a modified version)
+
+The absence of a NOTICE file in the upstream distribution does not violate the Apache-2.0 license; the NOTICE requirement applies only when there is something to notice.
+
+## Code of Conduct
+
+This project adopts the [Contributor Covenant v2.1](CODE_OF_CONDUCT.md). All contributors are expected to uphold this code of conduct.
+
+## Reporting Security Issues
+
+If you discover a security vulnerability, please do **NOT** open a public issue or pull request. Instead, report it privately:
+
+1. **Email (preferred):** [security@jedarden.com](mailto:security@jedarden.com)
+   - PGP-encrypted emails are strongly encouraged
+   - PGP key: [`docs/security/pgp-public-key.asc`](docs/security/pgp-public-key.asc)
+
+2. **GitHub Private Vulnerability Reporting:**
+   - Use the [Security tab](https://github.com/jedarden/pdftract/security/advisories)
+
+See [`SECURITY.md`](SECURITY.md) for our full disclosure policy, including:
+- Supported versions and security fix timeline
+- 90-day disclosure window
+- CVE assignment process
+- Safe harbor for good-faith researchers
+
+## Development Setup
+
+### Prerequisites
+
+- **Rust 1.78 or later** — See [Minimum Supported Rust Version (MSRV)](#minimum-supported-rust-version-msrv) below
+- **Git** — For cloning and committing
+
+### OCR Feature Dependencies (Optional)
+
+If you're developing OCR-related features (Phase 5), you'll need additional dependencies:
+
+**Linux (Debian/Ubuntu):**
+```bash
+sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr-eng
+```
+
+**macOS:**
+```bash
+brew install tesseract leptonica
+```
+
+**Windows:**
+- Install Tesseract from the official installers: https://github.com/UB-Mannheim/tesseract/wiki
+
+### Building
+
+```bash
+# Clone your fork
+git clone https://github.com/YOUR_USERNAME/pdftract.git
+cd pdftract
+
+# Build the workspace
+cargo build --workspace --locked
+
+# Build release binaries
+cargo build --release --workspace
+```
+
+### Testing
+
+```bash
+# Run all tests
+cargo test --workspace --features default
+
+# Run tests with output
+cargo test --workspace --features default -- --nocapture
+
+# Run a specific test
+cargo test --workspace --features default test_name
+```
+
 ## Minimum Supported Rust Version (MSRV)
 
 The **Minimum Supported Rust Version (MSRV)** for pdftract is **1.78**. This is the oldest Rust version that can successfully build the project. The MSRV is declared in `Cargo.toml` via the `rust-version` field and enforced in CI.
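The DCO section added to CONTRIBUTING.md requires a `Signed-off-by` trailer on every commit. A minimal sketch of checking a commit message for that trailer before pushing follows; `has_signoff` is a hypothetical helper name, and the pattern is an assumption about what a well-formed trailer looks like rather than the project's actual enforcement rule.

```bash
# Hypothetical helper: succeed only if the message carries a DCO trailer.
# $1: full commit message text
has_signoff() {
  # Look for a line of the form "Signed-off-by: Name <user@host>".
  printf '%s\n' "$1" | grep -Eq '^Signed-off-by: .+ <[^<>@ ]+@[^<>@ ]+>$'
}

msg='feat: add some feature

Signed-off-by: Jane Doe <jane@example.com>'

if has_signoff "$msg"; then
  echo "sign-off present"
else
  echo "sign-off missing"
fi
```

In a real checkout, the message of the latest commit could be supplied with `git log -1 --format=%B` in place of the literal string.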
@@ -71,61 +171,122 @@ The Rust ecosystem convention is that library crates should not check in `Cargo.
 - A single workspace-level `Cargo.lock` applies to all members.
 - Downstream consumers can still ignore the lockfile by using `cargo build --frozen` with their own lockfile, or by vendoring.
 
-## Development Workflow
+## Local Validation Before Opening a PR
 
-### Building
+Before submitting a pull request, please run the following commands locally to ensure your changes pass all quality gates:
 
 ```bash
-cargo build --release
-```
+# 1. Run all tests (must be all green)
+cargo test --workspace --features default
 
-### Testing
+# 2. Lint with clippy (no warnings allowed)
+cargo clippy --all-targets --features default -- -D warnings
 
-```bash
-cargo test --all
-```
+# 3. Check binary size (must be within budget; target <= 4 MB stripped)
+cargo bloat --release --features default
 
-### Linting
+# 4. Check for security advisories (no medium+ issues)
+cargo audit
 
-```bash
-cargo clippy --all-targets --all-features
+# 5. Check license compliance (no rejected licenses)
+cargo deny check licenses
+
+# 6. Check code formatting
 cargo fmt --check
 ```
 
-## Security
+**Why these checks?** These exact commands are run in the `pdftract-ci` Argo workflow. A green local run predicts a green CI run, reducing review iteration cycles.
 
-### Responsible Disclosure
+### Binary Size Budget
 
-If you discover a security vulnerability, please do **NOT** open a public issue or pull request. Instead, report it privately:
+The release binary must be <= 4 MB when stripped. `cargo bloat` helps identify functions contributing most to binary size. If your PR adds significant code:
+- Run `cargo bloat --release --features default --crates pdftract-cli`
+- Check the top functions in the output
+- Consider if large dependencies can be made optional or feature-gated
 
-1. **Email (preferred):** [security@jedarden.com](mailto:security@jedarden.com)
-   - PGP-encrypted emails are strongly encouraged
-   - PGP key: [`docs/security/pgp-public-key.asc`](docs/security/pgp-public-key.asc)
+## CI on Forks — The Argo-CI Caveat
 
-2. **GitHub Private Vulnerability Reporting:**
-   - Use the [Security tab](https://github.com/jedarden/pdftract/security/advisories)
+> **IMPORTANT:** Because CI runs on the private `iad-ci` cluster, external contributors cannot trigger CI from their fork.
 
-See [`SECURITY.md`](SECURITY.md) for our full disclosure policy, including:
-- Supported versions and security fix timeline
-- 90-day disclosure window
-- CVE assignment process
-- Safe harbor for good-faith researchers
+### How It Works
 
-### Supply-Chain Security
+1. **Fork and open a pull request** against `jedarden/pdftract:main`
+2. **A maintainer will trigger the `pdftract-ci` Argo workflow** against your branch
+3. **Results are posted as a PR comment** once the workflow completes
 
-This project uses `cargo-audit` and `cargo-deny` for supply-chain security. New direct dependencies require an ADR or written justification in the PR description.
+### Expected Response Time
 
-## Licensing
+- Maintainer-triggered CI: **within 48 hours**
+- You'll receive a comment on your PR with the full CI log
 
-pdftract is dual-licensed under **MIT OR Apache-2.0**. You may choose either license for your use.
+### Why This Model?
 
-### Apache NOTICE File
+The `iad-ci` cluster is a private Rackspace Spot cluster accessed via kubectl-proxy over Tailscale. External forks do not have credentials to access this cluster, so they cannot self-serve CI runs. This is unusual, but it allows us to run CI on infrastructure we control without exposing cluster credentials publicly.
 
-The Apache-2.0 license includes a NOTICE file requirement, but pdftract does not ship a NOTICE file in the source distribution. This is intentional: the project maintains no contributor list outside of git history, and there are no third-party attribution notices required.
+### Local Validation is Critical
 
-**Downstream redistributors MAY add a NOTICE file** when distributing pdftract as part of their own product. If you choose to add one, it should include:
-- Attribution to the pdftract project
-- A link to the original source repository
-- Any modifications you made (if distributing a modified version)
+Since you cannot trigger CI yourself, **please run the full local validation checklist** before opening your PR. This minimizes back-and-forth cycles when the maintainer-triggered CI fails.
 
-The absence of a NOTICE file in the upstream distribution does not violate the Apache-2.0 license; the NOTICE requirement applies only when there is something to notice.
+## Pull Request Template
+
+All pull requests must follow the [PR template](.github/PULL_REQUEST_TEMPLATE.md). The template requires:
+
+- **Linked issue or RFC** — Every PR should reference an issue or design document
+- **Scope statement** — Which Phase / which Acceptance Scenario does this address?
+- **Test plan** — How did you verify this works?
+- **Manual-test evidence** — Screenshots, terminal output, or example runs
+- **Performance impact** — If hot-path code was touched, include benchmark results
+
+## Commit Message Style
+
+This project uses **Conventional Commits** for commit messages. Release notes are auto-generated from commit history using `git-cliff`.
+
+### Format
+
+```
+<type>(<scope>): <short summary>
+
+[optional body]
+
+[optional footer]
+```
+
+### Types
+
+- `feat:` — A new feature
+- `fix:` — A bug fix
+- `perf:` — A performance improvement
+- `docs:` — Documentation changes
+- `chore:` — Maintenance tasks (updates, refactoring, tooling)
+- `test:` — Test changes
+- `BREAKING CHANGE:` — A breaking change (include in body or footer)
+
+### Examples
+
+```bash
+feat(ocr): add Tesseract integration for phase 5
+fix(font): handle missing /Widths in Type 3 fonts
+perf(extract): cache page tree parsing results
+docs(contributing): add Argo-CI caveat section
+chore(deps): upgrade lodepng to 0.9.0
+```
+
+## Issue Triage
+
+We use issue templates to ensure all necessary information is provided upfront. When opening an issue, please use the appropriate template:
+
+- **Bug report** — Must include `pdftract doctor` output
+- **Feature request** — Describe the use case and proposed API
+- **Performance regression** — Include before/after benchmarks
+- **Security advisory** — Redirects to private disclosure (see [Reporting Security Issues](#reporting-security-issues))
+
+See [`.github/ISSUE_TEMPLATE/`](.github/ISSUE_TEMPLATE/) for the full list.
+
+## Getting Help
+
+- **Documentation:** Check [`docs/`](docs/) for design docs and ADRs
+- **Issues:** Search existing issues before opening a new one
+- **Discussions:** Use GitHub Discussions for questions and RFCs
+- **Security:** See [SECURITY.md](SECURITY.md) for vulnerability reporting
+
+Thank you for contributing to pdftract!
@ -147,6 +147,15 @@ For responsible disclosure of security vulnerabilities, please email [security@j

> **NOTE:** The PGP key is currently a placeholder. The security contact must generate and publish a 4096-bit RSA key for `security@jedarden.com`. See `docs/security/pgp-public-key.asc` for generation instructions.

## Contributing

Contributions are welcome! Please see [`CONTRIBUTING.md`](CONTRIBUTING.md) for:

- Development setup and build instructions
- Local validation checklist before opening a PR
- Commit message style (Conventional Commits)
- CI on forks (maintainer-triggered Argo workflow)
- DCO sign-off requirement

## Status

Early development. See `docs/plan/` for the implementation roadmap.

66
notes/pdftract-i9rk.md
Normal file
@ -0,0 +1,66 @@
# Verification Note: pdftract-i9rk

## Bead
CONTRIBUTING.md — Argo-CI caveat for forks, local validation checklist, PR template requirements

## Summary
Created and updated contributor documentation to ensure first-time contributors can submit properly-formatted PRs without surprises.

## Files Created/Modified

### Created:
1. **CODE_OF_CONDUCT.md** — Contributor Covenant v2.1 (5,519 bytes)
2. **.github/PULL_REQUEST_TEMPLATE.md** — PR template with required fields (1,988 bytes)
3. **.github/ISSUE_TEMPLATE/bug_report.md** — Bug report template (1,532 bytes)
4. **.github/ISSUE_TEMPLATE/feature_request.md** — Feature request template (1,373 bytes)
5. **.github/ISSUE_TEMPLATE/performance_regression.md** — Performance regression template (1,974 bytes)

### Modified:
1. **CONTRIBUTING.md** — Completely restructured with all required sections (11,294 bytes)
2. **README.md** — Added Contributing section with link to CONTRIBUTING.md

## Acceptance Criteria Status

### PASS
- [x] CONTRIBUTING.md exists at repo root
- [x] All nine sections from bead description are present:
  1. Project licensing (dual MIT OR Apache-2.0)
  2. Code of conduct (link to CODE_OF_CONDUCT.md)
  3. Reporting security issues (link to SECURITY.md)
  4. Development setup (with OCR dependencies for Phase 5 features)
  5. Local validation expected before opening a PR (6 commands matching pdftract-ci)
  6. CI on forks (the Argo-CI caveat)
  7. PR template requirements
  8. Commit message style (Conventional Commits)
  9. Issue triage
- [x] The Argo-CI caveat is in a clearly visible Markdown blockquote (`> **IMPORTANT:**`)
- [x] Local-validation commands exactly match the pdftract-ci workflow steps
- [x] A first-time contributor can read CONTRIBUTING.md and submit a properly-formatted PR without surprises
- [x] Linked from README (Contributing section added)
- [x] Linked from .github/ISSUE_TEMPLATE/ (all templates have footer links)
- [x] Linked from PR template (footer link added)

### Key Features Implemented
- DCO sign-off requirement clearly documented with `git commit -s` example
- 48-hour maintainer-triggered CI response window documented
- Argo-CI caveat explains why external forks cannot self-trigger CI
- Local validation checklist matches CI workflow: test, clippy, bloat, audit, deny, fmt
- Conventional Commits format documented with examples
- All issue templates include link to CONTRIBUTING.md

## Files Modified
- CONTRIBUTING.md (restructured)
- README.md (added Contributing section)
- CODE_OF_CONDUCT.md (created)
- .github/PULL_REQUEST_TEMPLATE.md (created)
- .github/ISSUE_TEMPLATE/bug_report.md (created)
- .github/ISSUE_TEMPLATE/feature_request.md (created)
- .github/ISSUE_TEMPLATE/performance_regression.md (created)

## No WARN/FAIL
All acceptance criteria met.

## References
- Plan section: Release Engineering / Contributor Workflow, lines 3424-3433
- ADR-009 (Argo-only CI explains the fork caveat)
- Bead: pdftract-i9rk