docs(pdftract-1e5ud): add SDK conformance test documentation
Add documentation for the SDK conformance test suite in CONTRIBUTING.md and crates/pdftract-core/README.md, including:

- How to run the conformance tests
- All 9 SDK contract methods covered
- Feature-gated test behavior
- How to add new test cases

Signed-off-by: jedarden <github@jedarden.com>
parent c263189361
commit 46632a3c6c
2 changed files with 67 additions and 0 deletions
@@ -102,6 +102,44 @@ cargo test --workspace --features default -- --nocapture
cargo test --workspace --features default test_name
```

### SDK Conformance Tests

pdftract includes a shared SDK conformance suite that validates the public API contract across all SDK implementations (Python, Node, Go, Java, .NET, and Rust). The Rust SDK conformance tests run directly against `pdftract-core` to ensure the library's public API satisfies the documented SDK contract.

```bash
# Run the conformance suite
cargo test -p pdftract-core --test conformance

# Run with specific features
cargo test -p pdftract-core --test conformance --features ocr,profiles,remote,receipts
```

The conformance suite is defined in `tests/sdk-conformance/cases.json` and covers all 9 SDK contract methods:
- `extract` — Full extraction with structured output
- `extract_text` — Plain text extraction
- `extract_markdown` — Markdown-formatted extraction
- `extract_stream` — Streaming NDJSON extraction
- `search` — Pattern search (literal and regex)
- `get_metadata` — PDF metadata extraction
- `hash` — Content fingerprinting (SHA256)
- `classify` — Document classification
- `verify_receipt` — Receipt verification
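
Assuming the `method` field of a case stores these names as plain strings, a harness needs to map each name to a call. A minimal Rust sketch of that mapping follows; the `ContractMethod` enum and its `FromStr` impl are hypothetical helpers for illustration and are not taken from `pdftract-core`.

```rust
use std::str::FromStr;

/// The nine SDK contract methods listed above.
/// Hypothetical helper type, not part of pdftract's public API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContractMethod {
    Extract,
    ExtractText,
    ExtractMarkdown,
    ExtractStream,
    Search,
    GetMetadata,
    Hash,
    Classify,
    VerifyReceipt,
}

impl FromStr for ContractMethod {
    type Err = String;

    /// Maps a method name (as written in the list above) to a variant.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "extract" => Ok(Self::Extract),
            "extract_text" => Ok(Self::ExtractText),
            "extract_markdown" => Ok(Self::ExtractMarkdown),
            "extract_stream" => Ok(Self::ExtractStream),
            "search" => Ok(Self::Search),
            "get_metadata" => Ok(Self::GetMetadata),
            "hash" => Ok(Self::Hash),
            "classify" => Ok(Self::Classify),
            "verify_receipt" => Ok(Self::VerifyReceipt),
            // Unknown names are reported instead of silently ignored.
            other => Err(format!("unknown contract method: {other}")),
        }
    }
}
```

Rejecting unknown names with an error means a typo in a case file fails loudly rather than being skipped.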

Each test case includes:
- **fixture** — Input PDF path or URL
- **method** — Which SDK method to invoke
- **options** — Method-specific options (OCR, password, etc.)
- **expected** — Expected results with numeric tolerances
- **tolerances** — Per-field numeric comparison tolerances
- **feature** — Required feature flag (for conditional compilation)
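
One way to picture a case in Rust is sketched below, assuming flat numeric `expected` values and string-valued `options`. The `ConformanceCase` struct, its field types, and the `mismatches` helper are illustrative assumptions, not the suite's actual schema or code.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory shape of one conformance case.
/// Field names mirror the list above; the Rust types are assumptions.
#[derive(Debug, Clone)]
pub struct ConformanceCase {
    pub fixture: String,
    pub method: String,
    pub options: HashMap<String, String>,
    pub expected: HashMap<String, f64>,
    pub tolerances: HashMap<String, f64>,
    pub feature: Option<String>,
}

impl ConformanceCase {
    /// Returns the sorted names of expected fields that are missing from
    /// `actual` or differ from the expected value by more than the
    /// field's tolerance.
    pub fn mismatches(&self, actual: &HashMap<String, f64>) -> Vec<String> {
        let mut failed: Vec<String> = self
            .expected
            .iter()
            .filter(|(field, want)| {
                // A field without a tolerance entry must match exactly.
                let tol = self.tolerances.get(*field).copied().unwrap_or(0.0);
                match actual.get(*field) {
                    // Negated `<=` so that a NaN difference counts as a mismatch.
                    Some(got) => !((got - **want).abs() <= tol),
                    None => true,
                }
            })
            .map(|(field, _)| field.clone())
            .collect();
        failed.sort();
        failed
    }
}
```

Treating a missing tolerance as "must match exactly" is a choice made for this sketch; the real suite may apply a different default.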

Feature-gated tests are skipped automatically when the corresponding feature is not enabled at compile time:
- `ocr` — OCR-based extraction
- `decrypt` — Password-protected PDFs
- `profiles` — Document classification
- `receipts` — Receipt verification
- `remote` — URL-based remote fetch
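
The skip behavior can be sketched as a small gate function. This is an illustration only: `Gate` and `gate` are hypothetical names, and the enabled features are passed in as a slice so the example stays self-contained, whereas a real harness would more likely consult `cfg!(feature = "...")` at compile time.

```rust
/// Outcome of the feature gate for one conformance case.
/// Hypothetical type for illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum Gate {
    /// The case should be executed.
    Run,
    /// The case is skipped; the string explains why.
    Skip(String),
}

/// Decides whether a case runs, given the feature it requires (if any)
/// and the features enabled in the current build.
pub fn gate(required: Option<&str>, enabled: &[&str]) -> Gate {
    match required {
        // Cases without a feature requirement always run.
        None => Gate::Run,
        Some(feature) if enabled.contains(&feature) => Gate::Run,
        Some(feature) => Gate::Skip(format!("feature `{feature}` not enabled")),
    }
}
```

Returning a reason with `Skip` lets a harness report skipped cases separately from passes, so a build without `ocr`, for example, does not look like full coverage.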

## Minimum Supported Rust Version (MSRV)

The **Minimum Supported Rust Version (MSRV)** for pdftract is **1.78**. This is the oldest Rust version that can successfully build the project. The MSRV is declared in `Cargo.toml` via the `rust-version` field and enforced in CI.

@@ -26,6 +26,35 @@ The tradeoff—occasional merge conflicts when PRs update overlapping dependenci
- `extract`: Text extraction with provenance (bounding boxes, confidence scores)
- `ocr`: Tesseract integration for raster pages

## Testing

### SDK Conformance Tests

The `conformance` integration test validates that `pdftract-core`'s public API satisfies the SDK contract shared across all language implementations. The test rig runs shared conformance cases from `tests/sdk-conformance/cases.json` and verifies correct behavior for all 9 SDK contract methods.

```bash
# Run the conformance suite
cargo test --test conformance

# Run with specific features
cargo test --test conformance --features ocr,profiles,remote,receipts
```

The conformance suite covers:
- `extract` — Full extraction with structured Document output
- `extract_text` — Plain text extraction
- `extract_markdown` — Markdown-formatted extraction with tables and headings
- `extract_stream` — Streaming NDJSON extraction for large documents
- `search` — Pattern search with regex and case-insensitive options
- `get_metadata` — PDF metadata (page count, title, author, creator)
- `hash` — Content fingerprinting (SHA256) with fast hash variant
- `classify` — Document classification with category and confidence
- `verify_receipt` — Receipt verification against signed metadata

Each test case validates expected results with numeric tolerances for bounding boxes and confidence scores. Feature-gated tests (OCR, decryption, classification, receipts, remote) are skipped automatically when the corresponding feature is not enabled at compile time.

See `CONTRIBUTING.md` for more details on the conformance suite and on adding new test cases.

## Usage
```rust