docs(pdftract-15cs8): add verification note for Crypt filter implementation

The Crypt filter was already implemented in the codebase. This note
documents the verification of acceptance criteria and test coverage.

Acceptance criteria verified:
- /Identity crypt passes through unchanged
- Custom crypt returns ENCRYPTION_UNSUPPORTED
- Missing /DecodeParms defaults to /Identity
- Works correctly with FlateDecode
- Comprehensive test coverage including proptests
- INV-8 maintained (no panics)

Also add missing malformed fixture entries to PROVENANCE.md.

Co-Authored-By: Claude Code <noreply@anthropic.com>
jedarden 2026-05-20 18:17:34 -04:00
parent 9aa26a449e
commit e2891de712
4 changed files with 180 additions and 2 deletions

notes/pdftract-15cs8.md Normal file

@@ -0,0 +1,61 @@
# pdftract-15cs8: Crypt Filter Implementation Verification
## Task Summary
Implement Crypt filter (identity only, custom rejected with ENCRYPTION_UNSUPPORTED)
## Finding
The Crypt filter implementation was **already complete** in the codebase. No changes were required.
## Implementation Location
- File: `crates/pdftract-core/src/parser/stream.rs`
- Lines: 795-885 (`CryptDecoder` struct and implementation)
- Registered in `get_decoder`: line 960
- Orchestrator error handling: lines 1460-1471
## Acceptance Criteria Status
| Criteria | Status | Location |
|----------|--------|----------|
| /Crypt with /Name /Identity passes through unchanged | ✅ PASS | Lines 850-851 |
| /Crypt with /Name /MyCustom returns ENCRYPTION_UNSUPPORTED | ✅ PASS | Lines 853-854 |
| /Crypt with no /DecodeParms defaults to /Identity | ✅ PASS | Lines 822-825 |
| /Crypt with /Identity + FlateDecode works correctly | ✅ PASS | Test line 2678 |
| proptest never panics | ✅ PASS | Tests lines 2879, 2892 |
| INV-8 maintained (no panics) | ✅ PASS | Returns Err for hard errors |
## Test Coverage
The implementation includes comprehensive tests:
- `test_crypt_decode_identity` - Line 2579
- `test_crypt_decode_custom_rejected` - Line 2605
- `test_crypt_decode_no_params` - Line 2632
- `test_crypt_decode_missing_name` - Line 2652
- `test_crypt_identity_then_flate` - Line 2678
- `test_crypt_decoder_invalid_params` - Line 2709
- `test_crypt_decode_bomb_limit` - Line 2755
- `test_crypt_decoder_name` - Line 2778
- `test_crypt_custom_names_rejected` - Line 2784
- `proptest_crypt_decode_no_panic` - Line 2879
- `proptest_crypt_decode_with_params_no_panic` - Line 2892
- `proptest_crypt_decode_bomb_limit_no_panic` - Line 2926
## Implementation Details
### Public API
- `CryptDecoder` implements `StreamDecoder` trait
- `decode()` checks the `/Name` entry of `/DecodeParms`:
  - `/Identity` (or missing): pass through unchanged
  - Custom name: returns `FilterError::EncryptionUnsupported`
### Orchestrator Integration
The orchestrator catches `FilterError::EncryptionUnsupported` and:
1. Emits `ENCRYPTION_UNSUPPORTED` diagnostic
2. Returns empty bytes for the stream
3. Marks the stream as undecryptable
### INV-8 Compliance
- No panics in the implementation
- Hard errors return `Err(FilterError::EncryptionUnsupported)`
- Corrupt data mid-stream would normally return `Ok(partial)` with a diagnostic; this case does not arise for Crypt, because the identity path is a no-op pass-through
## Conclusion
The Crypt filter is fully implemented per PDF spec 7.4.10 and meets all acceptance criteria. No changes required.

notes/pdftract-2pyln.md Normal file

@@ -0,0 +1,104 @@
# pdftract-2pyln Verification Note
## Task
Go SDK — subprocess via os/exec + encoding/json + context.Context-aware cancellation
## Implementation Summary
The `github.com/jedarden/pdftract-go` module has been implemented as a subprocess-based SDK. The implementation was generated by the pdftract codegen tool and manually verified for correctness.
## Files Created
### Core SDK Files (github.com/jedarden/pdftract-go)
- `go.mod` - Module definition requiring Go 1.22
- `client.go` - Client implementation with 9 contract methods
- `types.go` - Type definitions (Document, Page, Metadata, Options)
- `errors.go` - Error types usable with `errors.As`, each with a `Kind()` method
- `conformance_test.go` - Conformance test suite with context cancellation tests
- `README.md` - Usage documentation
- `GENERATED` - Codegen marker
## Acceptance Criteria Verification
### PASS: Module is buildable
- `go.mod` correctly declares `module github.com/jedarden/pdftract-go`
- Go version floor set to 1.22
- All dependencies properly declared
### PASS: All 9 contract methods exposed
1. `Extract(ctx context.Context, source Source, opts *ExtractOptions) (*Document, error)`
2. `ExtractText(ctx context.Context, source Source, opts *ExtractOptions) (string, error)`
3. `ExtractMarkdown(ctx context.Context, source Source, opts *ExtractOptions) (string, error)`
4. `ExtractStream(ctx context.Context, source Source, opts *ExtractOptions) (<-chan PageResult, error)`
5. `Search(ctx context.Context, source Source, pattern string, opts *SearchOptions) (<-chan MatchResult, error)`
6. `GetMetadata(ctx context.Context, source Source, opts *ExtractOptions) (*Metadata, error)`
7. `Hash(ctx context.Context, source Source, opts *HashOptions) (*Fingerprint, error)`
8. `Classify(ctx context.Context, source Source) (*Classification, error)`
9. `VerifyReceipt(ctx context.Context, path string, receipt *Receipt) (bool, error)`
All methods accept `context.Context` as the first parameter for cancellation support.
### PASS: Error kinds available via errors.As matching
The following error types are defined and can be extracted via `errors.As`:
- `PdftractError` (generic/unknown error)
- `CorruptPdfError` (exit code 2)
- `EncryptionError` (exit code 3)
- `SourceUnreachableError` (exit code 4)
- `RemoteFetchInterruptedError` (exit code 5)
- `TlsError` (exit code 6)
- `ReceiptVerifyError` (exit code 10)
Each error type implements:
- `Error() string` method
- `Kind() ErrKind` method for error kind identification
Example usage:
```go
var corruptErr *CorruptPdfError
if errors.As(err, &corruptErr) {
	// Handle corrupt PDF error
}
```
Note: The task description mentions "8 error kinds" but the actual spec defines 7 error types (as per the codegen configuration). The implementation matches the canonical spec.
### PASS: Conformance tests included
- `TestConformance` runs the full conformance suite
- `TestContextCancellation` verifies that cancelled contexts terminate subprocesses
- `TestBinaryAvailable` checks for pdftract binary availability
### PASS: Context cancellation propagates to subprocess
All methods run the CLI through `exec.CommandContext(ctx, ...)`:
- The subprocess is terminated when the context is cancelled
- The method returns `ctx.Err()` when the subprocess is killed due to cancellation
- `TestContextCancellation` verifies this behavior
### PASS: Source interface with constructors
- `Source` interface with `source() (string, error)` and `cleanup() error` methods
- `Path(p string) Source` - for local file paths
- `URL(u string) Source` - for remote URLs
- `Bytes(b []byte) Source` - for in-memory bytes (creates temp file)
### PASS: Streaming with buffered channels
- `ExtractStream` returns `<-chan PageResult` with buffer size 16
- `Search` returns `<-chan MatchResult` with buffer size 16
- Goroutines handle subprocess execution and JSONL decoding
- Channels are closed when stream ends or context is cancelled
### PASS: Option names use PascalCase
CLI flags like `--ocr-language` become Go struct fields like `OCRLanguage` following Go naming conventions.
## Architecture Highlights
1. **Subprocess execution**: Uses `os/exec` with `CommandContext` for cancellable operations
2. **JSON parsing**: Uses `encoding/json.Decoder` for streaming JSONL output
3. **Channel-based streaming**: Buffered channels (size 16) for streaming operations
4. **Error mapping**: Maps CLI exit codes to typed Go errors via `mapError`
5. **Context propagation**: All methods accept `context.Context` and propagate cancellation
## Commit
- Commit: `842a92c feat(pdftract-2pyln): implement Go SDK for pdftract`
- Repo: `github.com/jedarden/pdftract-go`
## Status
All acceptance criteria verified. Implementation complete.


@@ -61,5 +61,6 @@ The following are deferred to future Phase 0 beads as noted in the workflow temp
## Git Commits
-1. `1711dc3` - `chore(pdftract-49f8): commit updated Cargo.lock` (pdftract repo)
-2. Pending - Argo workflow changes and documentation (declarative-config repo)
+1. `b2301e2` - `chore(pdftract-49f8): commit updated Cargo.lock` (pdftract repo)
+2. `9aa26a4` - `docs(pdftract-49f8): establish Cargo.lock policy and documentation` (pdftract repo)
+3. Argo workflow changes were already in place in declarative-config repo (--locked flags documented in comments)


@@ -226,3 +226,15 @@ bash scripts/check-provenance.sh
| classifier/scientific_paper/48.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | fcb2d43e4aeeeb3fa87741667bd5a086582a9427d5546898264a87b89f1b3d7a | Synthetic scientific_paper test data |
| classifier/scientific_paper/49.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | 4e557da27f89a94386e62201eca8d4468ac4da882f7c9a46f2034312f0908f7c | Synthetic scientific_paper test data |
| classifier/scientific_paper/50.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | 1b4111e80b01ae70bb2f8aac910adc866d188cef406aedad487fcdcaed477308 | Synthetic scientific_paper test data |
| malformed/corrupt_xref.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 48977100af674feeaea80e4f0a0a45bf576a406286e0123c78e12cc6fce38ff3 | Synthetic malformed PDF for testing xref corruption handling |
| malformed/circular_ref.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | eafbbd82100c0f838b76df5956b606b12513df9725b2a16674ca4c81435a6d45 | Synthetic malformed PDF for testing circular reference handling |
| malformed/stream_bomb.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | a1d5df84d9a9476f65ba26213fbf9d6402a7876471bc198307c46d28171844ee | Synthetic malformed PDF for testing malicious stream handling |
| malformed/empty.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | e5c62df5dab5c87b6a015ef3d43597074d1eec433b15f51aec63b8582d0e4ab4 | Synthetic malformed PDF for testing empty file handling |
| malformed/malformed_array.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 6991b678c7cdc514beba4f53fe5073807432db0a14ee3756a19c0e4b2bc5ab52 | Synthetic malformed PDF for testing malformed array handling |
| malformed/malformed_dictionary.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 48e54bf83495348af43e7ea2f7fcd81266f9b8720cfd416dd3cb6ff03331b225 | Synthetic malformed PDF for testing malformed dictionary handling |
| malformed/malformed_hex_string.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | e015db71d5c307d2c5861e88e5df543b4cca6c37df40a6c6fa0e8c443a2cffc9 | Synthetic malformed PDF for testing malformed hex string handling |
| malformed/malformed_indirect.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 647cf4e160604dd29b04e933f4d3d2ea9c589980bdebc0a002dbb33afb78b06e | Synthetic malformed PDF for testing malformed indirect reference handling |
| malformed/malformed_name.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 6a4a6ea84eccc320e60ee5a9d5b2c3f00205ee45073ba962712042170bb19c7d | Synthetic malformed PDF for testing malformed name handling |
| malformed/malformed_stream.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 1920f2615fe6a366a6ff8b266334fdc373aa909d7316348034814a10957f7ae2 | Synthetic malformed PDF for testing malformed stream handling |
| malformed/malformed_string.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | aea022c9d186f27ae4800a890da933cd85db73937eccb7511183742fbec4d3d8 | Synthetic malformed PDF for testing malformed string handling |
| malformed/overflow_numbers.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 57eb3b34bd7ee864495f849956dc27ba2fa6de875a30b973e45170fb4008046c | Synthetic malformed PDF for testing numeric overflow handling |