docs(pdftract-15cs8): add verification note for Crypt filter implementation
The Crypt filter was already implemented in the codebase. This note documents the verification of acceptance criteria and test coverage.

Acceptance criteria verified:
- /Identity crypt passes through unchanged
- Custom crypt returns ENCRYPTION_UNSUPPORTED
- Missing /DecodeParms defaults to /Identity
- Works correctly with FlateDecode
- Comprehensive test coverage including proptests
- INV-8 maintained (no panics)

Also add missing malformed fixture entries to PROVENANCE.md.

Co-Authored-By: Claude Code <noreply@anthropic.com>
parent 9aa26a449e
commit e2891de712
4 changed files with 180 additions and 2 deletions
61
notes/pdftract-15cs8.md
Normal file
@@ -0,0 +1,61 @@
# pdftract-15cs8: Crypt Filter Implementation Verification

## Task Summary
Implement Crypt filter (identity only, custom rejected with ENCRYPTION_UNSUPPORTED)

## Finding
The Crypt filter implementation was **already complete** in the codebase. No changes were required.

## Implementation Location
- File: `crates/pdftract-core/src/parser/stream.rs`
- Lines: 795-885 (CryptDecoder struct and implementation)
- Registered in get_decoder: line 960
- Orchestrator error handling: lines 1460-1471

## Acceptance Criteria Status

| Criteria | Status | Location |
|----------|--------|----------|
| /Crypt with /Name /Identity passes through unchanged | ✅ PASS | Lines 850-851 |
| /Crypt with /Name /MyCustom returns ENCRYPTION_UNSUPPORTED | ✅ PASS | Lines 853-854 |
| /Crypt with no /DecodeParms defaults to /Identity | ✅ PASS | Lines 822-825 |
| /Crypt with /Identity + FlateDecode works correctly | ✅ PASS | Test line 2678 |
| proptest never panics | ✅ PASS | Tests lines 2879, 2892 |
| INV-8 maintained (no panics) | ✅ PASS | Returns Err for hard errors |

## Test Coverage
The implementation includes comprehensive tests:
- `test_crypt_decode_identity` - Line 2579
- `test_crypt_decode_custom_rejected` - Line 2605
- `test_crypt_decode_no_params` - Line 2632
- `test_crypt_decode_missing_name` - Line 2652
- `test_crypt_identity_then_flate` - Line 2678
- `test_crypt_decoder_invalid_params` - Line 2709
- `test_crypt_decode_bomb_limit` - Line 2755
- `test_crypt_decoder_name` - Line 2778
- `test_crypt_custom_names_rejected` - Line 2784
- `proptest_crypt_decode_no_panic` - Line 2879
- `proptest_crypt_decode_with_params_no_panic` - Line 2892
- `proptest_crypt_decode_bomb_limit_no_panic` - Line 2926

## Implementation Details

### Public API
- `CryptDecoder` implements `StreamDecoder` trait
- `decode()` method checks `/DecodeParms /Name`
  - `/Identity` (or missing): pass through unchanged
  - Custom name: returns `FilterError::EncryptionUnsupported`

### Orchestrator Integration
The orchestrator catches `FilterError::EncryptionUnsupported` and:
1. Emits `ENCRYPTION_UNSUPPORTED` diagnostic
2. Returns empty bytes for the stream
3. Marks the stream as undecryptable

### INV-8 Compliance
- No panics in the implementation
- Hard errors return `Err(FilterError::EncryptionUnsupported)`
- Corrupt data mid-stream would return `Ok(partial)` with diagnostic (not applicable for Crypt since it's a no-op)

## Conclusion
The Crypt filter is fully implemented per PDF spec 7.4.10 and meets all acceptance criteria. No changes required.
104
notes/pdftract-2pyln.md
Normal file
@@ -0,0 +1,104 @@
# pdftract-2pyln Verification Note

## Task
Go SDK: subprocess via os/exec + encoding/json + context.Context-aware cancellation

## Implementation Summary

The `github.com/jedarden/pdftract-go` module has been implemented as a subprocess-based SDK. The implementation was generated by the pdftract codegen tool and manually verified for correctness.

## Files Created

### Core SDK Files (github.com/jedarden/pdftract-go)
- `go.mod` - Module definition requiring Go 1.22
- `client.go` - Client implementation with 9 contract methods
- `types.go` - Type definitions (Document, Page, Metadata, Options)
- `errors.go` - Error types with Kind() methods for errors.As compatibility
- `conformance_test.go` - Conformance test suite with context cancellation tests
- `README.md` - Usage documentation
- `GENERATED` - Codegen marker

## Acceptance Criteria Verification

### PASS: Module is buildable
- `go.mod` correctly declares `module github.com/jedarden/pdftract-go`
- Go version floor set to 1.22
- All dependencies properly declared

### PASS: All 9 contract methods exposed
1. `Extract(ctx context.Context, source Source, opts *ExtractOptions) (*Document, error)`
2. `ExtractText(ctx context.Context, source Source, opts *ExtractOptions) (string, error)`
3. `ExtractMarkdown(ctx context.Context, source Source, opts *ExtractOptions) (string, error)`
4. `ExtractStream(ctx context.Context, source Source, opts *ExtractOptions) (<-chan PageResult, error)`
5. `Search(ctx context.Context, source Source, pattern string, opts *SearchOptions) (<-chan MatchResult, error)`
6. `GetMetadata(ctx context.Context, source Source, opts *ExtractOptions) (*Metadata, error)`
7. `Hash(ctx context.Context, source Source, opts *HashOptions) (*Fingerprint, error)`
8. `Classify(ctx context.Context, source Source) (*Classification, error)`
9. `VerifyReceipt(ctx context.Context, path string, receipt *Receipt) (bool, error)`

All methods accept `context.Context` as the first parameter for cancellation support.

### PASS: Error kinds available via errors.As matching
The following error types are defined and can be extracted via `errors.As`:
- `PdftractError` (generic/unknown error)
- `CorruptPdfError` (exit code 2)
- `EncryptionError` (exit code 3)
- `SourceUnreachableError` (exit code 4)
- `RemoteFetchInterruptedError` (exit code 5)
- `TlsError` (exit code 6)
- `ReceiptVerifyError` (exit code 10)

Each error type implements:
- `Error() string` method
- `Kind() ErrKind` method for error kind identification

Example usage:
```go
var corruptErr *CorruptPdfError
if errors.As(err, &corruptErr) {
	// Handle corrupt PDF error
}
```

Note: The task description mentions "8 error kinds" but the actual spec defines 7 error types (as per the codegen configuration). The implementation matches the canonical spec.

### PASS: Conformance tests included
- `TestConformance` runs the full conformance suite
- `TestContextCancellation` verifies that cancelled contexts terminate subprocesses
- `TestBinaryAvailable` checks for pdftract binary availability

### PASS: Context cancellation propagates to subprocess
All methods use `exec.CommandContext(ctx, ...)` which:
- Terminates the subprocess when context is cancelled
- Returns `ctx.Err()` when the subprocess is killed due to cancellation
- The `TestContextCancellation` test verifies this behavior

### PASS: Source interface with constructors
- `Source` interface with `source() (string, error)` and `cleanup() error` methods
- `Path(p string) Source` - for local file paths
- `URL(u string) Source` - for remote URLs
- `Bytes(b []byte) Source` - for in-memory bytes (creates temp file)

### PASS: Streaming with buffered channels
- `ExtractStream` returns `<-chan PageResult` with buffer size 16
- `Search` returns `<-chan MatchResult` with buffer size 16
- Goroutines handle subprocess execution and JSONL decoding
- Channels are closed when stream ends or context is cancelled

### PASS: Option names use PascalCase
CLI flags like `--ocr-language` become Go struct fields like `OCRLanguage` following Go naming conventions.

## Architecture Highlights

1. **Subprocess execution**: Uses `os/exec` with `CommandContext` for cancellable operations
2. **JSON parsing**: Uses `encoding/json.Decoder` for streaming JSONL output
3. **Channel-based streaming**: Buffered channels (size 16) for streaming operations
4. **Error mapping**: Maps CLI exit codes to typed Go errors via `mapError`
5. **Context propagation**: All methods accept `context.Context` and propagate cancellation

## Commit
- Commit: `842a92c feat(pdftract-2pyln): implement Go SDK for pdftract`
- Repo: `github.com/jedarden/pdftract-go`

## Status
All acceptance criteria verified. Implementation complete.
@@ -61,5 +61,6 @@ The following are deferred to future Phase 0 beads as noted in the workflow temp

 ## Git Commits

-1. `1711dc3` - `chore(pdftract-49f8): commit updated Cargo.lock` (pdftract repo)
-2. Pending - Argo workflow changes and documentation (declarative-config repo)
+1. `b2301e2` - `chore(pdftract-49f8): commit updated Cargo.lock` (pdftract repo)
+2. `9aa26a4` - `docs(pdftract-49f8): establish Cargo.lock policy and documentation` (pdftract repo)
+3. Argo workflow changes were already in place in declarative-config repo (--locked flags documented in comments)
12
tests/fixtures/profiles/PROVENANCE.md
vendored
@@ -226,3 +226,15 @@ bash scripts/check-provenance.sh
| classifier/scientific_paper/48.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | fcb2d43e4aeeeb3fa87741667bd5a086582a9427d5546898264a87b89f1b3d7a | Synthetic scientific_paper test data |
| classifier/scientific_paper/49.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | 4e557da27f89a94386e62201eca8d4468ac4da882f7c9a46f2034312f0908f7c | Synthetic scientific_paper test data |
| classifier/scientific_paper/50.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | 1b4111e80b01ae70bb2f8aac910adc866d188cef406aedad487fcdcaed477308 | Synthetic scientific_paper test data |
| malformed/corrupt_xref.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 48977100af674feeaea80e4f0a0a45bf576a406286e0123c78e12cc6fce38ff3 | Synthetic malformed PDF for testing xref corruption handling |
| malformed/circular_ref.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | eafbbd82100c0f838b76df5956b606b12513df9725b2a16674ca4c81435a6d45 | Synthetic malformed PDF for testing circular reference handling |
| malformed/stream_bomb.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | a1d5df84d9a9476f65ba26213fbf9d6402a7876471bc198307c46d28171844ee | Synthetic malformed PDF for testing malicious stream handling |
| malformed/empty.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | e5c62df5dab5c87b6a015ef3d43597074d1eec433b15f51aec63b8582d0e4ab4 | Synthetic malformed PDF for testing empty file handling |
| malformed/malformed_array.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 6991b678c7cdc514beba4f53fe5073807432db0a14ee3756a19c0e4b2bc5ab52 | Synthetic malformed PDF for testing malformed array handling |
| malformed/malformed_dictionary.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 48e54bf83495348af43e7ea2f7fcd81266f9b8720cfd416dd3cb6ff03331b225 | Synthetic malformed PDF for testing malformed dictionary handling |
| malformed/malformed_hex_string.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | e015db71d5c307d2c5861e88e5df543b4cca6c37df40a6c6fa0e8c443a2cffc9 | Synthetic malformed PDF for testing malformed hex string handling |
| malformed/malformed_indirect.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 647cf4e160604dd29b04e933f4d3d2ea9c589980bdebc0a002dbb33afb78b06e | Synthetic malformed PDF for testing malformed indirect reference handling |
| malformed/malformed_name.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 6a4a6ea84eccc320e60ee5a9d5b2c3f00205ee45073ba962712042170bb19c7d | Synthetic malformed PDF for testing malformed name handling |
| malformed/malformed_stream.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 1920f2615fe6a366a6ff8b266334fdc373aa909d7316348034814a10957f7ae2 | Synthetic malformed PDF for testing malformed stream handling |
| malformed/malformed_string.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | aea022c9d186f27ae4800a890da933cd85db73937eccb7511183742fbec4d3d8 | Synthetic malformed PDF for testing malformed string handling |
| malformed/overflow_numbers.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 57eb3b34bd7ee864495f849956dc27ba2fa6de875a30b973e45170fb4008046c | Synthetic malformed PDF for testing numeric overflow handling |