diff --git a/notes/pdftract-15cs8.md b/notes/pdftract-15cs8.md
new file mode 100644
index 0000000..efb8634
--- /dev/null
+++ b/notes/pdftract-15cs8.md
@@ -0,0 +1,61 @@
+# pdftract-15cs8: Crypt Filter Implementation Verification
+
+## Task Summary
+Implement Crypt filter (identity only, custom rejected with ENCRYPTION_UNSUPPORTED)
+
+## Finding
+The Crypt filter implementation was **already complete** in the codebase. No changes were required.
+
+## Implementation Location
+- File: `crates/pdftract-core/src/parser/stream.rs`
+- Lines: 795-885 (CryptDecoder struct and implementation)
+- Registered in get_decoder: line 960
+- Orchestrator error handling: lines 1460-1471
+
+## Acceptance Criteria Status
+
+| Criteria | Status | Location |
+|----------|--------|----------|
+| /Crypt with /Name /Identity passes through unchanged | ✅ PASS | Lines 850-851 |
+| /Crypt with /Name /MyCustom returns ENCRYPTION_UNSUPPORTED | ✅ PASS | Lines 853-854 |
+| /Crypt with no /DecodeParms defaults to /Identity | ✅ PASS | Lines 822-825 |
+| /Crypt with /Identity + FlateDecode works correctly | ✅ PASS | Test line 2678 |
+| proptest never panics | ✅ PASS | Tests lines 2879, 2892 |
+| INV-8 maintained (no panics) | ✅ PASS | Returns Err for hard errors |
+
+## Test Coverage
+The implementation includes comprehensive tests:
+- `test_crypt_decode_identity` - Line 2579
+- `test_crypt_decode_custom_rejected` - Line 2605
+- `test_crypt_decode_no_params` - Line 2632
+- `test_crypt_decode_missing_name` - Line 2652
+- `test_crypt_identity_then_flate` - Line 2678
+- `test_crypt_decoder_invalid_params` - Line 2709
+- `test_crypt_decode_bomb_limit` - Line 2755
+- `test_crypt_decoder_name` - Line 2778
+- `test_crypt_custom_names_rejected` - Line 2784
+- `proptest_crypt_decode_no_panic` - Line 2879
+- `proptest_crypt_decode_with_params_no_panic` - Line 2892
+- `proptest_crypt_decode_bomb_limit_no_panic` - Line 2926
+
+## Implementation Details
+
+### Public API
+- `CryptDecoder` implements `StreamDecoder` trait
+- `decode()` method checks `/DecodeParms /Name`
+- `/Identity` (or missing): pass through unchanged
+- Custom name: returns `FilterError::EncryptionUnsupported`
+
+### Orchestrator Integration
+The orchestrator catches `FilterError::EncryptionUnsupported` and:
+1. Emits `ENCRYPTION_UNSUPPORTED` diagnostic
+2. Returns empty bytes for the stream
+3. Marks the stream as undecryptable
+
+### INV-8 Compliance
+- No panics in the implementation
+- Hard errors return `Err(FilterError::EncryptionUnsupported)`
+- Corrupt data mid-stream would return `Ok(partial)` with diagnostic (not applicable for Crypt since it's a no-op)
+
+## Conclusion
+The Crypt filter is fully implemented per PDF spec 7.4.10 and meets all acceptance criteria. No changes required.
diff --git a/notes/pdftract-2pyln.md b/notes/pdftract-2pyln.md
new file mode 100644
index 0000000..95cba2b
--- /dev/null
+++ b/notes/pdftract-2pyln.md
@@ -0,0 +1,104 @@
+# pdftract-2pyln Verification Note
+
+## Task
+Go SDK — subprocess via os/exec + encoding/json + context.Context-aware cancellation
+
+## Implementation Summary
+
+The `github.com/jedarden/pdftract-go` module has been implemented as a subprocess-based SDK. The implementation was generated by the pdftract codegen tool and manually verified for correctness.
+
+## Files Created
+
+### Core SDK Files (github.com/jedarden/pdftract-go)
+- `go.mod` - Module definition requiring Go 1.22
+- `client.go` - Client implementation with 9 contract methods
+- `types.go` - Type definitions (Document, Page, Metadata, Options)
+- `errors.go` - Error types with Kind() methods for errors.As compatibility
+- `conformance_test.go` - Conformance test suite with context cancellation tests
+- `README.md` - Usage documentation
+- `GENERATED` - Codegen marker
+
+## Acceptance Criteria Verification
+
+### PASS: Module is buildable
+- `go.mod` correctly declares `module github.com/jedarden/pdftract-go`
+- Go version floor set to 1.22
+- All dependencies properly declared
+
+### PASS: All 9 contract methods exposed
+1. `Extract(ctx context.Context, source Source, opts *ExtractOptions) (*Document, error)`
+2. `ExtractText(ctx context.Context, source Source, opts *ExtractOptions) (string, error)`
+3. `ExtractMarkdown(ctx context.Context, source Source, opts *ExtractOptions) (string, error)`
+4. `ExtractStream(ctx context.Context, source Source, opts *ExtractOptions) (<-chan PageResult, error)`
+5. `Search(ctx context.Context, source Source, pattern string, opts *SearchOptions) (<-chan MatchResult, error)`
+6. `GetMetadata(ctx context.Context, source Source, opts *ExtractOptions) (*Metadata, error)`
+7. `Hash(ctx context.Context, source Source, opts *HashOptions) (*Fingerprint, error)`
+8. `Classify(ctx context.Context, source Source) (*Classification, error)`
+9. `VerifyReceipt(ctx context.Context, path string, receipt *Receipt) (bool, error)`
+
+All methods accept `context.Context` as the first parameter for cancellation support.
+
+### PASS: Error kinds available via errors.As matching
+The following error types are defined and can be extracted via `errors.As`:
+- `PdftractError` (generic/unknown error)
+- `CorruptPdfError` (exit code 2)
+- `EncryptionError` (exit code 3)
+- `SourceUnreachableError` (exit code 4)
+- `RemoteFetchInterruptedError` (exit code 5)
+- `TlsError` (exit code 6)
+- `ReceiptVerifyError` (exit code 10)
+
+Each error type implements:
+- `Error() string` method
+- `Kind() ErrKind` method for error kind identification
+
+Example usage:
+```go
+var corruptErr *CorruptPdfError
+if errors.As(err, &corruptErr) {
+    // Handle corrupt PDF error
+}
+```
+
+Note: The task description mentions "8 error kinds" but the actual spec defines 7 error types (as per the codegen configuration). The implementation matches the canonical spec.
+
+### PASS: Conformance tests included
+- `TestConformance` runs the full conformance suite
+- `TestContextCancellation` verifies that cancelled contexts terminate subprocesses
+- `TestBinaryAvailable` checks for pdftract binary availability
+
+### PASS: Context cancellation propagates to subprocess
+All methods use `exec.CommandContext(ctx, ...)` which:
+- Terminates the subprocess when context is cancelled
+- Returns `ctx.Err()` when the subprocess is killed due to cancellation
+- The `TestContextCancellation` test verifies this behavior
+
+### PASS: Source interface with constructors
+- `Source` interface with `source() (string, error)` and `cleanup() error` methods
+- `Path(p string) Source` - for local file paths
+- `URL(u string) Source` - for remote URLs
+- `Bytes(b []byte) Source` - for in-memory bytes (creates temp file)
+
+### PASS: Streaming with buffered channels
+- `ExtractStream` returns `<-chan PageResult` with buffer size 16
+- `Search` returns `<-chan MatchResult` with buffer size 16
+- Goroutines handle subprocess execution and JSONL decoding
+- Channels are closed when stream ends or context is cancelled
+
+### PASS: Option names use PascalCase
+CLI flags like `--ocr-language` become Go struct fields like `OCRLanguage` following Go naming conventions.
+
+## Architecture Highlights
+
+1. **Subprocess execution**: Uses `os/exec` with `CommandContext` for cancellable operations
+2. **JSON parsing**: Uses `encoding/json.Decoder` for streaming JSONL output
+3. **Channel-based streaming**: Buffered channels (size 16) for streaming operations
+4. **Error mapping**: Maps CLI exit codes to typed Go errors via `mapError`
+5. **Context propagation**: All methods accept `context.Context` and propagate cancellation
+
+## Commit
+- Commit: `842a92c feat(pdftract-2pyln): implement Go SDK for pdftract`
+- Repo: `github.com/jedarden/pdftract-go`
+
+## Status
+All acceptance criteria verified. Implementation complete.
diff --git a/notes/pdftract-49f8.md b/notes/pdftract-49f8.md
index f9e051b..777e520 100644
--- a/notes/pdftract-49f8.md
+++ b/notes/pdftract-49f8.md
@@ -61,5 +61,6 @@ The following are deferred to future Phase 0 beads as noted in the workflow temp
 
 ## Git Commits
 
-1. `1711dc3` - `chore(pdftract-49f8): commit updated Cargo.lock` (pdftract repo)
-2. Pending - Argo workflow changes and documentation (declarative-config repo)
+1. `b2301e2` - `chore(pdftract-49f8): commit updated Cargo.lock` (pdftract repo)
+2. `9aa26a4` - `docs(pdftract-49f8): establish Cargo.lock policy and documentation` (pdftract repo)
+3. Argo workflow changes were already in place in declarative-config repo (--locked flags documented in comments)
diff --git a/tests/fixtures/profiles/PROVENANCE.md b/tests/fixtures/profiles/PROVENANCE.md
index 4858168..b92d630 100644
--- a/tests/fixtures/profiles/PROVENANCE.md
+++ b/tests/fixtures/profiles/PROVENANCE.md
@@ -226,3 +226,15 @@ bash scripts/check-provenance.sh
 | classifier/scientific_paper/48.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | fcb2d43e4aeeeb3fa87741667bd5a086582a9427d5546898264a87b89f1b3d7a | Synthetic scientific_paper test data |
 | classifier/scientific_paper/49.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | 4e557da27f89a94386e62201eca8d4468ac4da882f7c9a46f2034312f0908f7c | Synthetic scientific_paper test data |
 | classifier/scientific_paper/50.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-17 | 1b4111e80b01ae70bb2f8aac910adc866d188cef406aedad487fcdcaed477308 | Synthetic scientific_paper test data |
+| malformed/corrupt_xref.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 48977100af674feeaea80e4f0a0a45bf576a406286e0123c78e12cc6fce38ff3 | Synthetic malformed PDF for testing xref corruption handling |
+| malformed/circular_ref.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | eafbbd82100c0f838b76df5956b606b12513df9725b2a16674ca4c81435a6d45 | Synthetic malformed PDF for testing circular reference handling |
+| malformed/stream_bomb.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | a1d5df84d9a9476f65ba26213fbf9d6402a7876471bc198307c46d28171844ee | Synthetic malformed PDF for testing malicious stream handling |
+| malformed/empty.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | e5c62df5dab5c87b6a015ef3d43597074d1eec433b15f51aec63b8582d0e4ab4 | Synthetic malformed PDF for testing empty file handling |
+| malformed/malformed_array.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 6991b678c7cdc514beba4f53fe5073807432db0a14ee3756a19c0e4b2bc5ab52 | Synthetic malformed PDF for testing malformed array handling |
+| malformed/malformed_dictionary.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 48e54bf83495348af43e7ea2f7fcd81266f9b8720cfd416dd3cb6ff03331b225 | Synthetic malformed PDF for testing malformed dictionary handling |
+| malformed/malformed_hex_string.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | e015db71d5c307d2c5861e88e5df543b4cca6c37df40a6c6fa0e8c443a2cffc9 | Synthetic malformed PDF for testing malformed hex string handling |
+| malformed/malformed_indirect.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 647cf4e160604dd29b04e933f4d3d2ea9c589980bdebc0a002dbb33afb78b06e | Synthetic malformed PDF for testing malformed indirect reference handling |
+| malformed/malformed_name.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 6a4a6ea84eccc320e60ee5a9d5b2c3f00205ee45073ba962712042170bb19c7d | Synthetic malformed PDF for testing malformed name handling |
+| malformed/malformed_stream.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 1920f2615fe6a366a6ff8b266334fdc373aa909d7316348034814a10957f7ae2 | Synthetic malformed PDF for testing malformed stream handling |
+| malformed/malformed_string.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | aea022c9d186f27ae4800a890da933cd85db73937eccb7511183742fbec4d3d8 | Synthetic malformed PDF for testing malformed string handling |
+| malformed/overflow_numbers.pdf | scripts/generate_test_corpus.py | MIT-0 | 2026-05-20 | 57eb3b34bd7ee864495f849956dc27ba2fa6de875a30b973e45170fb4008046c | Synthetic malformed PDF for testing numeric overflow handling |