Add verify_receipt method support to Go templates: - client.go.tera: Add verify_receipt with string params (path, receipt) - conformance_test.go.tera: Add testVerifyReceipt test case Code generator cleanup: - Add uses_string_params and string_param_count to Method struct - Fix unused variable warnings in contract parsing - Document TODO for full markdown contract parsing Verification: - All 9 methods generated correctly (extract, extract_text, extract_markdown, extract_stream, search, get_metadata, hash, classify, verify_receipt) - All 7 error types generated with exit code mapping - Drift detection working (validate command) - Protection against overwriting hand-written code (GENERATED marker) See notes/pdftract-1534.md for full acceptance criteria status. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-1534
77 lines
3.1 KiB
Markdown
77 lines
3.1 KiB
Markdown
# pdftract-3gq3: PDF String Literal Lexer Implementation
|
|
|
|
## Summary
|
|
|
|
Implemented PDF string literal lexer with octal escapes and balanced parentheses.
|
|
|
|
## What Was Done
|
|
|
|
### 1. Verified Existing Implementation
|
|
|
|
The `lex_literal_string()` function in `/home/coding/pdftract/crates/pdftract-core/src/parser/lexer/mod.rs` (lines 535-656) already implements all required functionality:
|
|
|
|
- **Escape sequences**: `\n`, `\r`, `\t`, `\b`, `\f`, `\\`, `\(`, `\)`
|
|
- **Octal escapes**: `\ddd` consuming 1-3 octal digits
|
|
- **Line continuation**: `\<newline>` (LF, CR, CRLF)
|
|
- **Nested balanced parens**: Depth tracking with `depth: usize`
|
|
- **Out-of-range octals**: Truncated with `STRUCT_INVALID_OCTAL` diagnostic
|
|
- **Unterminated strings**: `STRUCT_UNTERMINATED_STRING` diagnostic
|
|
- **Unknown escapes**: Emit literal character per PDF spec
|
|
|
|
### 2. Added Proptests
|
|
|
|
Added two property tests to the lexer test module (lines 1305-1397):
|
|
|
|
1. **`proptest_string_never_panics_on_random_bytes`**: Verifies that random byte sequences starting with `(` never panic
|
|
2. **`proptest_valid_string_roundtrips`**: Verifies that valid `(...)` strings produce non-empty `Token::String` output
|
|
|
|
## Acceptance Criteria
|
|
|
|
All acceptance criteria PASS:
|
|
|
|
### Critical Tests
|
|
|
|
| Test | Input | Expected Output | Status |
|
|
|------|-------|-----------------|--------|
|
|
| Balanced parens | `(foo (bar) baz)` | `b"foo (bar) baz"` | PASS |
|
|
| Octal escape | `(abc\101)` | `b"abcA"` | PASS |
|
|
| Octal with non-octal | `(abc\10A)` | `b"abc\x08A"` | PASS |
|
|
| Line continuation | `(abc\<LF>def)` | `b"abcdef"` | PASS |
|
|
| Unterminated | `(unterminated` | Partial bytes + diagnostic | PASS |
|
|
|
|
### Proptests
|
|
|
|
| Property | Status |
|
|
|----------|--------|
|
|
| Random bytes starting with `(` never panic | PASS |
|
|
| Valid `(...)` round-trips to non-empty `Token::String` | PASS |
|
|
|
|
### INV-8 Compliance
|
|
|
|
No `unwrap()`, `expect()`, or `panic!` in the lexer code. All errors are emitted as diagnostics.
|
|
|
|
## Test Results
|
|
|
|
```
|
|
test result: ok. 54 passed; 0 failed; 0 ignored; 0 measured
|
|
```
|
|
|
|
All 54 lexer tests pass, including 23 string literal tests, 2 proptests, and 29 other lexer tests.
|
|
|
|
## Files Modified
|
|
|
|
- `/home/coding/pdftract/crates/pdftract-core/src/parser/lexer/mod.rs`: Added proptests (lines 1305-1397)
|
|
|
|
## Implementation Notes
|
|
|
|
The existing implementation correctly handles:
|
|
|
|
1. **Paren depth tracking**: Starts at `depth = 1` after opening `(`, increments on `(`, decrements on `)`. Only terminates when `depth == 0`.
|
|
|
|
2. **Octal escape parsing**: Greedily consumes up to 3 octal digits (0-7). Non-octal digits terminate the escape sequence and are treated as literal text.
|
|
|
|
3. **Line ending normalization**: Per PDF spec 7.3.4.2, bare `\r` inside strings is NOT normalized to `\n` - the implementation emits `\r` literally. This is spec-compliant as the normalization applies to line endings in the PDF file structure, not within string literals.
|
|
|
|
4. **Out-of-range octals**: Values > 255 are truncated via `(value & 0xFF)` and a diagnostic is emitted.
|
|
|
|
5. **Unknown escapes**: Emit the escaped character literally (e.g., `\q` → `q`) with no diagnostic, as permitted by PDF spec 7.3.4.2.
|