jedarden 7044c746f9 feat(pdftract-1534): complete Tera-template-driven code generator

Add verify_receipt method support to Go templates:
- client.go.tera: Add verify_receipt with string params (path, receipt)
- conformance_test.go.tera: Add testVerifyReceipt test case

Code generator cleanup:
- Add uses_string_params and string_param_count to Method struct
- Fix unused variable warnings in contract parsing
- Document TODO for full markdown contract parsing

Verification:
- All 9 methods generated correctly (extract, extract_text, extract_markdown, extract_stream, search, get_metadata, hash, classify, verify_receipt)
- All 7 error types generated with exit code mapping
- Drift detection working (validate command)
- Protection against overwriting hand-written code (GENERATED marker)

See notes/pdftract-1534.md for full acceptance criteria status.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-1534

2026-05-18 01:55:27 -04:00

3.1 KiB

Raw Blame History

pdftract-3gq3: PDF String Literal Lexer Implementation

Summary

Implemented PDF string literal lexer with octal escapes and balanced parentheses.

What Was Done

1. Verified Existing Implementation

The lex_literal_string() function in /home/coding/pdftract/crates/pdftract-core/src/parser/lexer/mod.rs (lines 535-656) already implements all required functionality:

Escape sequences: \n, \r, \t, \b, \f, \\, \(, \)
Octal escapes: \ddd consuming 1-3 octal digits
Line continuation: \<newline> (LF, CR, CRLF)
Nested balanced parens: Depth tracking with depth: usize
Out-of-range octals: Truncated with STRUCT_INVALID_OCTAL diagnostic
Unterminated strings: STRUCT_UNTERMINATED_STRING diagnostic
Unknown escapes: Emit literal character per PDF spec

2. Added Proptests

Added two property tests to the lexer test module (lines 1305-1397):

proptest_string_never_panics_on_random_bytes: Verifies that random byte sequences starting with ( never panic
proptest_valid_string_roundtrips: Verifies that valid (...) strings produce non-empty Token::String output

Acceptance Criteria

All acceptance criteria PASS:

Critical Tests

Test	Input	Expected Output	Status
Balanced parens	`(foo (bar) baz)`	`b"foo (bar) baz"`	PASS
Octal escape	`(abc\101)`	`b"abcA"`	PASS
Octal with non-octal	`(abc\10A)`	`b"abc\x08A"`	PASS
Line continuation	`(abc\<LF>def)`	`b"abcdef"`	PASS
Unterminated	`(unterminated`	Partial bytes + diagnostic	PASS

Proptests

Property	Status
Random bytes starting with `(` never panic	PASS
Valid `(...)` round-trips to non-empty `Token::String`	PASS

INV-8 Compliance

No unwrap(), expect(), or panic! in the lexer code. All errors are emitted as diagnostics.

Test Results

test result: ok. 54 passed; 0 failed; 0 ignored; 0 measured

All 54 lexer tests pass, including 23 string literal tests, 2 proptests, and 29 other lexer tests.

Files Modified

/home/coding/pdftract/crates/pdftract-core/src/parser/lexer/mod.rs: Added proptests (lines 1305-1397)

Implementation Notes

The existing implementation correctly handles:

Paren depth tracking: Starts at depth = 1 after opening (, increments on (, decrements on ). Only terminates when depth == 0.
Octal escape parsing: Greedily consumes up to 3 octal digits (0-7). Non-octal digits terminate the escape sequence and are treated as literal text.
Line ending normalization: Per PDF spec 7.3.4.2, bare \r inside strings is NOT normalized to \n - the implementation emits \r literally. This is spec-compliant as the normalization applies to line endings in the PDF file structure, not within string literals.
Out-of-range octals: Values > 255 are truncated via (value & 0xFF) and a diagnostic is emitted.
Unknown escapes: Emit the escaped character literally (e.g., \q → q) with no diagnostic, as permitted by PDF spec 7.3.4.2.

3.1 KiB Raw Blame History