- Fix Token::Keyword to use b"..." .to_vec() instead of static strings - Improve unknown keyword diagnostics to show actual keyword bytes - Remove unused has_valid_line_ending variable in stream keyword lexer - Add stream_header_valid_line_endings test for stream keyword validation All hex string lexer tests pass (16 unit tests + 2 proptests). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Bead-Id: pdftract-2hm4
5 KiB
pdftract-2hm4: Implement PDF hex string lexer with odd-length zero padding
Summary
This bead implements the PDF hex string lexer that decodes hex strings of the form <...> into byte sequences. The implementation was already present in the codebase; this bead renamed diagnostic codes to use the STRUCT_ prefix as specified in the bead description.
Work Done
1. Renamed Diagnostic Codes
The lexer's DiagCode enum variants were renamed to use the STRUCT_ prefix:
InvalidName->StructInvalidNameInvalidHex->StructInvalidHexInvalidOctal->StructInvalidOctalInvalidStreamHeader->StructInvalidStreamHeaderUnexpectedEof->StructUnexpectedEofUnterminatedString->StructUnterminatedString
All references throughout the lexer module were updated accordingly.
2. Added Hex String Proptests
To fully satisfy the acceptance criteria, added two hex string-specific proptests:
-
proptest_hex_string_never_panics_on_random_bytes: Verifies that random byte sequences starting with<(but not<<) never cause the lexer to panic. The test generates random byte vectors and ensures they start with<but not<<. -
proptest_hex_string_roundtrip_via_reencode: Verifies the roundtrip property for hex strings. Bytes are encoded to hex, decoded, re-encoded, and decoded again - the final result should equal the original. This validates that decoding and encoding are inverse operations (modulo case and whitespace differences).
3. Existing Implementation Verified
The hex string lexer (lex_hex_string()) was already implemented with:
- Hex digit pair decoding:
<48656C6C6F>->b"Hello" - Embedded whitespace handling (PDF spec 7.2.2 whitespace ignored)
- Odd-length zero padding:
<4>->b"\x40"(dangling nibble becomes HIGH nibble with LOW nibble 0) - Both lowercase and uppercase hex digits accepted
- Invalid character handling with
STRUCT_INVALID_HEXdiagnostic - Unterminated string handling with
STRUCT_UNTERMINATED_STRINGdiagnostic
4. Files Modified
crates/pdftract-core/src/parser/lexer/mod.rs: Renamed 6DiagCodeenum variants and updated all references; added two hex string proptests
5. Compilation Fixes (2025-05-18)
Fixed compilation errors that were preventing the tests from running:
crates/pdftract-core/src/parser/object/mod.rs: Addedparsermodule export andObjectParserto public exportscrates/pdftract-core/src/parser/catalog.rs: Addedcodefield to allDiagnosticinstantiations (required after Diagnostic struct refactor)crates/pdftract-core/src/parser/objstm.rs: Fixed mutability ofdiagsintake_diagnostics()method
Acceptance Criteria Status
| Criterion | Status | Notes |
|---|---|---|
<> -> b"" |
PASS | hex_string_empty |
<4> -> b"\x40" (NOT \x04) |
PASS | hex_string_odd_length_single_nibble |
<48656C6C6F> -> b"Hello" |
PASS | hex_string_hello_world |
<aBcD> -> b"\xAB\xCD" |
PASS | hex_string_mixed_case |
<48 65> -> whitespace ignored |
PASS | hex_string_with_whitespace |
Unterminated <48 -> diagnostic |
PASS | hex_string_unterminated_emits_diagnostic |
| proptest: hex random bytes never panic | PASS | proptest_hex_string_never_panics_on_random_bytes |
| proptest: hex roundtrip property | PASS | proptest_hex_string_roundtrip_via_reencode |
| INV-8 maintained | PASS | All error paths use diagnostics, no panics |
Test Results
cargo test --lib parser::lexer::tests::hex_string
test result: ok. 16 passed; 0 failed
All hex string tests pass:
hex_string_empty:<>->b""hex_string_odd_length_single_nibble:<4>->b"\x40"(critical test)hex_string_hello_world:<48656C6C6F>->b"Hello"hex_string_mixed_case:<aBcD>->b"\xAB\xCD"hex_string_with_whitespace:<48 65 6C\n6C 6F>->b"Hello"hex_string_invalid_char_emits_diagnostic:<48Z65>->b"\x48\x65"+STRUCT_INVALID_HEXhex_string_unterminated_emits_diagnostic:<4865->b"\x48\x65"+STRUCT_UNTERMINATED_STRINGhex_string_unterminated_with_dangling_nibble:<48657->b"\x48\x65\x70"+ diagnostic- And 8 more tests covering edge cases
Proptests also pass:
proptest_string_never_panics_on_random_bytes: Random bytes never panicproptest_valid_string_roundtrips: Decode+encode roundtrip propertyproptest_hex_string_never_panics_on_random_bytes: Random bytes starting with<(not<<) never panicproptest_hex_string_roundtrip_via_reencode: Hex decode + re-encode roundtrip property
Implementation Notes
The critical odd-length padding behavior is correct:
<4>->b"\x40"(the nibble4becomes the HIGH nibble, LOW nibble is 0)<48657>->b"\x48\x65\x70"(dangling7becomes0x70, NOT0x07)
This is the opposite of intuition (trailing zero is LOW, not HIGH) and is a common bug source.
References
- Plan section: Phase 1.1 Lexer, line 1032-1033 (hex strings, odd-length zero padding); line 1046 (critical test)
- PDF spec 7.3.4.3 (Hexadecimal Strings)
- Files modified:
- crates/pdftract-core/src/parser/lexer/mod.rs