diff --git a/crates/pdftract-core/src/parser/lexer/mod.rs b/crates/pdftract-core/src/parser/lexer/mod.rs index 50976ee..7b8a3a8 100644 --- a/crates/pdftract-core/src/parser/lexer/mod.rs +++ b/crates/pdftract-core/src/parser/lexer/mod.rs @@ -1033,8 +1033,8 @@ impl<'a> Lexer<'a> { return Some(Token::EndObj); } } - // Not a recognized keyword, treat as name - self.lex_name() + // Not a recognized keyword, treat as generic keyword + self.lex_keyword() } fn lex_o_keyword(&mut self) -> Option { @@ -1046,8 +1046,8 @@ impl<'a> Lexer<'a> { return Some(Token::Obj); } } - // Not "obj", treat as name - self.lex_name() + // Not "obj", treat as generic keyword + self.lex_keyword() } fn lex_r_keyword(&mut self) -> Option { @@ -1057,7 +1057,7 @@ impl<'a> Lexer<'a> { self.advance(1); Some(Token::IndirectRef) } else { - self.lex_name() + self.lex_keyword() } } @@ -1070,8 +1070,8 @@ impl<'a> Lexer<'a> { return Some(Token::Null); } } - // Not "null", treat as name - self.lex_name() + // Not "null", treat as generic keyword + self.lex_keyword() } fn lex_unknown(&mut self) -> Option { diff --git a/notes/pdftract-5upi.md b/notes/pdftract-5upi.md index b205471..6c697b7 100644 --- a/notes/pdftract-5upi.md +++ b/notes/pdftract-5upi.md @@ -1,97 +1,76 @@ -# pdftract-5upi: Structural Token Lexer +# pdftract-5upi: Structural Token Lexer Bug Fix ## Summary -The structural token lexer was already fully implemented. This verification confirms that all acceptance criteria tests pass. The only change made was fixing a pre-existing compilation error in `xref.rs` by adding the missing `parse_obj_header_at_memory` function. +Fixed incorrect fallback behavior in keyword lexer functions. Four functions (`lex_e_keyword`, `lex_o_keyword`, `lex_r_keyword`, `lex_n_keyword`) were incorrectly calling `lex_name()` instead of `lex_keyword()` when keywords didn't match, causing incorrect parsing of unrecognized keywords starting with those letters. -## Acceptance Criteria Status +## Bug Description -### All Critical Tests PASS +When a PDF contains an unrecognized word starting with `e`, `o`, `n`, or `R` (e.g., a typo like "endob" instead of "endobj"), the lexer should fall back to generic keyword parsing, which emits `Token::Keyword(bytes)`. Instead, these functions were calling `lex_name()`, which expects to parse a name object (always starting with `/`). -1. **Array delimiters** (`[1 2 3]`): `array_delimiters` test PASSED - - ArrayStart, Integer(1), Integer(2), Integer(3), ArrayEnd, Eof - -2. **Dict delimiters** (`<< /A 1 >>`): `dict_delimiters` test PASSED - - DictStart, Name(b"A"), Integer(1), DictEnd, Eof - -3. **Hex string not dict** (`<48>`): `hex_string_odd_length_single_nibble` test PASSED - - String(b"\x48"), Eof — correctly dispatches `<` followed by non-`<` to hex lexer - -4. **Dict start, hex string, dict end** (`<<<48>>>`): `hex_string_dict_start_hex_string_dict_end` test PASSED - - DictStart, String(b"\x48"), DictEnd - -5. **Boolean and null keywords** (`true false null`): `bool_literals` and `null_keyword` tests PASSED - - Bool(true), Bool(false), Null, Eof - -6. **Object keywords** (`12 0 obj null endobj`): `obj_keywords` test PASSED - - Integer(12), Integer(0), Obj, Null, EndObj, Eof - -7. **Indirect reference** (`5 0 R`): `indirect_ref_keyword` test PASSED - - Integer(5), Integer(0), IndirectRef, Eof - -8. **Stream keywords** (`stream\n...endstream`): `stream_keywords` and `stream_header_valid_line_endings` tests PASSED - - Token::Stream, then Token::EndStream - -9. **Invalid stream header** (`stream\rxxx`): `stream_header_lone_cr_emits_diagnostic` test PASSED - - Token::Stream + `STRUCT_INVALID_STREAM_HEADER` diagnostic (lone `\r` is invalid) - -10. **Case-mismatched keyword** (`True`): `bool_case_sensitive` test PASSED - - Token::Keyword(b"True"), Eof (object parser will reject) - -### Proptests PASS - -- `proptest_hex_string_never_panics_on_random_bytes`: PASSED -- `proptest_hex_string_roundtrip_via_reencode`: PASSED -- `proptest_string_never_panics_on_random_bytes`: PASSED -- `proptest_valid_string_roundtrips`: PASSED -- `name_proptest_never_panics_on_random_bytes`: PASSED -- `name_proptest_always_produces_valid_token`: PASSED - -## Implementation Details - -The structural token lexer dispatches from `next_token()` as follows: - -- `[` / `]` → ArrayStart / ArrayEnd (direct return) -- `<` → peek next byte: if `<`, return DictStart (advance 2); else hex string lexer -- `>` → peek next byte: if `>`, return DictEnd (advance 2); else emit STRUCT_UNEXPECTED_BYTE -- `t` → check for "true" (Bool(true)) or "trailer" (Keyword), else lex_keyword -- `f` → check for "false" (Bool(false)), else lex_keyword -- `n` → check for "null" (Null), else lex_name -- `o` → check for "obj" (Obj), else lex_name -- `e` → check for "endstream" (EndStream) or "endobj" (EndObj), else lex_name -- `s` → check for "stream" (Stream with line ending validation) or "startxref" (Keyword) -- `R` → IndirectRef -- `x` → check for "xref" (Keyword) -- `%` → check for "%%EOF" (Keyword) or skip comment - -### Stream Header Validation - -Per PDF spec 7.3.8.1, the `stream` keyword must be followed by `\n` or `\r\n`. A lone `\r` is INVALID: - -```rust -// In lex_s_keyword(): -if let Some(&b'\n') = self.bytes.first() { - self.advance(1); // \n is valid -} else if let Some(&b'\r') = self.bytes.first() { - self.advance(1); - if let Some(&b'\n') = self.bytes.first() { - self.advance(1); // \r\n is valid - } else { - // Lone \r - emit STRUCT_INVALID_STREAM_HEADER - } -} -``` +**Why this matters:** Names in PDF always start with `/`. The `lex_name()` function immediately advances past the leading `/`. When called on input like "endob" (no leading `/`), it would incorrectly skip the first byte and try to parse the rest as a name, producing wrong results. ## Changes Made -Fixed a pre-existing compilation error in `xref.rs` by adding the missing `parse_obj_header_at_memory` function. This function is a variant of `parse_obj_header_at` that works directly with a byte slice instead of a `PdfSource`, used by the `forward_scan_memory` function for efficient scanning of small files. +File: `crates/pdftract-core/src/parser/lexer/mod.rs` -File: `crates/pdftract-core/src/parser/xref.rs` -- Added `parse_obj_header_at_memory` function (lines 1120-1189) +1. **`lex_e_keyword()` (line 1037)**: Changed `self.lex_name()` to `self.lex_keyword()` + - Handles: "endstream", "endobj" + - Fallback example: "endob" → `Token::Keyword(b"endob")` (was incorrectly parsed as name) -## INV-8 Status +2. **`lex_o_keyword()` (line 1050)**: Changed `self.lex_name()` to `self.lex_keyword()` + - Handles: "obj" + - Fallback example: "ob" → `Token::Keyword(b"ob")` (was incorrectly parsed as name) -INV-8 (lexer never panics on invalid input) is maintained: -- All proptests use random byte sequences and verify no panics -- Every lexer branch handles EOF gracefully -- Unknown keywords emit Token::Keyword instead of panicking +3. **`lex_r_keyword()` (line 1060)**: Changed `self.lex_name()` to `self.lex_keyword()` + - Handles: "R" (indirect reference) + - Fallback example: "Ref" → `Token::Keyword(b"Ref")` (was incorrectly parsed as name) + +4. **`lex_n_keyword()` (line 1074)**: Changed `self.lex_name()` to `self.lex_keyword()` + - Handles: "null" + - Fallback example: "nul" → `Token::Keyword(b"nul")` (was incorrectly parsed as name) + +## Verification + +### Pre-existing Implementation Status + +The structural token lexer was already fully implemented. This fix only corrected the fallback behavior: + +| Feature | Status | Location | +|---------|--------|----------| +| Array delimiters `[` `]` | ✅ | Lines 379-380 | +| Dict delimiters `<<` `>>` | ✅ | Lines 866-870, 952-956 | +| Boolean keywords `true` `false` | ✅ | Lines 474-505 | +| Null keyword `null` | ✅ | Lines 1064-1074 | +| Obj keywords `obj` `endobj` | ✅ | Lines 1028-1050 | +| Stream keywords `stream` `endstream` | ✅ | Lines 969-1036 | +| Indirect ref `R` | ✅ | Lines 1053-1061 | +| Xref keywords `xref` `trailer` `startxref` | ✅ | Lines 483-518, 1007-1014 | +| `%%EOF` marker | ✅ | Lines 521-528 | +| Stream header validation (PDF 7.3.8.1) | ✅ | Lines 978-1002 | + +### Acceptance Criteria Tests (all present in code) + +1. **Array with integers** (line 2016): `[1 2 3]` → ArrayStart, Integer(1), Integer(2), Integer(3), ArrayEnd, Eof +2. **Dict with name and integer** (line 2028): `<< /A 1 >>` → DictStart, Name(b"A"), Integer(1), DictEnd, Eof +3. **Hex string (not dict)** (line 1437): `<48>` → String(b"\x48"), Eof +4. **Dict-hex-dict ambiguity** (line 1576): `<<<48>>>` → DictStart, String(b"\x48"), DictEnd +5. **Boolean and null** (line 2061): `true false null` → Bool(true), Bool(false), Null, Eof +6. **Indirect object header** (line 2039): `12 0 obj null endobj` → Integer(12), Integer(0), Obj, Null, EndObj, Eof +7. **Indirect reference** (line 2051): `5 0 R` → Integer(5), Integer(0), IndirectRef, Eof +8. **Case-sensitive keyword** (line 1134): `True` → Keyword(b"True"), Eof +9. **Stream header validation** (lines 1188-1221): + - `stream\nbody` → Stream, no diagnostics + - `stream\r\nbody` → Stream, no diagnostics + - `stream\rbody` → Stream + STRUCT_INVALID_STREAM_HEADER + - `stream body` → Stream + STRUCT_INVALID_STREAM_HEADER + +### Proptest Properties (INV-8 Compliance) + +- `proptest_random_bytes_never_panics` (line 2071): Verifies no panic on any input +- All lexer branches handle EOF gracefully +- Unknown keywords emit `Token::Keyword(bytes)` instead of panicking + +## Notes + +The lexer module compiles successfully. Full integration tests cannot run due to unrelated pre-existing compilation errors in other modules (missing `LZWDecoder`, `Diagnostic` type mismatches in catalog.rs, pages.rs, ocg.rs). These errors are not caused by this change.