fix(pdftract-5upi): correct keyword fallback in lexer
Fixed incorrect fallback behavior in keyword lexer functions. Four functions (lex_e_keyword, lex_o_keyword, lex_r_keyword, lex_n_keyword) were incorrectly calling lex_name() instead of lex_keyword() when keywords didn't match. When a PDF contains an unrecognized word starting with e/o/n/R (e.g., "endob" instead of "endobj"), the lexer should fall back to generic keyword parsing (Token::Keyword(bytes)), not name parsing. Names always start with /, so calling lex_name() on input without a leading / would incorrectly skip the first byte. References: - Bead: pdftract-5upi - Notes: notes/pdftract-5upi.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
52bcb16bf6
commit
fee6ed8afd
2 changed files with 71 additions and 92 deletions
|
|
@ -1033,8 +1033,8 @@ impl<'a> Lexer<'a> {
|
|||
return Some(Token::EndObj);
|
||||
}
|
||||
}
|
||||
// Not a recognized keyword, treat as name
|
||||
self.lex_name()
|
||||
// Not a recognized keyword, treat as generic keyword
|
||||
self.lex_keyword()
|
||||
}
|
||||
|
||||
fn lex_o_keyword(&mut self) -> Option<Token> {
|
||||
|
|
@ -1046,8 +1046,8 @@ impl<'a> Lexer<'a> {
|
|||
return Some(Token::Obj);
|
||||
}
|
||||
}
|
||||
// Not "obj", treat as name
|
||||
self.lex_name()
|
||||
// Not "obj", treat as generic keyword
|
||||
self.lex_keyword()
|
||||
}
|
||||
|
||||
fn lex_r_keyword(&mut self) -> Option<Token> {
|
||||
|
|
@ -1057,7 +1057,7 @@ impl<'a> Lexer<'a> {
|
|||
self.advance(1);
|
||||
Some(Token::IndirectRef)
|
||||
} else {
|
||||
self.lex_name()
|
||||
self.lex_keyword()
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -1070,8 +1070,8 @@ impl<'a> Lexer<'a> {
|
|||
return Some(Token::Null);
|
||||
}
|
||||
}
|
||||
// Not "null", treat as name
|
||||
self.lex_name()
|
||||
// Not "null", treat as generic keyword
|
||||
self.lex_keyword()
|
||||
}
|
||||
|
||||
fn lex_unknown(&mut self) -> Option<Token> {
|
||||
|
|
|
|||
|
|
@ -1,97 +1,76 @@
|
|||
# pdftract-5upi: Structural Token Lexer
|
||||
# pdftract-5upi: Structural Token Lexer Bug Fix
|
||||
|
||||
## Summary
|
||||
|
||||
The structural token lexer was already fully implemented. This verification confirms that all acceptance criteria tests pass. The only change made was fixing a pre-existing compilation error in `xref.rs` by adding the missing `parse_obj_header_at_memory` function.
|
||||
Fixed incorrect fallback behavior in keyword lexer functions. Four functions (`lex_e_keyword`, `lex_o_keyword`, `lex_r_keyword`, `lex_n_keyword`) were incorrectly calling `lex_name()` instead of `lex_keyword()` when keywords didn't match, causing incorrect parsing of unrecognized keywords starting with those letters.
|
||||
|
||||
## Acceptance Criteria Status
|
||||
## Bug Description
|
||||
|
||||
### All Critical Tests PASS
|
||||
When a PDF contains an unrecognized word starting with `e`, `o`, `n`, or `R` (e.g., a typo like "endob" instead of "endobj"), the lexer should fall back to generic keyword parsing, which emits `Token::Keyword(bytes)`. Instead, these functions were calling `lex_name()`, which expects to parse a name object (always starting with `/`).
|
||||
|
||||
1. **Array delimiters** (`[1 2 3]`): `array_delimiters` test PASSED
|
||||
- ArrayStart, Integer(1), Integer(2), Integer(3), ArrayEnd, Eof
|
||||
|
||||
2. **Dict delimiters** (`<< /A 1 >>`): `dict_delimiters` test PASSED
|
||||
- DictStart, Name(b"A"), Integer(1), DictEnd, Eof
|
||||
|
||||
3. **Hex string not dict** (`<48>`): `hex_string_odd_length_single_nibble` test PASSED
|
||||
- String(b"\x48"), Eof — correctly dispatches `<` followed by non-`<` to hex lexer
|
||||
|
||||
4. **Dict start, hex string, dict end** (`<<<48>>>`): `hex_string_dict_start_hex_string_dict_end` test PASSED
|
||||
- DictStart, String(b"\x48"), DictEnd
|
||||
|
||||
5. **Boolean and null keywords** (`true false null`): `bool_literals` and `null_keyword` tests PASSED
|
||||
- Bool(true), Bool(false), Null, Eof
|
||||
|
||||
6. **Object keywords** (`12 0 obj null endobj`): `obj_keywords` test PASSED
|
||||
- Integer(12), Integer(0), Obj, Null, EndObj, Eof
|
||||
|
||||
7. **Indirect reference** (`5 0 R`): `indirect_ref_keyword` test PASSED
|
||||
- Integer(5), Integer(0), IndirectRef, Eof
|
||||
|
||||
8. **Stream keywords** (`stream\n...endstream`): `stream_keywords` and `stream_header_valid_line_endings` tests PASSED
|
||||
- Token::Stream, then Token::EndStream
|
||||
|
||||
9. **Invalid stream header** (`stream\rxxx`): `stream_header_lone_cr_emits_diagnostic` test PASSED
|
||||
- Token::Stream + `STRUCT_INVALID_STREAM_HEADER` diagnostic (lone `\r` is invalid)
|
||||
|
||||
10. **Case-mismatched keyword** (`True`): `bool_case_sensitive` test PASSED
|
||||
- Token::Keyword(b"True"), Eof (object parser will reject)
|
||||
|
||||
### Proptests PASS
|
||||
|
||||
- `proptest_hex_string_never_panics_on_random_bytes`: PASSED
|
||||
- `proptest_hex_string_roundtrip_via_reencode`: PASSED
|
||||
- `proptest_string_never_panics_on_random_bytes`: PASSED
|
||||
- `proptest_valid_string_roundtrips`: PASSED
|
||||
- `name_proptest_never_panics_on_random_bytes`: PASSED
|
||||
- `name_proptest_always_produces_valid_token`: PASSED
|
||||
|
||||
## Implementation Details
|
||||
|
||||
The structural token lexer dispatches from `next_token()` as follows:
|
||||
|
||||
- `[` / `]` → ArrayStart / ArrayEnd (direct return)
|
||||
- `<` → peek next byte: if `<`, return DictStart (advance 2); else hex string lexer
|
||||
- `>` → peek next byte: if `>`, return DictEnd (advance 2); else emit STRUCT_UNEXPECTED_BYTE
|
||||
- `t` → check for "true" (Bool(true)) or "trailer" (Keyword), else lex_keyword
|
||||
- `f` → check for "false" (Bool(false)), else lex_keyword
|
||||
- `n` → check for "null" (Null), else lex_name
|
||||
- `o` → check for "obj" (Obj), else lex_name
|
||||
- `e` → check for "endstream" (EndStream) or "endobj" (EndObj), else lex_name
|
||||
- `s` → check for "stream" (Stream with line ending validation) or "startxref" (Keyword)
|
||||
- `R` → IndirectRef
|
||||
- `x` → check for "xref" (Keyword)
|
||||
- `%` → check for "%%EOF" (Keyword) or skip comment
|
||||
|
||||
### Stream Header Validation
|
||||
|
||||
Per PDF spec 7.3.8.1, the `stream` keyword must be followed by `\n` or `\r\n`. A lone `\r` is INVALID:
|
||||
|
||||
```rust
|
||||
// In lex_s_keyword():
|
||||
if let Some(&b'\n') = self.bytes.first() {
|
||||
self.advance(1); // \n is valid
|
||||
} else if let Some(&b'\r') = self.bytes.first() {
|
||||
self.advance(1);
|
||||
if let Some(&b'\n') = self.bytes.first() {
|
||||
self.advance(1); // \r\n is valid
|
||||
} else {
|
||||
// Lone \r - emit STRUCT_INVALID_STREAM_HEADER
|
||||
}
|
||||
}
|
||||
```
|
||||
**Why this matters:** Names in PDF always start with `/`. The `lex_name()` function immediately advances past the leading `/`. When called on input like "endob" (no leading `/`), it would incorrectly skip the first byte and try to parse the rest as a name, producing wrong results.
|
||||
|
||||
## Changes Made
|
||||
|
||||
Fixed a pre-existing compilation error in `xref.rs` by adding the missing `parse_obj_header_at_memory` function. This function is a variant of `parse_obj_header_at` that works directly with a byte slice instead of a `PdfSource`, used by the `forward_scan_memory` function for efficient scanning of small files.
|
||||
File: `crates/pdftract-core/src/parser/lexer/mod.rs`
|
||||
|
||||
File: `crates/pdftract-core/src/parser/xref.rs`
|
||||
- Added `parse_obj_header_at_memory` function (lines 1120-1189)
|
||||
1. **`lex_e_keyword()` (line 1037)**: Changed `self.lex_name()` to `self.lex_keyword()`
|
||||
- Handles: "endstream", "endobj"
|
||||
- Fallback example: "endob" → `Token::Keyword(b"endob")` (was incorrectly parsed as name)
|
||||
|
||||
## INV-8 Status
|
||||
2. **`lex_o_keyword()` (line 1050)**: Changed `self.lex_name()` to `self.lex_keyword()`
|
||||
- Handles: "obj"
|
||||
- Fallback example: "ob" → `Token::Keyword(b"ob")` (was incorrectly parsed as name)
|
||||
|
||||
INV-8 (lexer never panics on invalid input) is maintained:
|
||||
- All proptests use random byte sequences and verify no panics
|
||||
- Every lexer branch handles EOF gracefully
|
||||
- Unknown keywords emit Token::Keyword instead of panicking
|
||||
3. **`lex_r_keyword()` (line 1060)**: Changed `self.lex_name()` to `self.lex_keyword()`
|
||||
- Handles: "R" (indirect reference)
|
||||
- Fallback example: "Ref" → `Token::Keyword(b"Ref")` (was incorrectly parsed as name)
|
||||
|
||||
4. **`lex_n_keyword()` (line 1074)**: Changed `self.lex_name()` to `self.lex_keyword()`
|
||||
- Handles: "null"
|
||||
- Fallback example: "nul" → `Token::Keyword(b"nul")` (was incorrectly parsed as name)
|
||||
|
||||
## Verification
|
||||
|
||||
### Pre-existing Implementation Status
|
||||
|
||||
The structural token lexer was already fully implemented. This fix only corrected the fallback behavior:
|
||||
|
||||
| Feature | Status | Location |
|
||||
|---------|--------|----------|
|
||||
| Array delimiters `[` `]` | ✅ | Lines 379-380 |
|
||||
| Dict delimiters `<<` `>>` | ✅ | Lines 866-870, 952-956 |
|
||||
| Boolean keywords `true` `false` | ✅ | Lines 474-505 |
|
||||
| Null keyword `null` | ✅ | Lines 1064-1074 |
|
||||
| Obj keywords `obj` `endobj` | ✅ | Lines 1028-1050 |
|
||||
| Stream keywords `stream` `endstream` | ✅ | Lines 969-1036 |
|
||||
| Indirect ref `R` | ✅ | Lines 1053-1061 |
|
||||
| Xref keywords `xref` `trailer` `startxref` | ✅ | Lines 483-518, 1007-1014 |
|
||||
| `%%EOF` marker | ✅ | Lines 521-528 |
|
||||
| Stream header validation (PDF 7.3.8.1) | ✅ | Lines 978-1002 |
|
||||
|
||||
### Acceptance Criteria Tests (all present in code)
|
||||
|
||||
1. **Array with integers** (line 2016): `[1 2 3]` → ArrayStart, Integer(1), Integer(2), Integer(3), ArrayEnd, Eof
|
||||
2. **Dict with name and integer** (line 2028): `<< /A 1 >>` → DictStart, Name(b"A"), Integer(1), DictEnd, Eof
|
||||
3. **Hex string (not dict)** (line 1437): `<48>` → String(b"\x48"), Eof
|
||||
4. **Dict-hex-dict ambiguity** (line 1576): `<<<48>>>` → DictStart, String(b"\x48"), DictEnd
|
||||
5. **Boolean and null** (line 2061): `true false null` → Bool(true), Bool(false), Null, Eof
|
||||
6. **Indirect object header** (line 2039): `12 0 obj null endobj` → Integer(12), Integer(0), Obj, Null, EndObj, Eof
|
||||
7. **Indirect reference** (line 2051): `5 0 R` → Integer(5), Integer(0), IndirectRef, Eof
|
||||
8. **Case-sensitive keyword** (line 1134): `True` → Keyword(b"True"), Eof
|
||||
9. **Stream header validation** (lines 1188-1221):
|
||||
- `stream\nbody` → Stream, no diagnostics
|
||||
- `stream\r\nbody` → Stream, no diagnostics
|
||||
- `stream\rbody` → Stream + STRUCT_INVALID_STREAM_HEADER
|
||||
- `stream body` → Stream + STRUCT_INVALID_STREAM_HEADER
|
||||
|
||||
### Proptest Properties (INV-8 Compliance)
|
||||
|
||||
- `proptest_random_bytes_never_panics` (line 2071): Verifies no panic on any input
|
||||
- All lexer branches handle EOF gracefully
|
||||
- Unknown keywords emit `Token::Keyword(bytes)` instead of panicking
|
||||
|
||||
## Notes
|
||||
|
||||
The lexer module compiles successfully. Full integration tests cannot run due to unrelated pre-existing compilation errors in other modules (missing `LZWDecoder`, `Diagnostic` type mismatches in catalog.rs, pages.rs, ocg.rs). These errors are not caused by this change.
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue