pdftract-5upi: Structural Token Lexer Bug Fix
Summary
Fixed the fallback path in four keyword lexer functions (lex_e_keyword, lex_o_keyword, lex_r_keyword, lex_n_keyword). When the input did not match a known keyword, each function called lex_name() instead of lex_keyword(), so unrecognized keywords starting with those letters were mis-parsed as names.
Bug Description
When a PDF contains an unrecognized word starting with e, o, n, or R (e.g., a typo like "endob" instead of "endobj"), the lexer should fall back to generic keyword parsing, which emits Token::Keyword(bytes). Instead, these functions were calling lex_name(), which expects to parse a name object (always starting with /).
Why this matters: Names in PDF always start with /. The lex_name() function immediately advances past the leading /. When called on input like "endob" (no leading /), it would incorrectly skip the first byte and try to parse the rest as a name, producing wrong results.
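The mismatch can be illustrated with a minimal sketch. The Token enum and the two free functions below are simplified stand-ins, not the actual pdftract-core code; the boundary set and the slice-based signatures are assumptions made for illustration.

```rust
// Simplified stand-ins for illustration only; not the pdftract-core types.
#[derive(Debug, PartialEq)]
enum Token {
    Name(Vec<u8>),
    Keyword(Vec<u8>),
}

// PDF whitespace and delimiter bytes end both names and keywords.
fn is_boundary(b: u8) -> bool {
    matches!(
        b,
        b' ' | b'\t' | b'\r' | b'\n' | 0x0C | 0x00
            | b'/' | b'[' | b']' | b'<' | b'>' | b'(' | b')' | b'{' | b'}' | b'%'
    )
}

// Assumes the current byte is the leading '/', so it skips one byte
// unconditionally before collecting the name body.
fn lex_name(input: &[u8]) -> Token {
    let body = if input.is_empty() { input } else { &input[1..] };
    let end = body.iter().position(|&b| is_boundary(b)).unwrap_or(body.len());
    Token::Name(body[..end].to_vec())
}

// Collects the run of regular bytes starting at the current byte.
fn lex_keyword(input: &[u8]) -> Token {
    let end = input.iter().position(|&b| is_boundary(b)).unwrap_or(input.len());
    Token::Keyword(input[..end].to_vec())
}
```

With these stand-ins, lex_name(b"endob") returns Name(b"ndob"), silently dropping the first byte, while lex_keyword(b"endob") returns Keyword(b"endob").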
Changes Made
File: crates/pdftract-core/src/parser/lexer/mod.rs
- lex_e_keyword() (line 1037): Changed self.lex_name() to self.lex_keyword()
  - Handles: "endstream", "endobj"
  - Fallback example: "endob" → Token::Keyword(b"endob") (was incorrectly parsed as name)
- lex_o_keyword() (line 1050): Changed self.lex_name() to self.lex_keyword()
  - Handles: "obj"
  - Fallback example: "ob" → Token::Keyword(b"ob") (was incorrectly parsed as name)
- lex_r_keyword() (line 1060): Changed self.lex_name() to self.lex_keyword()
  - Handles: "R" (indirect reference)
  - Fallback example: "Ref" → Token::Keyword(b"Ref") (was incorrectly parsed as name)
- lex_n_keyword() (line 1074): Changed self.lex_name() to self.lex_keyword()
  - Handles: "null"
  - Fallback example: "nul" → Token::Keyword(b"nul") (was incorrectly parsed as name)
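The corrected dispatch shape might look roughly like the following. This is a hypothetical, reduced lexer: the method names lex_e_keyword and lex_keyword come from this document, but the struct layout, the word_len helper, and the token variants are assumptions and will differ from the real mod.rs.

```rust
// Hypothetical, reduced lexer for illustration; not the actual mod.rs code.
#[derive(Debug, PartialEq)]
enum Token {
    EndStream,
    EndObj,
    Keyword(Vec<u8>),
}

struct Lexer<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Lexer<'a> {
    fn new(input: &'a [u8]) -> Self {
        Lexer { input, pos: 0 }
    }

    // Length of the run of regular (non-whitespace, non-delimiter) bytes at `pos`.
    fn word_len(&self) -> usize {
        let rest = &self.input[self.pos..];
        rest.iter()
            .position(|b| b" \t\r\n\x0c\0/[]<>(){}%".contains(b))
            .unwrap_or(rest.len())
    }

    // Generic fallback: consume the whole word and return it unchanged.
    fn lex_keyword(&mut self) -> Token {
        let len = self.word_len();
        let word = self.input[self.pos..self.pos + len].to_vec();
        self.pos += len;
        Token::Keyword(word)
    }

    // Called when the current byte is 'e'.
    fn lex_e_keyword(&mut self) -> Token {
        let input = self.input; // copy the shared reference so `self` stays free to mutate
        let len = self.word_len();
        match &input[self.pos..self.pos + len] {
            b"endstream" => {
                self.pos += len;
                Token::EndStream
            }
            b"endobj" => {
                self.pos += len;
                Token::EndObj
            }
            // Before the fix this arm called lex_name(); it now falls back to lex_keyword().
            _ => self.lex_keyword(),
        }
    }
}
```

In this sketch the whole word is compared before dispatching, so "endob" never partially matches "endobj" and reaches the fallback arm intact.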
Verification
Pre-existing Implementation Status
The structural token lexer was already fully implemented. This fix only corrected the fallback behavior:
| Feature | Status | Location |
|---|---|---|
| Array delimiters [ ] | ✅ | Lines 379-380 |
| Dict delimiters << >> | ✅ | Lines 866-870, 952-956 |
| Boolean keywords true false | ✅ | Lines 474-505 |
| Null keyword null | ✅ | Lines 1064-1074 |
| Obj keywords obj endobj | ✅ | Lines 1028-1050 |
| Stream keywords stream endstream | ✅ | Lines 969-1036 |
| Indirect ref R | ✅ | Lines 1053-1061 |
| Xref keywords xref trailer startxref | ✅ | Lines 483-518, 1007-1014 |
| %%EOF marker | ✅ | Lines 521-528 |
| Stream header validation (PDF 7.3.8.1) | ✅ | Lines 978-1002 |
Acceptance Criteria Tests (all present in code)
- Array with integers (line 2016): [1 2 3] → ArrayStart, Integer(1), Integer(2), Integer(3), ArrayEnd, Eof
- Dict with name and integer (line 2028): << /A 1 >> → DictStart, Name(b"A"), Integer(1), DictEnd, Eof
- Hex string (not dict) (line 1437): <48> → String(b"\x48"), Eof
- Dict-hex-dict ambiguity (line 1576): <<<48>>> → DictStart, String(b"\x48"), DictEnd
- Boolean and null (line 2061): true false null → Bool(true), Bool(false), Null, Eof
- Indirect object header (line 2039): 12 0 obj null endobj → Integer(12), Integer(0), Obj, Null, EndObj, Eof
- Indirect reference (line 2051): 5 0 R → Integer(5), Integer(0), IndirectRef, Eof
- Case-sensitive keyword (line 1134): True → Keyword(b"True"), Eof
- Stream header validation (lines 1188-1221):
  - stream\nbody → Stream, no diagnostics
  - stream\r\nbody → Stream, no diagnostics
  - stream\rbody → Stream + STRUCT_INVALID_STREAM_HEADER
  - stream body → Stream + STRUCT_INVALID_STREAM_HEADER
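The four stream header cases reduce to one rule: the stream keyword must be followed by LF or CRLF, and anything else (including a lone CR) is flagged. A minimal sketch of that check follows. The StreamHeader enum and check_stream_header function are hypothetical names, not the lexer's actual API, and the sketch only classifies the header, whereas the real lexer still emits the Stream token alongside the diagnostic.

```rust
// Hypothetical helper for illustration; names and signature are assumptions.
#[derive(Debug, PartialEq)]
enum StreamHeader {
    // Well-formed header; the stream body starts `body_offset` bytes after the keyword.
    Valid { body_offset: usize },
    // Would correspond to the STRUCT_INVALID_STREAM_HEADER diagnostic.
    Invalid,
}

// `after_keyword` is the input immediately following the bytes `stream`.
fn check_stream_header(after_keyword: &[u8]) -> StreamHeader {
    match after_keyword {
        [b'\r', b'\n', ..] => StreamHeader::Valid { body_offset: 2 },
        [b'\n', ..] => StreamHeader::Valid { body_offset: 1 },
        // Lone CR, a space, any other byte, or end of input.
        _ => StreamHeader::Invalid,
    }
}
```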
Proptest Properties (INV-8 Compliance)
- proptest_random_bytes_never_panics (line 2071): Verifies no panic on any input
- All lexer branches handle EOF gracefully
- Unknown keywords emit Token::Keyword(bytes) instead of panicking
Additional Bug Fix (2026-05-20)
Commit: 7818f22 - fix(pdftract-5upi): remove diagnostic emission for unknown keywords
Issue: The lex_keyword() function was incorrectly emitting StructUnexpectedByte diagnostics for unknown keywords.
Fix: Removed diagnostic emission from lex_keyword() function (lines 540-564).
Rationale:
- Many valid keywords (trailer, xref, etc.) are not in the initial dispatch table
- The object parser is responsible for validating keywords against known operators
- Emitting diagnostics here causes false positives for valid PDF constructs
This change aligns with the task requirement that unknown keywords emit Token::Keyword without a diagnostic, letting the object parser handle STRUCT_UNKNOWN_KEYWORD if needed.
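The division of responsibility could be sketched as follows: the lexer hands over the raw keyword bytes without comment, and a parser-side check decides whether they are acceptable. Everything here is hypothetical. The validate_keyword function, the Diagnostic enum, and the contents of the KNOWN list are illustrative assumptions, and only the STRUCT_UNKNOWN_KEYWORD name comes from this document.

```rust
// Hypothetical parser-side validation; not the actual object parser.
#[derive(Debug, PartialEq)]
enum Diagnostic {
    // Stands in for STRUCT_UNKNOWN_KEYWORD, carrying the offending bytes.
    StructUnknownKeyword(Vec<u8>),
}

// Receives the bytes of a Token::Keyword emitted (without diagnostics) by the lexer.
// Returns true if the keyword is recognized; otherwise records a diagnostic.
fn validate_keyword(word: &[u8], diagnostics: &mut Vec<Diagnostic>) -> bool {
    // Illustrative subset only; the real set of accepted keywords is larger.
    const KNOWN: [&[u8]; 3] = [b"xref", b"trailer", b"startxref"];
    if KNOWN.iter().any(|k| *k == word) {
        true
    } else {
        diagnostics.push(Diagnostic::StructUnknownKeyword(word.to_vec()));
        false
    }
}
```

Keeping the check on the parser side means a keyword such as trailer passes through the lexer silently, and only a word the parser does not accept produces a diagnostic.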
Notes
The lexer module compiles successfully. Full integration tests cannot run due to unrelated pre-existing compilation errors in other modules (missing LZWDecoder, Diagnostic type mismatches in catalog.rs, pages.rs, ocg.rs). These errors are not caused by this change.