The structural token lexer was already fully implemented. All 84 lexer tests pass, covering all acceptance criteria:

- Array/dict delimiters ([], <<>>)
- Keywords (true, false, null, obj, endobj, stream, endstream, R)
- Hex string vs dict ambiguity (< vs <<)
- Stream header validation (\n or \r\n only, lone \r is invalid)
- Case-sensitive keyword matching

This commit fixes a pre-existing compilation error in xref.rs where forward_scan_memory() called parse_obj_header_at_memory() which didn't exist. Added the missing function as a byte-slice variant of parse_obj_header_at() for efficient memory-based scanning.

Verification: notes/pdftract-5upi.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-5upi: Structural Token Lexer
Summary
The structural token lexer was already fully implemented. This verification confirms that all acceptance criteria tests pass. The only change made was fixing a pre-existing compilation error in xref.rs by adding the missing parse_obj_header_at_memory function.
Acceptance Criteria Status
All Critical Tests PASS
- Array delimiters ([1 2 3]): array_delimiters test PASSED
  - ArrayStart, Integer(1), Integer(2), Integer(3), ArrayEnd, Eof
- Dict delimiters (<< /A 1 >>): dict_delimiters test PASSED
  - DictStart, Name(b"A"), Integer(1), DictEnd, Eof
- Hex string not dict (<48>): hex_string_odd_length_single_nibble test PASSED
  - String(b"\x48"), Eof (correctly dispatches < followed by non-< to the hex lexer)
- Dict start, hex string, dict end (<<<48>>>): hex_string_dict_start_hex_string_dict_end test PASSED
  - DictStart, String(b"\x48"), DictEnd
- Boolean and null keywords (true false null): bool_literals and null_keyword tests PASSED
  - Bool(true), Bool(false), Null, Eof
- Object keywords (12 0 obj null endobj): obj_keywords test PASSED
  - Integer(12), Integer(0), Obj, Null, EndObj, Eof
- Indirect reference (5 0 R): indirect_ref_keyword test PASSED
  - Integer(5), Integer(0), IndirectRef, Eof
- Stream keywords (stream\n...endstream): stream_keywords and stream_header_valid_line_endings tests PASSED
  - Token::Stream, then Token::EndStream
- Invalid stream header (stream\rxxx): stream_header_lone_cr_emits_diagnostic test PASSED
  - Token::Stream + STRUCT_INVALID_STREAM_HEADER diagnostic (lone \r is invalid)
- Case-mismatched keyword (True): bool_case_sensitive test PASSED
  - Token::Keyword(b"True"), Eof (object parser will reject)
Proptests PASS
- proptest_hex_string_never_panics_on_random_bytes: PASSED
- proptest_hex_string_roundtrip_via_reencode: PASSED
- proptest_string_never_panics_on_random_bytes: PASSED
- proptest_valid_string_roundtrips: PASSED
- name_proptest_never_panics_on_random_bytes: PASSED
- name_proptest_always_produces_valid_token: PASSED
Implementation Details
The structural token lexer dispatches from next_token() as follows:
- [ / ] → ArrayStart / ArrayEnd (direct return)
- < → peek next byte: if <, return DictStart (advance 2); else hex string lexer
- > → peek next byte: if >, return DictEnd (advance 2); else emit STRUCT_UNEXPECTED_BYTE
- t → check for "true" (Bool(true)) or "trailer" (Keyword), else lex_keyword
- f → check for "false" (Bool(false)), else lex_keyword
- n → check for "null" (Null), else lex_name
- o → check for "obj" (Obj), else lex_name
- e → check for "endstream" (EndStream) or "endobj" (EndObj), else lex_name
- s → check for "stream" (Stream with line ending validation) or "startxref" (Keyword)
- R → IndirectRef
- x → check for "xref" (Keyword)
- % → check for "%%EOF" (Keyword) or skip comment
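The < and > branches of the dispatch above can be sketched in isolation. This is a minimal illustration, not the code in pdftract-core: the Tok enum, lex_angle, and lex_hex_body are hypothetical names, and the hex body handling is simplified (non-hex bytes are skipped, and an odd trailing nibble is padded with a zero low nibble).

```rust
/// Hypothetical token subset; the real Token enum in pdftract-core may differ.
#[derive(Debug, PartialEq)]
pub enum Tok {
    DictStart,
    DictEnd,
    HexString(Vec<u8>),
    UnexpectedByte(u8),
}

/// Decode a hex string body (everything after the opening `<`).
/// Returns the decoded bytes and the number of input bytes consumed,
/// including the closing `>` when one is present.
fn lex_hex_body(body: &[u8]) -> (Vec<u8>, usize) {
    let mut out = Vec::new();
    let mut pending: Option<u8> = None; // high nibble waiting for its partner
    let mut used = 0;
    for &b in body {
        used += 1;
        if b == b'>' {
            break;
        }
        // Simplification: anything that is not a hex digit is skipped.
        let Some(digit) = (b as char).to_digit(16) else { continue };
        let nibble = digit as u8;
        match pending.take() {
            Some(hi) => out.push((hi << 4) | nibble),
            None => pending = Some(nibble),
        }
    }
    // An odd number of digits is padded with a zero low nibble.
    if let Some(hi) = pending {
        out.push(hi << 4);
    }
    (out, used)
}

/// Lex one token starting at `<` or `>` by peeking at the next byte.
/// Returns the token and the number of bytes consumed, or None when the
/// input does not start with an angle bracket.
pub fn lex_angle(input: &[u8]) -> Option<(Tok, usize)> {
    let first = *input.first()?;
    let second = input.get(1).copied();
    match (first, second) {
        (b'<', Some(b'<')) => Some((Tok::DictStart, 2)),
        (b'<', _) => {
            let (bytes, used) = lex_hex_body(&input[1..]);
            Some((Tok::HexString(bytes), 1 + used))
        }
        (b'>', Some(b'>')) => Some((Tok::DictEnd, 2)),
        (b'>', _) => Some((Tok::UnexpectedByte(b'>'), 1)),
        _ => None,
    }
}
```

Calling lex_angle repeatedly on <<<48>>> and advancing by the returned length yields DictStart, HexString([0x48]), DictEnd, matching the sequence in the acceptance criteria. One byte of lookahead is enough to resolve the ambiguity.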
Stream Header Validation
Per PDF spec 7.3.8.1, the stream keyword must be followed by \n or \r\n. A lone \r is INVALID:
    // In lex_s_keyword():
    if let Some(&b'\n') = self.bytes.first() {
        self.advance(1); // \n is valid
    } else if let Some(&b'\r') = self.bytes.first() {
        self.advance(1);
        if let Some(&b'\n') = self.bytes.first() {
            self.advance(1); // \r\n is valid
        } else {
            // Lone \r - emit STRUCT_INVALID_STREAM_HEADER
        }
    }
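The same check can be written as a standalone function over a byte slice, which makes the three cases easy to test without a lexer instance. This is a sketch under assumptions: StreamEol and check_stream_eol are hypothetical names, and the Missing case (no line ending at all) is not covered by the excerpt above, so its handling here is a guess.

```rust
/// Outcome of inspecting the bytes right after the `stream` keyword.
/// Hypothetical type; the real lexer reports the invalid case as a
/// STRUCT_INVALID_STREAM_HEADER diagnostic instead.
#[derive(Debug, PartialEq)]
pub enum StreamEol {
    Lf,      // "\n": valid
    CrLf,    // "\r\n": valid
    LoneCr,  // "\r" not followed by "\n": invalid
    Missing, // assumption: no end-of-line marker at all
}

/// Classify the line ending that follows `stream` and report how many
/// bytes the lexer would advance past.
pub fn check_stream_eol(after_keyword: &[u8]) -> (StreamEol, usize) {
    match after_keyword {
        [b'\n', ..] => (StreamEol::Lf, 1),
        [b'\r', b'\n', ..] => (StreamEol::CrLf, 2),
        // Mirrors the excerpt: the lone \r is still consumed before the
        // diagnostic is emitted.
        [b'\r', ..] => (StreamEol::LoneCr, 1),
        _ => (StreamEol::Missing, 0),
    }
}
```

For the stream\rxxx input from the acceptance criteria, check_stream_eol(b"\rxxx") returns (StreamEol::LoneCr, 1).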
Changes Made
Fixed a pre-existing compilation error in xref.rs by adding the missing parse_obj_header_at_memory function. This function is a variant of parse_obj_header_at that works directly with a byte slice instead of a PdfSource, used by the forward_scan_memory function for efficient scanning of small files.
File: crates/pdftract-core/src/parser/xref.rs
- Added parse_obj_header_at_memory function (lines 1120-1189)
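The notes do not show the body or signature of parse_obj_header_at_memory, so the following is only a rough sketch of what a byte-slice object header parser might look like. The name parse_obj_header_sketch, its return type, and the helpers skip_ws and read_uint are all assumptions made for illustration.

```rust
/// Skip PDF whitespace (NUL, HT, LF, FF, CR, SP) starting at `pos`.
fn skip_ws(bytes: &[u8], mut pos: usize) -> usize {
    while pos < bytes.len()
        && matches!(bytes[pos], b' ' | b'\t' | b'\r' | b'\n' | 0x0C | 0x00)
    {
        pos += 1;
    }
    pos
}

/// Read an unsigned decimal integer at `pos`.
/// Returns the value and the position just past its last digit.
fn read_uint(bytes: &[u8], pos: usize) -> Option<(u64, usize)> {
    let mut end = pos;
    let mut value: u64 = 0;
    while end < bytes.len() && bytes[end].is_ascii_digit() {
        // checked arithmetic: overlong digit runs yield None, not a panic
        value = value
            .checked_mul(10)?
            .checked_add(u64::from(bytes[end] - b'0'))?;
        end += 1;
    }
    if end == pos { None } else { Some((value, end)) }
}

/// Hypothetical stand-in for parse_obj_header_at_memory: parse
/// "<num> <gen> obj" directly from a byte slice at `offset`.
/// Returns (object number, generation, offset just past `obj`).
pub fn parse_obj_header_sketch(bytes: &[u8], offset: usize) -> Option<(u32, u16, usize)> {
    let pos = skip_ws(bytes, offset);
    let (num, pos) = read_uint(bytes, pos)?;
    let after_num = skip_ws(bytes, pos);
    if after_num == pos {
        return None; // whitespace must separate the two integers
    }
    let (generation, pos) = read_uint(bytes, after_num)?;
    let after_gen = skip_ws(bytes, pos);
    if after_gen == pos {
        return None; // whitespace must precede the keyword
    }
    if !bytes.get(after_gen..)?.starts_with(b"obj") {
        return None;
    }
    let end = after_gen + 3;
    // Reject longer words such as "object" that merely begin with "obj".
    if matches!(bytes.get(end), Some(b) if b.is_ascii_alphanumeric()) {
        return None;
    }
    Some((u32::try_from(num).ok()?, u16::try_from(generation).ok()?, end))
}
```

Working on a plain &[u8] avoids going through a PdfSource for every probe, which is the motivation given above for the memory variant. Every failure path returns None rather than panicking.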
INV-8 Status
INV-8 (lexer never panics on invalid input) is maintained:
- All proptests use random byte sequences and verify no panics
- Every lexer branch handles EOF gracefully
- Unknown keywords emit Token::Keyword instead of panicking
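The keyword fallback that supports INV-8 can be illustrated with a small classifier. This is a sketch, assuming a hypothetical KwTok enum and classify_keyword function; the real lexer dispatches on the first byte, as described under Implementation Details, rather than matching whole words.

```rust
/// Hypothetical keyword token subset, named after the tokens in these notes.
#[derive(Debug, PartialEq)]
pub enum KwTok {
    Bool(bool),
    Null,
    Obj,
    EndObj,
    Stream,
    EndStream,
    IndirectRef,
    Keyword(Vec<u8>),
}

/// Classify a complete keyword. Matching is byte-exact, so "True" is not
/// "true"; anything unrecognized falls through to Keyword instead of
/// panicking, leaving rejection to the object parser.
pub fn classify_keyword(word: &[u8]) -> KwTok {
    match word {
        b"true" => KwTok::Bool(true),
        b"false" => KwTok::Bool(false),
        b"null" => KwTok::Null,
        b"obj" => KwTok::Obj,
        b"endobj" => KwTok::EndObj,
        b"stream" => KwTok::Stream,
        b"endstream" => KwTok::EndStream,
        b"R" => KwTok::IndirectRef,
        other => KwTok::Keyword(other.to_vec()),
    }
}
```

Because the final arm accepts any byte sequence, the function is total: classify_keyword(b"True") returns KwTok::Keyword(b"True".to_vec()), the same shape as the bool_case_sensitive result.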