Commit graph

3 commits

Author SHA1 Message Date
jedarden
0b838de6cc docs(pdftract-5upi): update verification note with additional bug fix
Add documentation for the fix that removed diagnostic emission for
unknown keywords, complementing the earlier keyword fallback fix.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-20 22:05:17 -04:00
jedarden
fee6ed8afd fix(pdftract-5upi): correct keyword fallback in lexer
Fixed incorrect fallback behavior in keyword lexer functions. Four
functions (lex_e_keyword, lex_o_keyword, lex_r_keyword, lex_n_keyword)
were incorrectly calling lex_name() instead of lex_keyword() when
keywords didn't match.

When a PDF contains an unrecognized word starting with e/o/n/R
(e.g., "endob" instead of "endobj"), the lexer should fall back to
generic keyword parsing (Token::Keyword(bytes)), not name parsing.
Names always start with /, so calling lex_name() on input without
a leading / would incorrectly skip the first byte.
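
A minimal sketch of the corrected fallback, not the actual pdftract code. Only the function names and `Token::Keyword(bytes)` come from the message above; the `Token` variants, the `(Token, usize)` return shape, and the `word_len` and `is_boundary` helpers are assumptions made for illustration.

```rust
/// Hypothetical token type; only `Keyword(bytes)` is named in the commit message.
#[derive(Debug, PartialEq)]
pub enum Token {
    EndObj,
    EndStream,
    Keyword(Vec<u8>),
    Name(Vec<u8>),
}

/// PDF whitespace and delimiter bytes end a run of regular characters.
fn is_boundary(b: u8) -> bool {
    matches!(
        b,
        0 | b'\t' | b'\n' | 0x0C | b'\r' | b' '
            | b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%'
    )
}

/// Length of the leading run of regular characters.
fn word_len(input: &[u8]) -> usize {
    input.iter().position(|&b| is_boundary(b)).unwrap_or(input.len())
}

/// Generic fallback: keep every byte of the unrecognized word.
pub fn lex_keyword(input: &[u8]) -> (Token, usize) {
    let len = word_len(input);
    (Token::Keyword(input[..len].to_vec()), len)
}

/// Name lexing assumes the first byte is `/` and drops it unconditionally.
pub fn lex_name(input: &[u8]) -> (Token, usize) {
    let rest = input.get(1..).unwrap_or(&[]);
    let len = word_len(rest);
    (Token::Name(rest[..len].to_vec()), input.len() - rest.len() + len)
}

/// Words starting with `e`: recognize the known keywords, otherwise fall back
/// to generic keyword lexing (the buggy version called `lex_name` here).
pub fn lex_e_keyword(input: &[u8]) -> (Token, usize) {
    let len = word_len(input);
    match &input[..len] {
        b"endobj" => (Token::EndObj, len),
        b"endstream" => (Token::EndStream, len),
        _ => lex_keyword(input),
    }
}
```

Under these assumptions, lexing `endob` through `lex_e_keyword` keeps all five bytes as a keyword, while passing the same input to `lex_name` would yield a name containing only `ndob`.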

References:
- Bead: pdftract-5upi
- Notes: notes/pdftract-5upi.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:55:55 -04:00
jedarden
a88353069a fix(pdftract-5upi): add parse_obj_header_at_memory for xref forward scan
The structural token lexer was already fully implemented. All 84 lexer
tests pass, covering all acceptance criteria:

- Array/dict delimiters ([], <<>>)
- Keywords (true, false, null, obj, endobj, stream, endstream, R)
- Hex string vs dict ambiguity (< vs <<)
- Stream header validation (\n or \r\n only; a lone \r is invalid)
- Case-sensitive keyword matching
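
Two of the criteria above can be illustrated with a short sketch. The function names `stream_eol_len` and `classify_angle` and the `Open` enum are hypothetical and are not taken from the pdftract sources; the sketch only shows the rules as stated in the list.

```rust
/// Number of end-of-line bytes accepted after the `stream` keyword,
/// or `None` if the header is invalid.
pub fn stream_eol_len(after_keyword: &[u8]) -> Option<usize> {
    match after_keyword {
        [b'\r', b'\n', ..] => Some(2), // CRLF
        [b'\n', ..] => Some(1),        // LF
        _ => None,                     // lone CR, or anything else, is rejected
    }
}

/// What an opening `<` introduces.
#[derive(Debug, PartialEq)]
pub enum Open {
    DictStart,
    HexString,
}

/// Resolve the `<` vs `<<` ambiguity by looking one byte ahead.
pub fn classify_angle(input: &[u8]) -> Option<Open> {
    match input {
        [b'<', b'<', ..] => Some(Open::DictStart),
        [b'<', ..] => Some(Open::HexString),
        _ => None,
    }
}
```

The `<<` arm has to come first, since every dictionary opener also matches the single `<` pattern.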

This commit fixes a pre-existing compilation error in xref.rs:
forward_scan_memory() called parse_obj_header_at_memory(), which did not
exist. Added the missing function as a byte-slice variant of
parse_obj_header_at() for efficient memory-based scanning.
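
For orientation, a byte-slice header parser could look roughly like the sketch below. The signature, the `(object number, generation)` return value, and the helper functions are assumptions; the real parse_obj_header_at_memory() in xref.rs may differ.

```rust
/// PDF whitespace bytes.
fn is_ws(b: u8) -> bool {
    matches!(b, 0 | b'\t' | b'\n' | 0x0C | b'\r' | b' ')
}

/// Advance `pos` past whitespace and return how many bytes were skipped.
fn skip_ws(data: &[u8], pos: &mut usize) -> usize {
    let start = *pos;
    while data.get(*pos).map_or(false, |&b| is_ws(b)) {
        *pos += 1;
    }
    *pos - start
}

/// Read an unsigned decimal integer; `None` on no digits or overflow.
fn read_uint(data: &[u8], pos: &mut usize) -> Option<u32> {
    let start = *pos;
    let mut value: u32 = 0;
    while let Some(&b) = data.get(*pos) {
        if !b.is_ascii_digit() {
            break;
        }
        value = value.checked_mul(10)?.checked_add(u32::from(b - b'0'))?;
        *pos += 1;
    }
    if *pos == start { None } else { Some(value) }
}

/// Hypothetical signature: parse `N G obj` starting at `offset` in an
/// in-memory buffer and return (object number, generation).
pub fn parse_obj_header_at_memory(data: &[u8], offset: usize) -> Option<(u32, u32)> {
    let mut pos = offset;
    let number = read_uint(data, &mut pos)?;
    if skip_ws(data, &mut pos) == 0 {
        return None;
    }
    let generation = read_uint(data, &mut pos)?;
    if skip_ws(data, &mut pos) == 0 {
        return None;
    }
    if data.get(pos..pos + 3)? != b"obj" {
        return None;
    }
    // Reject longer words such as `objx` that merely start with `obj`.
    if data.get(pos + 3).map_or(false, |b| b.is_ascii_alphanumeric()) {
        return None;
    }
    Some((number, generation))
}
```

Working directly on a `&[u8]` lets a forward scan probe many candidate offsets without seeking in a file or reader, which is presumably what "memory-based scanning" refers to.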

Verification: notes/pdftract-5upi.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 02:54:35 -04:00