jedarden f1c7f1296e feat(pdftract-1jjn): implement PDF numeric literal lexer with full edge case support

- Add `.` to match pattern for numbers starting with decimal point
- Fix bare sign handling to prevent infinite loops (+/- without digits)
- Fix multiple dots detection using loop instead of single if
- Add `)` delimiter handling to prevent infinite loops in proptests
- Add comprehensive acceptance criteria tests for all numeric formats
- Add proptest for numeric literal edge cases

Acceptance criteria PASS:
- 123 -> Integer(123)
- -7 -> Integer(-7)
- 3.14 -> Real(3.14)
- -.5 -> Real(-0.5)
- 42. -> Real(42.0)
- .001 -> Real(0.001)
- +0 -> Integer(0)
- 1e5 -> Integer(1) + Keyword(b"e5") (no scientific notation)
- Overflow -> Integer(i64::MAX) with STRUCT_INTEGER_OVERFLOW
- --5 -> STRUCT_INVALID_NUMBER diagnostic
- 1.2.3 -> STRUCT_INVALID_NUMBER diagnostic

All 105 lexer tests pass including new proptest.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-23 23:17:04 -04:00


pdftract-1jjn: PDF Numeric Literal Lexer Implementation

Summary

Implemented the PDF numeric literal lexer for integers and real numbers, covering the edge cases of the numeric object syntax in section 7.3.3 of the PDF spec.

Changes Made

1. Fixed existing numeric lexer bugs

Added . to match pattern: The lexer now correctly recognizes numbers starting with . (e.g., .001)
Fixed bare sign handling: When a + or - is not followed by any digit, the lexer now advances past the sign to prevent infinite loops
Fixed multiple dots detection: Changed from single if to while loop to properly detect multiple dots in malformed input like 1.2.3

2. Added `)` delimiter handling

Added ) to the match statement in lex_next() so that it is handled as an unexpected byte when it appears outside a string. This prevents the infinite loop that the proptest exposed.
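The dispatch described above might look roughly like the following. This is a minimal sketch, not the crate's code: the `Tok` enum, the signature of `lex_next`, and the one-byte advance for every arm are assumptions made for illustration. The point it shows is that a stray `)` gets its own arm and is still consumed.

```rust
/// Hypothetical token kinds, for this sketch only.
#[derive(Debug, PartialEq)]
pub enum Tok {
    /// Placeholder: the numeric lexer would take over here.
    Number,
    /// A byte that is not valid at this position; reported and skipped.
    Unexpected(u8),
    /// Placeholder for every other token kind.
    Other,
}

/// Classifies the byte at `pos` and returns the token with the next position.
/// Returns `None` only at end of input, so every `Some` result advances.
pub fn lex_next(input: &[u8], pos: usize) -> Option<(Tok, usize)> {
    let byte = *input.get(pos)?;
    let tok = match byte {
        // Numbers may start with a sign, a dot, or a digit.
        b'+' | b'-' | b'.' | b'0'..=b'9' => Tok::Number,
        // `)` closes a string; outside one it is unexpected but still consumed.
        b')' => Tok::Unexpected(byte),
        _ => Tok::Other,
    };
    // Simplification: one byte per token. A real lexer would advance by the
    // token's length, but never by zero.
    Some((tok, pos + 1))
}
```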

3. Added comprehensive tests

Added the following acceptance criteria tests:

numeric_integer_positive: 123 -> Integer(123) ✓
numeric_integer_negative: -7 -> Integer(-7) ✓
numeric_real_simple: 3.14 -> Real(3.14) ✓
numeric_real_negative_dot_then_digits: -.5 -> Real(-0.5) ✓
numeric_real_digits_then_dot: 42. -> Real(42.0) ✓
numeric_real_dot_then_digits: .001 -> Real(0.001) ✓
numeric_integer_positive_zero: +0 -> Integer(0) ✓
numeric_scientific_notation_rejected: 1e5 -> Integer(1) followed by Keyword(b"e5") ✓
numeric_overflow_clamps_to_max: Overflow -> Integer(i64::MAX) with STRUCT_INTEGER_OVERFLOW ✓
numeric_double_sign_emits_diagnostic: --5 -> STRUCT_INVALID_NUMBER ✓
numeric_multiple_dots_emits_diagnostic: 1.2.3 -> STRUCT_INVALID_NUMBER ✓
numeric_bare_sign_emits_diagnostic: + alone -> STRUCT_INVALID_NUMBER ✓
numeric_hex_notation_not_supported: 0xFF -> Integer(0) + Keyword(b"xFF") ✓
numeric_real_negative_dot_then_digits_with_boundary: Boundary detection ✓
proptest_numeric_never_panics: Property test for random byte sequences starting with numeric characters ✓

Acceptance Criteria Status

Critical tests: All PASS
- 123 -> Integer(123) ✓
- -7 -> Integer(-7) ✓
- 3.14 -> Real(3.14) ✓
- -.5 -> Real(-0.5) ✓
- 42. -> Real(42.0) ✓
- .001 -> Real(0.001) ✓
- +0 -> Integer(0) ✓
- 1e5 -> Integer(1) followed by Keyword(b"e5") ✓
- 99999999999999999999 (overflow) -> Integer(i64::MAX) with STRUCT_INTEGER_OVERFLOW ✓
- --5 -> diagnostic STRUCT_INVALID_NUMBER ✓
proptest property: Random byte sequences starting with any of the characters `+-0123456789.` never panic ✓
INV-8 maintained: No unwrap/expect calls ✓

Implementation Details

The numeric lexer uses a state machine that:

Consumes optional leading sign (+ or -)
Consumes digits before the decimal point (if any)
Loops to consume decimal points and following digits (detects multiple dots)
Validates at least one digit is present
Validates at most one dot for real numbers
Stops at whitespace or delimiters
Does NOT support scientific notation (e/E terminates the number)
Handles overflow by clamping to i64::MAX with diagnostic
Handles parse failures by returning default values with diagnostics
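
The steps above can be sketched as a single function. This is an illustrative reconstruction under stated assumptions, not the code in mod.rs: the names `Token`, `Diagnostic`, and `lex_number` are hypothetical, the default tokens returned on invalid input (`Integer(0)`, `Real(0.0)`) are guesses at what "default values" means, and clamping negative overflow to `i64::MIN` is an assumption, since this note only mentions `i64::MAX`.

```rust
use std::num::IntErrorKind;

/// Illustrative token type; the crate's real definition may differ.
#[derive(Debug, PartialEq)]
pub enum Token {
    Integer(i64),
    Real(f64),
}

/// Illustrative diagnostic codes, named after the ones in this note.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Diagnostic {
    StructInvalidNumber,
    StructIntegerOverflow,
}

/// Lexes one numeric literal from the start of `input`.
/// Returns the token, the number of bytes consumed, and an optional diagnostic.
pub fn lex_number(input: &[u8]) -> (Token, usize, Option<Diagnostic>) {
    let is_digit = |p: usize| p < input.len() && input[p].is_ascii_digit();
    let invalid = Some(Diagnostic::StructInvalidNumber);
    let overflow = Some(Diagnostic::StructIntegerOverflow);
    let mut pos = 0;

    // Optional leading sign.
    if pos < input.len() && matches!(input[pos], b'+' | b'-') {
        pos += 1;
    }

    // Digits before the decimal point, if any.
    let mut digits = 0usize;
    while is_digit(pos) {
        pos += 1;
        digits += 1;
    }

    // A loop rather than a single `if`, so a second dot (as in `1.2.3`) is seen.
    let mut dots = 0usize;
    while pos < input.len() && input[pos] == b'.' {
        pos += 1;
        dots += 1;
        while is_digit(pos) {
            pos += 1;
            digits += 1;
        }
    }

    // Bare sign or lone dot: report it, but consume at least one byte so the
    // caller always makes progress.
    if digits == 0 {
        return (Token::Integer(0), pos.max(1).min(input.len()), invalid);
    }
    if dots > 1 {
        return (Token::Real(0.0), pos, invalid);
    }

    // Only ASCII bytes were consumed, so this should not fail; no unwrap anyway.
    let text = match std::str::from_utf8(&input[..pos]) {
        Ok(t) => t,
        Err(_) => return (Token::Integer(0), pos, invalid),
    };

    if dots == 1 {
        return match text.parse::<f64>() {
            Ok(v) => (Token::Real(v), pos, None),
            Err(_) => (Token::Real(0.0), pos, invalid),
        };
    }

    match text.parse::<i64>() {
        Ok(v) => (Token::Integer(v), pos, None),
        Err(e) => match e.kind() {
            IntErrorKind::PosOverflow => (Token::Integer(i64::MAX), pos, overflow),
            // Assumption: negative overflow clamps to i64::MIN.
            IntErrorKind::NegOverflow => (Token::Integer(i64::MIN), pos, overflow),
            _ => (Token::Integer(0), pos, invalid),
        },
    }
}
```

In this sketch, `lex_number(b"1e5")` consumes only the `1` and leaves `e5` for whatever lexes keywords, and `lex_number(b"--5")` reports the first `-` as a bare sign and consumes exactly one byte, so a following call starts at `-5`.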

Files Modified

crates/pdftract-core/src/parser/lexer/mod.rs: Fixed bugs and added tests
notes/pdftract-1jjn.md: This verification note

Test Results

All 105 lexer tests pass, including:

All existing tests (regression check)
All new numeric literal tests
All proptests (including the new numeric proptest)

Retrospective

What worked

The existing numeric lexer implementation was mostly correct and needed only minor fixes
The loop-based approach for detecting multiple dots is clean and efficient
The proptest approach caught the infinite loop bug with ) delimiter

What didn't

Initial implementation didn't include . in the match pattern, causing .001 to be parsed as a keyword
The single-if approach for dot handling missed multiple dots
Bare sign handling didn't advance the lexer, causing infinite loops

Surprise

The proptest found an infinite loop bug with ) that would have been difficult to catch with unit tests alone
The PDF spec does not allow exponential (scientific) notation in numeric objects, unlike most other numeric formats

Reusable pattern

For future lexer work, always use property testing to catch infinite loop bugs
When implementing state machines for tokenization, always ensure forward progress (advance the lexer) in all code paths
Loop-based validation (e.g., for multiple dots) is more robust than a one-shot if check, which only ever sees the first occurrence
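
The forward-progress rule can be turned into a cheap check that a test harness runs over any input. The following is a dependency-free sketch, not part of the crate: `drive`, `buggy_step`, and `fixed_step` are hypothetical names, and a step function here simply maps a position to the next position. `buggy_step` deliberately stalls on `)` to mimic the bug described in this note.

```rust
/// Runs `step` until the input is exhausted and returns the number of steps,
/// or `Err(pos)` for the first position at which `step` failed to advance.
pub fn drive<F>(input: &[u8], mut step: F) -> Result<usize, usize>
where
    F: FnMut(&[u8], usize) -> usize,
{
    let mut pos = 0;
    let mut steps = 0;
    while pos < input.len() {
        let next = step(input, pos);
        // A stalled or out-of-range position is reported instead of looping.
        if next <= pos || next > input.len() {
            return Err(pos);
        }
        pos = next;
        steps += 1;
    }
    Ok(steps)
}

/// Deliberately buggy: does not advance on `)`.
pub fn buggy_step(input: &[u8], pos: usize) -> usize {
    if input[pos] == b')' { pos } else { pos + 1 }
}

/// Fixed: every byte, including `)`, advances the position.
pub fn fixed_step(_input: &[u8], pos: usize) -> usize {
    pos + 1
}
```

Calling `drive(b"12)3", buggy_step)` returns `Err(2)`, pointing at the stray `)`, while `drive(b"12)3", fixed_step)` returns `Ok(4)`. Feeding `drive` with generated inputs gives a property test that fails fast instead of hanging.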
