- Add `.` to match pattern for numbers starting with decimal point
- Fix bare sign handling to prevent infinite loops (+/- without digits)
- Fix multiple dots detection using loop instead of single if
- Add `)` delimiter handling to prevent infinite loops in proptests
- Add comprehensive acceptance criteria tests for all numeric formats
- Add proptest for numeric literal edge cases

Acceptance criteria PASS:
- 123 -> Integer(123)
- -7 -> Integer(-7)
- 3.14 -> Real(3.14)
- -.5 -> Real(-0.5)
- 42. -> Real(42.0)
- .001 -> Real(0.001)
- +0 -> Integer(0)
- 1e5 -> Integer(1) + Keyword(b"e5") (no scientific notation)
- Overflow -> Integer(i64::MAX) with STRUCT_INTEGER_OVERFLOW
- --5 -> STRUCT_INVALID_NUMBER diagnostic
- 1.2.3 -> STRUCT_INVALID_NUMBER diagnostic

All 105 lexer tests pass including new proptest.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-1jjn: PDF Numeric Literal Lexer Implementation
Summary
Implemented PDF numeric literal lexer for both integers and real numbers with full support for all edge cases specified in PDF spec 7.3.3.
Changes Made
1. Fixed existing numeric lexer bugs
- Added `.` to match pattern: The lexer now correctly recognizes numbers starting with `.` (e.g., `.001`)
- Fixed bare sign handling: When a `+` or `-` is not followed by any digit, the lexer now advances past the sign to prevent infinite loops
- Fixed multiple dots detection: Changed from single `if` to `while` loop to properly detect multiple dots in malformed input like `1.2.3`
2. Added `)` delimiter handling
Added `)` to the match statement in `lex_next()` to handle it as an unexpected byte when it appears outside of a string context. This prevents infinite loops in the proptest.
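The idea behind this fix can be sketched as a dispatch step in which every arm, including the one for a stray `)`, consumes at least one byte. This is a minimal illustration only: `Step`, `step`, and `count_steps` are hypothetical names and do not reflect the real `lex_next()` signature or token types.

```rust
/// Hypothetical outcome of one dispatch step; not the crate's real token type.
#[derive(Debug, PartialEq, Eq)]
pub enum Step {
    Whitespace,
    UnexpectedByte(u8),
    Other,
}

/// Classifies the byte at `pos` and returns the position to resume from.
/// Returns `None` at end of input. Every arm advances by at least one byte.
pub fn step(input: &[u8], pos: usize) -> Option<(Step, usize)> {
    let byte = *input.get(pos)?;
    let out = match byte {
        // The six PDF whitespace bytes: NUL, HT, LF, FF, CR, SP.
        0x00 | b'\t' | b'\n' | 0x0C | b'\r' | b' ' => (Step::Whitespace, pos + 1),
        // A `)` outside a string has no opener: report it and skip exactly one byte.
        b')' => (Step::UnexpectedByte(byte), pos + 1),
        // Everything else is out of scope for this sketch but still advances.
        _ => (Step::Other, pos + 1),
    };
    Some(out)
}

/// Drives `step` to the end of the input and counts how many steps it took.
pub fn count_steps(input: &[u8]) -> usize {
    let mut pos = 0;
    let mut steps = 0;
    while let Some((_, next)) = step(input, pos) {
        debug_assert!(next > pos, "every arm must make forward progress");
        pos = next;
        steps += 1;
    }
    steps
}
```

Because each arm returns a position strictly greater than `pos`, the driving loop terminates for any input, which is the property the proptest exercises.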
3. Added comprehensive tests
Added the following acceptance criteria tests:
- numeric_integer_positive: `123` -> `Integer(123)` ✓
- numeric_integer_negative: `-7` -> `Integer(-7)` ✓
- numeric_real_simple: `3.14` -> `Real(3.14)` ✓
- numeric_real_negative_dot_then_digits: `-.5` -> `Real(-0.5)` ✓
- numeric_real_digits_then_dot: `42.` -> `Real(42.0)` ✓
- numeric_real_dot_then_digits: `.001` -> `Real(0.001)` ✓
- numeric_integer_positive_zero: `+0` -> `Integer(0)` ✓
- numeric_scientific_notation_rejected: `1e5` -> `Integer(1)` followed by `Keyword(b"e5")` ✓
- numeric_overflow_clamps_to_max: Overflow -> `Integer(i64::MAX)` with `STRUCT_INTEGER_OVERFLOW` ✓
- numeric_double_sign_emits_diagnostic: `--5` -> `STRUCT_INVALID_NUMBER` ✓
- numeric_multiple_dots_emits_diagnostic: `1.2.3` -> `STRUCT_INVALID_NUMBER` ✓
- numeric_bare_sign_emits_diagnostic: `+` alone -> `STRUCT_INVALID_NUMBER` ✓
- numeric_hex_notation_not_supported: `0xFF` -> `Integer(0)` + `Keyword(b"xFF")` ✓
- numeric_real_negative_dot_then_digits_with_boundary: Boundary detection ✓
- proptest_numeric_never_panics: Property test for random byte sequences starting with numeric characters ✓
Acceptance Criteria Status
- Critical tests: All PASS
  - `123` -> `Integer(123)` ✓
  - `-7` -> `Integer(-7)` ✓
  - `3.14` -> `Real(3.14)` ✓
  - `-.5` -> `Real(-0.5)` ✓
  - `42.` -> `Real(42.0)` ✓
  - `.001` -> `Real(0.001)` ✓
  - `+0` -> `Integer(0)` ✓
  - `1e5` -> `Integer(1)` followed by `Keyword(b"e5")` ✓
  - `99999999999999999999` (overflow) -> `Integer(i64::MAX)` with `STRUCT_INTEGER_OVERFLOW` ✓
  - `--5` -> diagnostic `STRUCT_INVALID_NUMBER` ✓
- proptest property: Random byte sequences starting with `+-0123456789.` never panic ✓
- INV-8 maintained: No unwrap/expect calls ✓
Implementation Details
The numeric lexer uses a state machine that:
- Consumes optional leading sign (`+` or `-`)
- Consumes digits before the decimal point (if any)
- Loops to consume decimal points and following digits (detects multiple dots)
- Validates at least one digit is present
- Validates at most one dot for real numbers
- Stops at whitespace or delimiters
- Does NOT support scientific notation (`e`/`E` terminates the number)
- Handles overflow by clamping to `i64::MAX` with diagnostic
- Handles parse failures by returning default values with diagnostics
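The steps above can be sketched roughly as follows. This is not the code in `mod.rs`: `Token`, `Diag`, and `lex_number` are hypothetical names, and several behaviours are assumptions made for illustration only (all leading signs are consumed at once, invalid input yields `Integer(0)`, and negative overflow clamps to `i64::MIN`).

```rust
/// Hypothetical token type; the real crate's definition may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Integer(i64),
    Real(f64),
}

/// Hypothetical stand-ins for the diagnostics named in this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Diag {
    InvalidNumber,   // stands in for STRUCT_INVALID_NUMBER
    IntegerOverflow, // stands in for STRUCT_INTEGER_OVERFLOW
}

/// Lexes one numeric literal beginning at `start`.
/// Returns the token, an optional diagnostic, and the position after the literal.
pub fn lex_number(input: &[u8], start: usize) -> (Token, Option<Diag>, usize) {
    let mut pos = start;
    let (mut signs, mut digits, mut dots) = (0usize, 0usize, 0usize);

    // Leading signs. Consuming all of them is an assumption; it lets `--5` be reported once.
    while pos < input.len() && matches!(input[pos], b'+' | b'-') {
        signs += 1;
        pos += 1;
    }
    // Digits and dots in a loop (not a single `if`), so `1.2.3` is consumed in full.
    while pos < input.len() {
        match input[pos] {
            b'0'..=b'9' => digits += 1,
            b'.' => dots += 1,
            _ => break, // whitespace, a delimiter, or `e`/`E` ends the number
        }
        pos += 1;
    }
    // Forward progress even if the caller dispatched on a non-numeric byte.
    if pos == start && start < input.len() {
        pos += 1;
    }
    if digits == 0 || signs > 1 || dots > 1 {
        // Assumed default value for malformed input.
        return (Token::Integer(0), Some(Diag::InvalidNumber), pos);
    }

    // The slice holds only ASCII sign, digit, and dot bytes, so the fallback is never used.
    let text = std::str::from_utf8(&input[start..pos]).unwrap_or("");
    if dots == 0 {
        match text.parse::<i64>() {
            Ok(v) => (Token::Integer(v), None, pos),
            // With one optional sign and at least one digit, the only failure is overflow.
            Err(_) => {
                let clamped = if text.starts_with('-') { i64::MIN } else { i64::MAX };
                (Token::Integer(clamped), Some(Diag::IntegerOverflow), pos)
            }
        }
    } else {
        match text.parse::<f64>() {
            Ok(v) => (Token::Real(v), None, pos),
            Err(_) => (Token::Real(0.0), Some(Diag::InvalidNumber), pos),
        }
    }
}
```

For `1e5` the sketch stops at `e` and returns `Integer(1)` with the position at index 1, leaving `e5` for whatever handles keywords. It avoids `unwrap` and `expect`, in keeping with INV-8.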
Files Modified
- `crates/pdftract-core/src/parser/lexer/mod.rs`: Fixed bugs and added tests
- `notes/pdftract-1jjn.md`: This verification note
Test Results
All 105 lexer tests pass, including:
- All existing tests (regression check)
- All new numeric literal tests
- All proptests (including the new numeric proptest)
Retrospective
What worked
- The existing numeric lexer implementation was mostly correct, only needed minor fixes
- The loop-based approach for detecting multiple dots is clean and efficient
- The proptest approach caught the infinite loop bug with `)` delimiter
What didn't
- Initial implementation didn't include `.` in the match pattern, causing `.001` to be parsed as a keyword
- The single-`if` approach for dot handling missed multiple dots
- Bare sign handling didn't advance the lexer, causing infinite loops
Surprise
- The proptest found an infinite loop bug with `)` that would have been difficult to catch with unit tests alone
- PDF spec specifically forbids scientific notation, which is different from most other numeric formats
Reusable pattern
- For future lexer work, always use property testing to catch infinite loop bugs
- When implementing state machines for tokenization, always ensure forward progress (advance the lexer) in all code paths
- Loop-based validation (e.g., for multiple dots) is more robust than single-pass checks
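As a rough illustration of the first two patterns, the harness below checks forward progress over random inputs whose first byte is one of `+-0123456789.`. It is a std-only stand-in and does not use the proptest crate; `Lcg` and `check_forward_progress` are hypothetical names, and the `step` closure is assumed to return the position after one token.

```rust
/// Tiny deterministic generator so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 32) as u32
    }
}

/// Runs `step` over `cases` random inputs that start with a numeric-looking byte.
/// Returns the first input on which `step` stalls or runs past the end of the buffer.
pub fn check_forward_progress<F>(step: F, cases: usize, seed: u64) -> Result<(), Vec<u8>>
where
    F: Fn(&[u8], usize) -> usize,
{
    const STARTS: &[u8] = b"+-0123456789.";
    let mut rng = Lcg(seed);
    for _ in 0..cases {
        // Inputs are 1 to 16 bytes long; bytes after the first are arbitrary.
        let len = 1 + (rng.next_u32() % 16) as usize;
        let mut input = Vec::with_capacity(len);
        input.push(STARTS[rng.next_u32() as usize % STARTS.len()]);
        while input.len() < len {
            input.push(rng.next_u32() as u8);
        }
        let mut pos = 0;
        while pos < input.len() {
            let next = step(&input, pos);
            if next <= pos || next > input.len() {
                return Err(input); // counterexample: no progress, or overrun
            }
            pos = next;
        }
    }
    Ok(())
}
```

A stalled lexer shows up here as an `Err` carrying the offending input instead of a hung test run. A real property-testing library would additionally shrink that input to a minimal counterexample.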