- Add `.` to match pattern for numbers starting with decimal point
- Fix bare sign handling to prevent infinite loops (+/- without digits)
- Fix multiple dots detection using loop instead of single if
- Add `)` delimiter handling to prevent infinite loops in proptests
- Add comprehensive acceptance criteria tests for all numeric formats
- Add proptest for numeric literal edge cases

Acceptance criteria PASS:
- 123 -> Integer(123)
- -7 -> Integer(-7)
- 3.14 -> Real(3.14)
- -.5 -> Real(-0.5)
- 42. -> Real(42.0)
- .001 -> Real(0.001)
- +0 -> Integer(0)
- 1e5 -> Integer(1) + Keyword(b"e5") (no scientific notation)
- Overflow -> Integer(i64::MAX) with STRUCT_INTEGER_OVERFLOW
- --5 -> STRUCT_INVALID_NUMBER diagnostic
- 1.2.3 -> STRUCT_INVALID_NUMBER diagnostic

All 105 lexer tests pass including new proptest.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
99 lines
4.6 KiB
Markdown
# pdftract-1jjn: PDF Numeric Literal Lexer Implementation

## Summary

Implemented the PDF numeric literal lexer for integers and real numbers, covering the edge cases specified in PDF spec section 7.3.3 (Numeric Objects).

## Changes Made

### 1. Fixed existing numeric lexer bugs

- **Added `.` to the match pattern**: The lexer now recognizes numbers that start with `.` (e.g., `.001`). Previously such input was parsed as a keyword.
- **Fixed bare sign handling**: When a `+` or `-` is not followed by any digit, the lexer now advances past the sign, which prevents an infinite loop.
- **Fixed multiple-dot detection**: Replaced the single `if` with a `while` loop so that multiple dots in malformed input such as `1.2.3` are detected.

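A minimal sketch of the scanning behaviour these fixes describe, not the code in `mod.rs`. `Scan` and `scan_number` are hypothetical names, and the real lexer's types are assumed to differ. It shows the sign being consumed unconditionally and the `while` loop that counts every dot.

```rust
/// Result of scanning one numeric candidate (hypothetical shape).
#[derive(Debug, PartialEq)]
struct Scan {
    end: usize,    // index one past the last consumed byte
    digits: usize, // number of ASCII digits consumed
    dots: usize,   // number of '.' bytes consumed
}

/// Scan `[+-]? digits* ('.' digits*)*` starting at `start`.
fn scan_number(input: &[u8], start: usize) -> Scan {
    let mut pos = start;
    let (mut digits, mut dots) = (0, 0);

    // Optional sign. It is consumed even when no digit follows, so a bare
    // `+` or `-` still moves the cursor and cannot stall the caller.
    if pos < input.len() && (input[pos] == b'+' || input[pos] == b'-') {
        pos += 1;
    }
    // Digits before the decimal point, if any.
    while pos < input.len() && input[pos].is_ascii_digit() {
        digits += 1;
        pos += 1;
    }
    // `while` rather than `if`: every dot and the digits after it are
    // consumed and counted, so `1.2.3` yields dots == 2.
    while pos < input.len() && input[pos] == b'.' {
        dots += 1;
        pos += 1;
        while pos < input.len() && input[pos].is_ascii_digit() {
            digits += 1;
            pos += 1;
        }
    }
    Scan { end: pos, digits, dots }
}
```

Under these assumptions, `scan_number(b"1.2.3", 0)` ends at index 5 with two dots, so the caller can flag the whole span rather than lexing `.3` as a separate number, and `scan_number(b"+", 0)` ends at index 1 with zero digits.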
### 2. Added `)` delimiter handling

Added `)` to the match statement in `lex_next()` so that a `)` appearing outside a string context is handled as an unexpected byte. Without this arm, such input caused an infinite loop in the proptest.

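The forward-progress rule behind this fix can be sketched as follows. The `SketchToken` variants, the `tokenize` driver, and the two-argument `lex_next` signature are assumptions made for illustration; the note only confirms that `lex_next()` matches on `)` and treats it as an unexpected byte.

```rust
#[derive(Debug, PartialEq)]
enum SketchToken {
    Unexpected(u8), // hypothetical variant for a stray byte
    Other(usize),   // placeholder for every other token kind; holds its length
}

/// Return the next token and the position just after it, or `None` at end of input.
fn lex_next(input: &[u8], pos: usize) -> Option<(SketchToken, usize)> {
    let byte = *input.get(pos)?;
    match byte {
        // A `)` outside a string has no matching `(`. Report it and step
        // past it; returning without advancing would loop forever.
        b')' => Some((SketchToken::Unexpected(byte), pos + 1)),
        _ => {
            // Stand-in for all other arms: consume a run of non-`)` bytes.
            // `byte` is not `)` here, so `len` is at least 1.
            let len = input[pos..].iter().take_while(|&&b| b != b')').count();
            Some((SketchToken::Other(len), pos + len))
        }
    }
}

/// Drive `lex_next` to the end of the input.
fn tokenize(input: &[u8]) -> Vec<SketchToken> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while let Some((token, next)) = lex_next(input, pos) {
        debug_assert!(next > pos, "every arm must advance");
        tokens.push(token);
        pos = next;
    }
    tokens
}
```

The point of the sketch is that every match arm returns a position strictly greater than `pos`, which is what lets the driver loop terminate on arbitrary bytes.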
### 3. Added comprehensive tests

Added the following acceptance criteria tests:
- `numeric_integer_positive`: `123` -> `Integer(123)` ✓
- `numeric_integer_negative`: `-7` -> `Integer(-7)` ✓
- `numeric_real_simple`: `3.14` -> `Real(3.14)` ✓
- `numeric_real_negative_dot_then_digits`: `-.5` -> `Real(-0.5)` ✓
- `numeric_real_digits_then_dot`: `42.` -> `Real(42.0)` ✓
- `numeric_real_dot_then_digits`: `.001` -> `Real(0.001)` ✓
- `numeric_integer_positive_zero`: `+0` -> `Integer(0)` ✓
- `numeric_scientific_notation_rejected`: `1e5` -> `Integer(1)` followed by `Keyword(b"e5")` ✓
- `numeric_overflow_clamps_to_max`: Overflow -> `Integer(i64::MAX)` with `STRUCT_INTEGER_OVERFLOW` ✓
- `numeric_double_sign_emits_diagnostic`: `--5` -> `STRUCT_INVALID_NUMBER` ✓
- `numeric_multiple_dots_emits_diagnostic`: `1.2.3` -> `STRUCT_INVALID_NUMBER` ✓
- `numeric_bare_sign_emits_diagnostic`: `+` alone -> `STRUCT_INVALID_NUMBER` ✓
- `numeric_hex_notation_not_supported`: `0xFF` -> `Integer(0)` + `Keyword(b"xFF")` ✓
- `numeric_real_negative_dot_then_digits_with_boundary`: Boundary detection ✓
- `proptest_numeric_never_panics`: Property test for random byte sequences starting with numeric characters ✓

## Acceptance Criteria Status

- **Critical tests**: All PASS
  - `123` -> `Integer(123)` ✓
  - `-7` -> `Integer(-7)` ✓
  - `3.14` -> `Real(3.14)` ✓
  - `-.5` -> `Real(-0.5)` ✓
  - `42.` -> `Real(42.0)` ✓
  - `.001` -> `Real(0.001)` ✓
  - `+0` -> `Integer(0)` ✓
  - `1e5` -> `Integer(1)` followed by `Keyword(b"e5")` ✓
  - `99999999999999999999` (overflow) -> `Integer(i64::MAX)` with `STRUCT_INTEGER_OVERFLOW` ✓
  - `--5` -> diagnostic `STRUCT_INVALID_NUMBER` ✓

- **proptest property**: Random byte sequences starting with `+-0123456789.` never panic ✓
- **INV-8 maintained**: No unwrap/expect calls ✓

## Implementation Details

The numeric lexer uses a state machine that:
1. Consumes an optional leading sign (`+` or `-`)
2. Consumes digits before the decimal point (if any)
3. Loops over decimal points and the digits that follow them, so multiple dots are detected
4. Validates that at least one digit is present
5. Validates that a real number contains at most one dot
6. Stops at whitespace or a delimiter
7. Does NOT support scientific notation (`e`/`E` terminates the number)
8. Handles overflow by clamping to `i64::MAX` and emitting a diagnostic
9. Handles parse failures by returning default values with diagnostics

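Steps 4, 5, 8 and 9 can be sketched as a classification pass over the bytes that scanning has already delimited. The names are hypothetical throughout: `classify`, the `Token` shape, and the `Diagnostic` variants (standing in for `STRUCT_INVALID_NUMBER` and `STRUCT_INTEGER_OVERFLOW`). Using `Integer(0)` and `Real(0.0)` as the default values, and clamping negative overflow to `i64::MIN`, are also assumptions; the note does not specify either.

```rust
use std::num::IntErrorKind;

#[derive(Debug, PartialEq)]
enum Token {
    Integer(i64),
    Real(f64),
}

#[derive(Debug, PartialEq)]
enum Diagnostic {
    InvalidNumber,   // stands in for STRUCT_INVALID_NUMBER
    IntegerOverflow, // stands in for STRUCT_INTEGER_OVERFLOW
}

/// Classify the bytes of one numeric candidate (sign, digits, dots only).
fn classify(text: &[u8]) -> (Token, Option<Diagnostic>) {
    // Assumed default-with-diagnostic shape for any parse failure (step 9).
    let invalid = |default: Token| (default, Some(Diagnostic::InvalidNumber));

    let digits = text.iter().filter(|b| b.is_ascii_digit()).count();
    let dots = text.iter().filter(|&&b| b == b'.').count();

    // Steps 4 and 5: at least one digit, at most one dot.
    if digits == 0 || dots > 1 {
        return invalid(Token::Integer(0));
    }
    // No unwrap/expect: a non-UTF-8 candidate becomes a diagnostic.
    let s = match std::str::from_utf8(text) {
        Ok(s) => s,
        Err(_) => return invalid(Token::Integer(0)),
    };
    if dots == 1 {
        // Rust's f64 parser accepts `42.`, `.001` and `-.5` directly.
        return match s.parse::<f64>() {
            Ok(v) => (Token::Real(v), None),
            Err(_) => invalid(Token::Real(0.0)),
        };
    }
    match s.parse::<i64>() {
        Ok(v) => (Token::Integer(v), None),
        // Step 8: clamp on overflow and attach a diagnostic.
        Err(e) => match e.kind() {
            IntErrorKind::PosOverflow => {
                (Token::Integer(i64::MAX), Some(Diagnostic::IntegerOverflow))
            }
            // Assumption: the note only describes clamping to i64::MAX.
            IntErrorKind::NegOverflow => {
                (Token::Integer(i64::MIN), Some(Diagnostic::IntegerOverflow))
            }
            _ => invalid(Token::Integer(0)),
        },
    }
}
```

Delegating to the standard parsers keeps the sketch short. Because scanning is assumed to have stopped at `e`/`E` already (step 7), the exponent syntax that `f64` parsing would otherwise accept never reaches `classify`.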
## Files Modified

- `crates/pdftract-core/src/parser/lexer/mod.rs`: Fixed bugs and added tests
- `notes/pdftract-1jjn.md`: This verification note

## Test Results

All 105 lexer tests pass, including:
- All existing tests (regression check)
- All new numeric literal tests
- All proptests (including the new numeric proptest)

## Retrospective

### What worked
- The existing numeric lexer implementation was mostly correct and needed only minor fixes
- The loop-based approach for detecting multiple dots is clean and efficient
- The proptest caught the infinite loop bug with the `)` delimiter

### What didn't
- The initial implementation didn't include `.` in the match pattern, so `.001` was parsed as a keyword
- The single-`if` approach to dot handling missed multiple dots
- Bare sign handling didn't advance the lexer, causing infinite loops

### Surprise
- The proptest found an infinite loop bug with `)` that would have been difficult to catch with unit tests alone
- The PDF spec does not allow scientific (exponential) notation in numeric objects, unlike most other numeric formats

### Reusable pattern
- For future lexer work, always use property testing to catch infinite loop bugs
- When implementing state machines for tokenization, always ensure forward progress (advance the lexer) in all code paths
- Loop-based validation (e.g., for multiple dots) is more robust than single-pass checks
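
The forward-progress rule can be checked mechanically. The sketch below is a standard-library stand-in for a proptest property, not the project's actual test: `check_progress` and `pseudo_random_bytes` are hypothetical helpers, and `step` represents any function that takes the input and a position and returns the position after the next token.

```rust
/// Drive `step` across `input`. Returns the number of steps taken, or
/// `Err(pos)` with the position where the step function failed to advance.
fn check_progress<F>(input: &[u8], step: F) -> Result<usize, usize>
where
    F: Fn(&[u8], usize) -> usize,
{
    let mut pos = 0;
    let mut steps = 0;
    while pos < input.len() {
        let next = step(input, pos);
        // A stalled or out-of-range position would mean an infinite loop
        // or an out-of-bounds read in a real lexer.
        if next <= pos || next > input.len() {
            return Err(pos);
        }
        pos = next;
        steps += 1;
    }
    Ok(steps)
}

/// Small xorshift generator so the sketch needs no external crate.
fn pseudo_random_bytes(seed: u64, len: usize) -> Vec<u8> {
    let mut state = seed.max(1); // xorshift must not start at zero
    (0..len)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state >> 24) as u8
        })
        .collect()
}
```

A step function that returns `pos` unchanged on `+`, as the bare-sign bug did, is reported as `Err` at the index of the `+` instead of hanging the test run.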