pdftract/notes/pdftract-1jjn.md
jedarden f1c7f1296e feat(pdftract-1jjn): implement PDF numeric literal lexer with full edge case support
- Add `.` to match pattern for numbers starting with decimal point
- Fix bare sign handling to prevent infinite loops (+/- without digits)
- Fix multiple dots detection using loop instead of single if
- Add `)` delimiter handling to prevent infinite loops in proptests
- Add comprehensive acceptance criteria tests for all numeric formats
- Add proptest for numeric literal edge cases

Acceptance criteria PASS:
- 123 -> Integer(123)
- -7 -> Integer(-7)
- 3.14 -> Real(3.14)
- -.5 -> Real(-0.5)
- 42. -> Real(42.0)
- .001 -> Real(0.001)
- +0 -> Integer(0)
- 1e5 -> Integer(1) + Keyword(b"e5") (no scientific notation)
- Overflow -> Integer(i64::MAX) with STRUCT_INTEGER_OVERFLOW
- --5 -> STRUCT_INVALID_NUMBER diagnostic
- 1.2.3 -> STRUCT_INVALID_NUMBER diagnostic

All 105 lexer tests pass including new proptest.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:17:04 -04:00


pdftract-1jjn: PDF Numeric Literal Lexer Implementation

Summary

Implemented a PDF numeric literal lexer for integers and real numbers, with support for all edge cases specified in section 7.3.3 (Numeric Objects) of the PDF specification.

Changes Made

1. Fixed existing numeric lexer bugs

  • Added . to match pattern: The lexer now correctly recognizes numbers starting with . (e.g., .001)
  • Fixed bare sign handling: When a + or - is not followed by any digit, the lexer now emits STRUCT_INVALID_NUMBER and advances past the sign, which prevents an infinite loop
  • Fixed multiple dots detection: Changed from single if to while loop to properly detect multiple dots in malformed input like 1.2.3

2. Added ) delimiter handling

Added ) to the match statement in lex_next() to handle it as an unexpected byte when appearing outside of a string context. This prevents infinite loops in the proptest.
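The forward-progress problem behind this fix can be illustrated with a toy dispatcher. This is a minimal sketch, not the code in `mod.rs`: the `Tok` type, the helper names, and the choice to treat every delimiter as an unexpected byte are assumptions made for illustration (the real `lex_next()` handles `(` as a string start, among other cases).

```rust
#[derive(Debug, PartialEq)]
pub enum Tok {
    Keyword(Vec<u8>),
    Unexpected(u8), // hypothetical variant for a byte that starts no token
}

// PDF whitespace: NUL, HT, LF, FF, CR, SP.
fn is_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | b'\t' | b'\n' | 0x0C | b'\r' | b' ')
}

// PDF delimiter characters.
fn is_delimiter(b: u8) -> bool {
    matches!(b, b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%')
}

/// Toy stand-in for `lex_next()`: returns the next token and the position
/// after it, or `None` at end of input.
pub fn lex_next(input: &[u8], mut pos: usize) -> Option<(Tok, usize)> {
    while pos < input.len() && is_whitespace(input[pos]) {
        pos += 1;
    }
    let first = *input.get(pos)?;
    if is_delimiter(first) {
        // A stray delimiter such as `)` is consumed as one unexpected byte.
        // Without this arm the keyword scan below would match zero bytes
        // and the caller would be handed the same position forever.
        return Some((Tok::Unexpected(first), pos + 1));
    }
    let start = pos;
    while pos < input.len() && !is_whitespace(input[pos]) && !is_delimiter(input[pos]) {
        pos += 1;
    }
    Some((Tok::Keyword(input[start..pos].to_vec()), pos))
}

/// Drives `lex_next` to the end of input, checking progress on every step.
pub fn lex_all(input: &[u8]) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut pos = 0;
    while let Some((tok, next)) = lex_next(input, pos) {
        debug_assert!(next > pos, "lexer failed to make forward progress");
        out.push(tok);
        pos = next;
    }
    out
}
```

In this sketch, every path that returns `Some` moves the position forward by at least one byte, which is the property the fix restores for `)`.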

3. Added comprehensive tests

Added the following acceptance criteria tests:

  • numeric_integer_positive: 123 -> Integer(123)
  • numeric_integer_negative: -7 -> Integer(-7)
  • numeric_real_simple: 3.14 -> Real(3.14)
  • numeric_real_negative_dot_then_digits: -.5 -> Real(-0.5)
  • numeric_real_digits_then_dot: 42. -> Real(42.0)
  • numeric_real_dot_then_digits: .001 -> Real(0.001)
  • numeric_integer_positive_zero: +0 -> Integer(0)
  • numeric_scientific_notation_rejected: 1e5 -> Integer(1) followed by Keyword(b"e5")
  • numeric_overflow_clamps_to_max: Overflow -> Integer(i64::MAX) with STRUCT_INTEGER_OVERFLOW
  • numeric_double_sign_emits_diagnostic: --5 -> STRUCT_INVALID_NUMBER
  • numeric_multiple_dots_emits_diagnostic: 1.2.3 -> STRUCT_INVALID_NUMBER
  • numeric_bare_sign_emits_diagnostic: + alone -> STRUCT_INVALID_NUMBER
  • numeric_hex_notation_not_supported: 0xFF -> Integer(0) + Keyword(b"xFF")
  • numeric_real_negative_dot_then_digits_with_boundary: Boundary detection ✓
  • proptest_numeric_never_panics: Property test for random byte sequences starting with numeric characters ✓

Acceptance Criteria Status

  • Critical tests: All PASS

    • 123 -> Integer(123)
    • -7 -> Integer(-7)
    • 3.14 -> Real(3.14)
    • -.5 -> Real(-0.5)
    • 42. -> Real(42.0)
    • .001 -> Real(0.001)
    • +0 -> Integer(0)
    • 1e5 -> Integer(1) followed by Keyword(b"e5")
    • 99999999999999999999 (overflow) -> Integer(i64::MAX) with STRUCT_INTEGER_OVERFLOW
    • --5 -> diagnostic STRUCT_INVALID_NUMBER
  • proptest property: Random byte sequences starting with +-0123456789. never panic ✓

  • INV-8 maintained: No unwrap/expect calls ✓

Implementation Details

The numeric lexer uses a state machine that:

  1. Consumes optional leading sign (+ or -)
  2. Consumes digits before the decimal point (if any)
  3. Loops to consume decimal points and following digits (detects multiple dots)
  4. Validates at least one digit is present
  5. Validates at most one dot for real numbers
  6. Stops at whitespace or delimiters
  7. Does NOT support scientific notation (e/E terminates the number)
  8. Handles overflow by clamping to i64::MAX with diagnostic
  9. Handles parse failures by returning default values with diagnostics
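
The steps above can be sketched as a single function. This is an illustrative reconstruction under stated assumptions, not the code in `mod.rs`: the `Token` and `Diag` types, the `lex_number` name, the fallback values (`Integer(0)` / `Real(0.0)`), and the clamp to `i64::MIN` on negative overflow are all hypothetical. It also assumes the caller only dispatches here when the first byte is a sign, digit, or dot.

```rust
use std::num::IntErrorKind;

#[derive(Debug, PartialEq)]
pub enum Token {
    Integer(i64),
    Real(f64),
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Diag {
    InvalidNumber,   // stands in for STRUCT_INVALID_NUMBER
    IntegerOverflow, // stands in for STRUCT_INTEGER_OVERFLOW
}

/// Lexes one numeric literal from the start of `input`.
/// Returns the token, an optional diagnostic, and the bytes consumed.
pub fn lex_number(input: &[u8]) -> (Token, Option<Diag>, usize) {
    let mut pos = 0;
    // Step 1: optional leading sign.
    if matches!(input.first(), Some(b'+') | Some(b'-')) {
        pos += 1;
    }
    // Step 2: digits before the decimal point.
    let mut digits = 0usize;
    while pos < input.len() && input[pos].is_ascii_digit() {
        pos += 1;
        digits += 1;
    }
    // Step 3: loop over every dot and the digits after it, counting dots.
    let mut dots = 0usize;
    while pos < input.len() && input[pos] == b'.' {
        pos += 1;
        dots += 1;
        while pos < input.len() && input[pos].is_ascii_digit() {
            pos += 1;
            digits += 1;
        }
    }
    // Steps 4 and 5: at least one digit, at most one dot.
    // Steps 6 and 7 fall out of the loops: any other byte (whitespace,
    // delimiter, `e`, `x`) ends the number and is left for the next token.
    if digits == 0 || dots > 1 {
        let fallback = if dots == 0 { Token::Integer(0) } else { Token::Real(0.0) };
        return (fallback, Some(Diag::InvalidNumber), pos);
    }
    // The consumed bytes are ASCII, but no unwrap/expect is used (INV-8).
    let text = match std::str::from_utf8(&input[..pos]) {
        Ok(t) => t,
        Err(_) => return (Token::Integer(0), Some(Diag::InvalidNumber), pos),
    };
    if dots == 1 {
        return match text.parse::<f64>() {
            Ok(v) => (Token::Real(v), None, pos),
            Err(_) => (Token::Real(0.0), Some(Diag::InvalidNumber), pos),
        };
    }
    // Steps 8 and 9: clamp on overflow, fall back on any other failure.
    match text.parse::<i64>() {
        Ok(v) => (Token::Integer(v), None, pos),
        Err(e) => match e.kind() {
            IntErrorKind::PosOverflow => (Token::Integer(i64::MAX), Some(Diag::IntegerOverflow), pos),
            IntErrorKind::NegOverflow => (Token::Integer(i64::MIN), Some(Diag::IntegerOverflow), pos),
            _ => (Token::Integer(0), Some(Diag::InvalidNumber), pos),
        },
    }
}
```

For `1e5` this sketch consumes one byte and returns `Integer(1)`, leaving `e5` for the keyword path. For a bare `+` it consumes the sign and reports `InvalidNumber`, so the caller always sees progress.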

Files Modified

  • crates/pdftract-core/src/parser/lexer/mod.rs: Fixed bugs and added tests
  • notes/pdftract-1jjn.md: This verification note

Test Results

All 105 lexer tests pass, including:

  • All existing tests (regression check)
  • All new numeric literal tests
  • All proptests (including the new numeric proptest)

Retrospective

What worked

  • The existing numeric lexer implementation was mostly correct and needed only minor fixes
  • The loop-based approach for detecting multiple dots is clean and efficient
  • The proptest approach caught the infinite loop bug with ) delimiter

What didn't

  • Initial implementation didn't include . in the match pattern, causing .001 to be parsed as a keyword
  • The single-if approach for dot handling missed multiple dots
  • Bare sign handling didn't advance the lexer, causing infinite loops

Surprise

  • The proptest found an infinite loop bug with ) that would have been difficult to catch with unit tests alone
  • The PDF spec does not allow scientific (exponential) notation in numeric objects, unlike most other numeric formats

Reusable pattern

  • For future lexer work, always use property testing to catch infinite loop bugs
  • When implementing state machines for tokenization, always ensure forward progress (advance the lexer) in all code paths
  • Loop-based validation (e.g., for multiple dots) is more robust than a single if check, which sees only the first occurrence
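
The first two patterns can be combined into one forward-progress check. The sketch below is a std-only stand-in for a proptest property, not the project's `proptest_numeric_never_panics` test: the xorshift generator, `numeric_run_len`, and `find_progress_violation` are hypothetical names introduced here so the example runs without external crates.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical function under test: bytes consumed by one numeric-looking run
/// (optional sign, then any mix of digits and dots).
pub fn numeric_run_len(input: &[u8]) -> usize {
    let mut pos = 0;
    if matches!(input.first(), Some(b'+') | Some(b'-')) {
        pos += 1;
    }
    while pos < input.len() && (input[pos].is_ascii_digit() || input[pos] == b'.') {
        pos += 1;
    }
    pos
}

/// Generates `cases` random inputs that start with a numeric character and
/// returns the first one on which the scanner consumes zero bytes or runs
/// past the end, or `None` if the property holds for every case.
pub fn find_progress_violation(seed: u64, cases: usize) -> Option<Vec<u8>> {
    const STARTS: &[u8] = b"+-0123456789.";
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be nonzero
    for _ in 0..cases {
        let len = 1 + (rng.next_u64() % 16) as usize;
        let mut bytes = Vec::with_capacity(len);
        bytes.push(STARTS[(rng.next_u64() % STARTS.len() as u64) as usize]);
        while bytes.len() < len {
            bytes.push((rng.next_u64() & 0xFF) as u8);
        }
        let consumed = numeric_run_len(&bytes);
        if consumed == 0 || consumed > bytes.len() {
            return Some(bytes);
        }
    }
    None
}
```

Returning the failing input rather than panicking keeps the counterexample available for a regression test. A real proptest property would additionally shrink that input.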