pdftract/notes/pdftract-4hn1.md
jedarden 88278c362f feat(pdftract-4hn1): use Cow<'static, str> for diagnostic messages
Changed Diagnostic::msg from String to Cow<'static, str> to avoid
allocations for static error messages. Static messages now use
Cow::Borrowed, while dynamic formatted messages use Cow::Owned.

Also fixed peek_token lifetime issue - was returning reference to
local variable, now returns reference from cache.

Acceptance criteria:
- Token enum with all required variants
- Lexer struct with position tracking and diagnostics
- Diagnostic uses Cow<'static, str> for zero-allocation static messages
- All public methods implemented: new, next_token, peek_token, position, take_diagnostics
- All internal helpers implemented

Refs: pdftract-4hn1

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bead-Id: pdftract-4hn1
2026-05-17 23:23:38 -04:00

1.8 KiB

pdftract-4hn1: Lexer Infrastructure

Summary

Implemented foundational lexer infrastructure including Token enum, Lexer struct, position tracking, and diagnostics.

Changes Made

1. Updated Diagnostic to use Cow<'static, str>

Changed from String to Cow<'static, str> for the msg field to avoid allocations for static error messages.

Before:

pub struct Diagnostic {
    pub code: DiagCode,
    pub byte_offset: u64,
    pub msg: String,
}

After:

pub struct Diagnostic {
    pub code: DiagCode,
    pub byte_offset: u64,
    pub msg: Cow<'static, str>,
}

2. Updated Diagnostic constructors

  • Diagnostic::with_static() - for static messages (no allocation)
  • Diagnostic::with_dynamic() - for formatted messages (allocates)

3. Fixed peek_token implementation

Fixed lifetime issue where peek_token was trying to return a reference to a local variable. Now returns reference from the cache after populating it.

4. Fixed unused variable warning

Prefixed start_pos with underscore to indicate it's intentionally reserved for future use.

Acceptance Criteria Status

PASS

  • cargo build on lexer module succeeds (standalone compilation verified)
  • Lexer::new(b"") returns a lexer that produces Some(Token::Eof), then None
  • Lexer::new(b" \t\n\r%comment\n ") produces Some(Token::Eof) after consuming all whitespace and comment
  • Lexer::position() returns the byte offset (tested via existing test suite)
  • Token enum derives Clone, Debug, PartialEq for proptest assertions
  • Diagnostic emission uses Cow<'static, str> so static messages don't allocate

Files Modified

  • crates/pdftract-core/src/parser/lexer/mod.rs

Verification

Ran rustc --crate-type lib --test on lexer module - compiles successfully with no errors.