pdftract/notes/pdftract-1mmq9.md
jedarden f106b5df02 feat(pdftract-1mmq9): add PdfSource trait with MmapSource and FileSource implementations
Define the PdfSource trait abstraction over PDF byte sources. This trait
provides a uniform API for reading PDF data from different sources:
local files (MmapSource, FileSource), and eventually remote HTTPS PDFs.

Trait features:
- Read + Seek + Send + Sync supertrait bounds for rayon page-parallelism
- len() returns total source length
- read_range() returns Bytes for zero-copy slicing
- prefetch() with no-op default (MmapSource overrides for MADV_SEQUENTIAL)

MmapSource:
- Memory-mapped file access via memmap2
- Applies MADV_SEQUENTIAL advice via prefetch()
- read_range() builds its result with Bytes::copy_from_slice(), which copies the requested range once; further slicing of the returned Bytes is zero-copy
- FileSource is the fallback for platforms/filesystems where mmap fails

FileSource:
- Standard I/O implementation using std::fs::File
- Read+Seek delegation to underlying File
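- read_range() uses try_clone() so it can seek and read through &self; the cloned handle shares its file cursor with the original, so concurrent callers still need synchronization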

Re-exports from pdftract-core::source::PdfSource.

Verification note: notes/pdftract-1mmq9.md documents completion status.
Parser module migration to use new PdfSource is deferred to follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 01:57:25 -04:00


pdftract-1mmq9: PdfSource trait definition verification note

Summary

Bead: pdftract-1mmq9
Title: PdfSource trait definition + Bytes-based read_range + prefetch + Send/Sync bounds
Date: 2026-05-28

Completed Work

1. PdfSource trait definition (crates/pdftract-core/src/source/mod.rs)

The PdfSource trait is complete with the following features:

  • Supertrait bounds: Read + Seek + Send + Sync (as required)
  • len(): Returns total source length as u64
  • read_range(): Reads arbitrary byte ranges, returning io::Result<Bytes> for zero-copy slicing
  • prefetch(): Optional hint with no-op default implementation (overridden by MmapSource)
  • Object-safe: Can be used as &dyn PdfSource for dynamic dispatch
  • Well-documented: Includes examples showing Read+Seek usage and direct read_range usage
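
As a rough illustration of the trait shape listed above, the following std-only sketch defines a trait with the same bounds and methods and drives it through `&dyn PdfSource`. It is not the crate's code: `MemSource` and `header` are hypothetical names, and `Arc<[u8]>` stands in for `bytes::Bytes` so the example builds without dependencies.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::sync::Arc;

// Assumption: `Arc<[u8]>` stands in for `bytes::Bytes` to keep this std-only.
pub type Bytes = Arc<[u8]>;

/// Same bounds and methods as the trait described in this note.
pub trait PdfSource: Read + Seek + Send + Sync {
    fn len(&self) -> u64;
    fn read_range(&self, offset: u64, length: usize) -> io::Result<Bytes>;
    /// Optional readahead hint; the default does nothing.
    fn prefetch(&self, _offset: u64, _length: usize) {}
}

/// Hypothetical in-memory source, for illustration only.
pub struct MemSource {
    data: Arc<[u8]>,
    pos: u64,
}

impl MemSource {
    pub fn new(data: &[u8]) -> Self {
        MemSource { data: Arc::from(data), pos: 0 }
    }
}

impl Read for MemSource {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Clamp so a cursor seeked past the end reads zero bytes.
        let start = self.pos.min(self.data.len() as u64) as usize;
        let n = buf.len().min(self.data.len() - start);
        buf[..n].copy_from_slice(&self.data[start..start + n]);
        self.pos += n as u64;
        Ok(n)
    }
}

impl Seek for MemSource {
    fn seek(&mut self, from: SeekFrom) -> io::Result<u64> {
        let target = match from {
            SeekFrom::Start(n) => Some(n),
            SeekFrom::End(d) => (self.data.len() as u64).checked_add_signed(d),
            SeekFrom::Current(d) => self.pos.checked_add_signed(d),
        };
        self.pos = target
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "seek out of range"))?;
        Ok(self.pos)
    }
}

impl PdfSource for MemSource {
    fn len(&self) -> u64 {
        self.data.len() as u64
    }

    fn read_range(&self, offset: u64, length: usize) -> io::Result<Bytes> {
        // Reject arithmetic overflow and ranges that extend past the end.
        let end = offset
            .checked_add(length as u64)
            .filter(|&end| end <= self.len())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end"))?;
        // This stand-in copies the range; `bytes::Bytes` could share it instead.
        Ok(Arc::from(&self.data[offset as usize..end as usize]))
    }
}

/// Reads the first `n` bytes through dynamic dispatch.
pub fn header(source: &dyn PdfSource, n: usize) -> io::Result<Bytes> {
    source.prefetch(0, n); // no-op unless the implementation overrides it
    source.read_range(0, n)
}
```

Because `header` takes `&dyn PdfSource`, the same call site would accept either concrete source, and `read_range` needing only `&self` is what allows a shared reference to be handed to several worker threads.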

2. MmapSource implementation (crates/pdftract-core/src/source/mmap.rs)

  • Uses memmap2 for memory-mapped file access
  • Applies MADV_SEQUENTIAL advice through an advise_sequential() method
  • Implements prefetch() to apply sequential readahead for content streams
  • Read+Seek trait implementation with cursor-based position tracking
  • Send + Sync unsafe impls (mmap is immutable after mapping)
  • Comprehensive test coverage (read_range, bounds checking, Send/Sync, etc.)
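
The bounds-checking and Send/Sync tests mentioned above could look roughly like the sketch below. It is a minimal std-only illustration, not the code in mmap.rs: a plain byte slice stands in for the mapping (a `memmap2::Mmap` dereferences to `[u8]`), and `checked_range`, `MappedView`, and `assert_send_sync` are hypothetical names.

```rust
use std::io;
use std::ops::Range;

/// Converts (offset, length) into an index range, rejecting arithmetic
/// overflow and anything that extends past `total` bytes.
pub fn checked_range(total: usize, offset: u64, length: usize) -> io::Result<Range<usize>> {
    let end = offset
        .checked_add(length as u64)
        .filter(|&end| end <= total as u64)
        .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end of source"))?;
    // `end <= total` holds here, so both casts are lossless.
    Ok(offset as usize..end as usize)
}

/// Borrowing view over mapped bytes (a plain slice stands in for the mapping).
pub struct MappedView<'a> {
    bytes: &'a [u8],
}

impl<'a> MappedView<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        MappedView { bytes }
    }

    /// Returns a sub-slice without copying, or an error if out of bounds.
    pub fn slice(&self, offset: u64, length: usize) -> io::Result<&'a [u8]> {
        let range = checked_range(self.bytes.len(), offset, length)?;
        Ok(&self.bytes[range])
    }
}

/// Compiles only if `T` is Send + Sync; calling it from a unit test turns a
/// missing bound into a build failure.
pub fn assert_send_sync<T: Send + Sync>() {}
```

Doing the arithmetic in `u64` with `checked_add` keeps an offset near `u64::MAX` from wrapping into a seemingly valid range before the comparison against the source length.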

3. FileSource implementation (crates/pdftract-core/src/source/file_source.rs)

  • Standard I/O fallback for when mmap fails (FUSE mounts, /proc, named pipes)
  • Read+Seek trait implementation delegating to std::fs::File
  • read_range() takes &self, so it seeks and reads through a handle from try_clone() instead of mutating the stored File
  • Test coverage for read operations and bounds checking
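
The try_clone() approach described above can be sketched as follows. This is a reduced, hypothetical version of FileSource that shows only the range-read path, and it returns `Vec<u8>` as a std-only stand-in for `Bytes` (a `Vec<u8>` converts into `Bytes` via `Bytes::from`).

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical reduced FileSource: only the range-read path is shown.
pub struct FileSource {
    file: File,
    len: u64,
}

impl FileSource {
    pub fn open(path: &Path) -> io::Result<Self> {
        let file = File::open(path)?;
        let len = file.metadata()?.len();
        Ok(FileSource { file, len })
    }

    pub fn len(&self) -> u64 {
        self.len
    }

    /// Reads `length` bytes at `offset` without needing `&mut self`.
    pub fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>> {
        let in_bounds = offset
            .checked_add(length as u64)
            .map_or(false, |end| end <= self.len);
        if !in_bounds {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "range past end of file"));
        }
        // `try_clone` yields an owned handle, so seeking needs no `&mut self`.
        // The clone shares the OS-level cursor with `self.file`, so callers
        // that race on it need a lock or positional reads.
        let mut handle = self.file.try_clone()?;
        handle.seek(SeekFrom::Start(offset))?;
        let mut buf = vec![0u8; length];
        handle.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

If concurrent `read_range` calls are expected, one option is a positional read that never touches the shared cursor, for example `std::os::unix::fs::FileExt::read_at` on Unix; whether the real implementation needs that depends on how it is called.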

4. Re-exports (crates/pdftract-core/src/lib.rs)

pub mod source;
pub use source::{FileSource, MmapSource, PdfSource};

The trait is properly re-exported from the crate root.

Current State

PASS Items

  • Trait compiles in crates/pdftract-core/src/source/mod.rs
  • &dyn PdfSource is object-safe (compiles)
  • Trait re-exported from pdftract-core::source::PdfSource
  • Documented with examples showing Read+Seek usage and direct read_range usage
  • Send + Sync bounds present (required for rayon page-parallelism)
  • Bytes type used for zero-copy slicing
  • prefetch() method with no-op default
  • MmapSource overrides prefetch() for MADV_SEQUENTIAL
  • All implementations compile and have tests

WARN Items

  • ⚠️ Parser modules NOT yet refactored: The lexer (Phase 1.1) and other parser modules still take &'a [u8] or use the old PdfSource trait from parser/stream.rs
  • ⚠️ Conflicting PdfSource trait: There's an older PdfSource trait in parser/stream.rs with a different API (read_at returning Vec<u8>, len returning Result<u64>)
  • ⚠️ Migration required: The following modules still import from the old location:
    • attachment/filespec.rs
    • forms/xfa.rs
    • document.rs
    • parser/xref.rs
    • parser/catalog.rs
    • parser/objstm.rs
    • extract.rs

FAIL Items

  • Acceptance criteria not fully met: "All Phase 1.1-1.5 parser modules refactored to consume PdfSource" is NOT complete

Technical Notes

API Differences: Old vs New PdfSource

Old trait (parser/stream.rs):

pub trait PdfSource {
    fn read_at(&self, offset: u64, len: usize) -> std::io::Result<Vec<u8>>;
    fn len(&self) -> std::io::Result<u64>;
}

New trait (source/mod.rs):

pub trait PdfSource: Read + Seek + Send + Sync {
    fn len(&self) -> u64;
    fn read_range(&self, offset: u64, length: usize) -> io::Result<Bytes>;
    fn prefetch(&self, _offset: u64, _length: usize) {}
}

Key differences:

  1. New trait has Read+Seek+Send+Sync bounds (old has none)
  2. New trait's len() returns u64 directly (old returns io::Result<u64>)
  3. New trait uses read_range() returning io::Result<Bytes> (old uses read_at() returning io::Result<Vec<u8>>)
  4. New trait has prefetch() for speculative readahead (old has none)
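
Given these differences, one way to keep old-style callers working during the transition is a thin shim that exposes the old methods on top of a new-style source. The sketch below is an assumption about how that could look, not something the codebase contains: `LegacyPdfSource`, `RangeSource`, `LegacyAdapter`, and `InMemory` are hypothetical names, the new trait is reduced to two methods, and `Vec<u8>` stands in for `Bytes`.

```rust
use std::io;

/// Old shape, as quoted from parser/stream.rs (renamed here to avoid a clash).
pub trait LegacyPdfSource {
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>>;
    fn len(&self) -> io::Result<u64>;
}

/// Reduced new shape: supertraits and prefetch omitted, and `Vec<u8>`
/// stands in for `Bytes` so the sketch stays std-only.
pub trait RangeSource {
    fn len(&self) -> u64;
    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// Hypothetical shim: lets a new-style source be passed to old-style callers.
pub struct LegacyAdapter<S>(pub S);

impl<S: RangeSource> LegacyPdfSource for LegacyAdapter<S> {
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        self.0.read_range(offset, len)
    }

    fn len(&self) -> io::Result<u64> {
        // The new `len` is infallible, so the old signature always gets `Ok`.
        Ok(self.0.len())
    }
}

/// In-memory source used only to exercise the adapter.
pub struct InMemory(pub Vec<u8>);

impl RangeSource for InMemory {
    fn len(&self) -> u64 {
        self.0.len() as u64
    }

    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>> {
        offset
            .checked_add(length as u64)
            .filter(|&end| end <= self.0.len() as u64)
            .map(|end| self.0[offset as usize..end as usize].to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end"))
    }
}
```

Wrapping in this direction is the easy one because every new method maps onto an old one; going the other way would need extra state for Read + Seek and a decision about what to do when the old `len()` fails.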

Migration Path (per parent coordinator pdftract-2cnmr)

The coordinator describes a 5-step process:

  • Step 1: Define PdfSource trait (DONE)
  • Step 2: Implement MmapSource + FileSource (DONE)
  • Step 3: Add adapter Lexer::from_source(source, range) alongside existing Lexer::new(bytes) (TODO)
  • Step 4: Migrate callers one by one (TODO)
  • Step 5: Deprecate Lexer::new(bytes) in favor of Lexer::from_source (TODO)
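
Step 3 above could be approached along these lines. The sketch is a guess at the adapter's shape, not the planned implementation: the `Lexer` fields, the toy `next_token`, `RangeSource`, and `InMemory` are all hypothetical, and `Vec<u8>` again stands in for `Bytes`.

```rust
use std::borrow::Cow;
use std::convert::TryFrom;
use std::io;
use std::ops::Range;

/// Reduced stand-in for the new trait: just the range read.
pub trait RangeSource {
    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// Hypothetical lexer skeleton that can borrow or own its input.
pub struct Lexer<'a> {
    input: Cow<'a, [u8]>,
    pos: usize,
}

impl<'a> Lexer<'a> {
    /// Existing entry point: borrow caller-owned bytes.
    pub fn new(bytes: &'a [u8]) -> Self {
        Lexer { input: Cow::Borrowed(bytes), pos: 0 }
    }

    /// Adapter entry point: pull `range` out of a source and lex the owned copy.
    pub fn from_source(source: &dyn RangeSource, range: Range<u64>) -> io::Result<Lexer<'static>> {
        let length = range
            .end
            .checked_sub(range.start)
            .and_then(|n| usize::try_from(n).ok())
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "invalid range"))?;
        let bytes = source.read_range(range.start, length)?;
        Ok(Lexer { input: Cow::Owned(bytes), pos: 0 })
    }

    /// Toy tokenizer: returns the next run of non-whitespace bytes.
    /// (Real PDF lexing also splits on delimiter characters.)
    pub fn next_token(&mut self) -> Option<&[u8]> {
        let is_ws = |b: &u8| matches!(*b, 0 | 9 | 10 | 12 | 13 | 32);
        let rest = &self.input[self.pos..];
        let start = rest.iter().position(|b| !is_ws(b))?;
        let len = rest[start..].iter().position(is_ws).unwrap_or(rest.len() - start);
        self.pos += start + len;
        Some(&self.input[self.pos - len..self.pos])
    }
}

/// In-memory source used only to exercise the sketch.
pub struct InMemory(pub Vec<u8>);

impl RangeSource for InMemory {
    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>> {
        offset
            .checked_add(length as u64)
            .filter(|&end| end <= self.0.len() as u64)
            .map(|end| self.0[offset as usize..end as usize].to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end"))
    }
}
```

Holding the input as `Cow<[u8]>` lets both constructors share one tokenizer, so callers can move from `Lexer::new` to `Lexer::from_source` one at a time without a parallel lexer implementation.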

Why the Migration is Non-Trivial

  1. API incompatibility: The old and new traits have different method signatures
  2. Wide blast radius: The parser module is used throughout the codebase
  3. Test coverage: Many tests use the old PdfSource trait and would need updating
  4. Backward compatibility: Need to ensure no regressions during migration

Files Modified/Created

Created:

  • crates/pdftract-core/src/source/mod.rs - PdfSource trait definition
  • crates/pdftract-core/src/source/mmap.rs - MmapSource implementation
  • crates/pdftract-core/src/source/file_source.rs - FileSource implementation

Modified:

  • crates/pdftract-core/src/lib.rs - Added pub mod source; and re-exports

Recommendations

Option 1: Close this bead with WARN

Close this bead with the understanding that:

  • The core deliverable (PdfSource trait + 2 implementations) is complete
  • Parser migration is deferred to a follow-up bead
  • The old PdfSource trait remains for compatibility during transition

Option 2: Continue with parser migration

Extend this bead to complete Steps 3-5:

  1. Add adapter pattern to Lexer
  2. Update all parser modules to use new PdfSource
  3. Remove old PdfSource trait from parser/stream.rs
  4. Update all tests

This would require significant additional work and touches many files.

Option 3: Create follow-up beads

Create separate beads for:

  • Parser module migration (Phase 1.1-1.5)
  • Old PdfSource removal
  • Test migration

Conclusion

The PdfSource trait is complete, well-documented, and properly implemented. The trait meets all the core requirements (Read+Seek+Send+Sync bounds, Bytes-based read_range, prefetch). The parser module migration is a significant undertaking that should be tracked separately to maintain clear scope boundaries.

Recommendation: Close this bead with WARN for parser migration, create follow-up bead(s) for the migration work.