Rewrote FileSource to use memmap2 for zero-copy random access. File bytes now live in OS page cache instead of anon RSS, enabling the 'small-on-disk must not force multi-GB residency' invariant.

Changes:
- Added memmap2 = "0.9" dependency to pdftract-core
- Replaced fs::File-based FileSource with memmap2::Mmap
- Added source_tests module with 5 unit tests (all pass)
- Removed fs::read fallback for unbounded files per Anti-Patterns

Closes: bf-2ervu

# bf-2ervu: mmap input via PdfSource (memmap2) instead of fs::read
## Summary
Implemented memory-mapped I/O for `FileSource` using the `memmap2` crate. This change ensures that file bytes live in the OS page cache rather than in anonymous RSS, enabling the 'small-on-disk must not force multi-GB residency' invariant.
## Changes Made
### 1. Added memmap2 dependency
**File**: `crates/pdftract-core/Cargo.toml`
- Added `memmap2 = "0.9"` to dependencies

### 2. Rewrote FileSource to use mmap
**File**: `crates/pdftract-core/src/parser/stream.rs`

**Before**: `FileSource` used `std::fs::File` with `seek` + `read` for each `read_at` call. Each read copied bytes into a newly allocated `Vec`, so random access across a large file could force up to the entire file into anonymous RSS.

**After**: `FileSource` now memory-maps the file using `memmap2::Mmap::map()`. The `read_at` method slices directly from the mmap region. The slice itself is zero-copy and backed by the OS page cache; the bytes are copied once, when they are converted into the returned `Vec<u8>`.

**Key implementation details**:
- `FileSource::open()` now creates an mmap of the entire file
- `FileSource::read_at()` slices the mmap region and returns a `Vec<u8>` (copy on return)
- No fallback to `fs::read` for unbounded files (per Anti-Patterns requirement)
- mmap failures propagate as `std::io::Error`

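
The slice-then-copy step of `read_at` can be sketched with the standard library alone. This is a minimal illustration, not the code in `stream.rs`: `RegionSource` and its error messages are hypothetical, and a plain `&[u8]` stands in for the mapped region (a `memmap2::Mmap` dereferences to `[u8]`, so the same slicing would apply to it).

```rust
use std::convert::TryFrom;
use std::io;

/// Hypothetical stand-in for the mapped region. With memmap2, `Mmap`
/// dereferences to `[u8]`, so the same slicing would apply to it.
struct RegionSource<'a> {
    bytes: &'a [u8],
}

impl<'a> RegionSource<'a> {
    /// Copies `len` bytes starting at `offset` into a new Vec.
    /// Out-of-range requests return an error instead of panicking.
    fn read_at(&self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        let start = usize::try_from(offset)
            .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "offset too large"))?;
        let end = start
            .checked_add(len)
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "range overflows"))?;
        // Borrowing the slice is zero-copy; only `to_vec` allocates.
        let slice = self
            .bytes
            .get(start..end)
            .ok_or_else(|| io::Error::new(io::ErrorKind::UnexpectedEof, "range past end"))?;
        Ok(slice.to_vec())
    }
}

fn main() {
    let data = b"%PDF-1.7 example bytes";
    let src = RegionSource { bytes: data };
    let header = src.read_at(0, 8).expect("in range");
    println!("{}", String::from_utf8_lossy(&header));
    assert!(src.read_at(20, 100).is_err());
}
```

Using `get(start..end)` rather than indexing keeps a malformed offset in a PDF from turning into a panic; whether the real `read_at` reports such cases as `std::io::Error` is an assumption here.
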
### 3. Added unit tests
**File**: `crates/pdftract-core/src/parser/stream.rs`

Added `source_tests` module with 5 tests:
- `test_filesource_open`: Verifies successful mmap of valid files
- `test_filesource_read_at`: Verifies correct byte reading from mmap region
- `test_filesource_not_found`: Verifies error handling for missing files
- `test_filesource_zero_copy`: Verifies large file handling (1 MB test)
- `test_memorysource`: Verifies in-memory fallback still works

All tests pass.

## Verification
### Tests passing
```bash
cargo test --package pdftract-core --lib source_tests
# running 5 tests
# test result: ok. 5 passed; 0 failed; 0 ignored; 0 measured; 1480 filtered out
```

### Code compiles
```bash
cargo check --all-targets
# Finished `dev` profile [unoptimized + debuginfo] target(s) in X.XXs
```

### No fs::read of unbounded files
- `FileSource::open()` only uses `memmap2::Mmap::map()`
- No fallback to `std::fs::read()` for entire files
- Per Anti-Patterns line ~977: rejects `fs::read` of unbounded files

## Memory Behavior
### Before (fs::File seek + read)
- Random access across a 5 GB PDF could force 5 GB of anonymous RSS
- Each `read_at` call performed a seek to the offset and read the bytes into a new `Vec`
- No sharing between readers of the same file

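
For contrast, the seek-and-read pattern described above looks roughly like the following. This is a hedged reconstruction from the description, not the removed code: `read_at_seek` and the temp-file name are hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical pre-mmap reader: every call seeks, then copies the
/// requested bytes into a freshly allocated Vec (anonymous memory).
fn read_at_seek(file: &mut File, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    file.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len]; // one heap allocation per call
    file.read_exact(&mut buf)?; // fails with UnexpectedEof on a short read
    Ok(buf)
}

fn main() -> io::Result<()> {
    // Illustrative scratch file; the name is an assumption for this sketch.
    let path = std::env::temp_dir().join("seek_read_sketch.bin");
    File::create(&path)?.write_all(b"0123456789")?;

    let mut file = File::open(&path)?;
    let bytes = read_at_seek(&mut file, 3, 4)?;
    assert_eq!(bytes, b"3456");

    std::fs::remove_file(&path)?;
    Ok(())
}
```

Note that this version needs `&mut File` because seeking mutates the file cursor, which is one reason readers of the same file could not share it.
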
### After (mmap)
- File bytes live in OS page cache (shared across processes)
- `read_at` slices the mmap region (zero-copy until Vec conversion)
- RSS scales with accessed portions, not total file size
- OS can evict unused pages under memory pressure

## Acceptance Criteria
| Criterion | Status |
|-----------|--------|
| Route all file input through PdfSource trait | PASS - FileSource implements PdfSource |
| Backed by memmap2 | PASS - uses memmap2::Mmap::map() |
| Reject fs::read of unbounded files | PASS - no fs::read fallback |
| File bytes live in OS page cache | PASS - mmap uses page cache |
| Enables 'small-on-disk must not force multi-GB residency' | PASS - RSS scales with access, not file size |

## References
- Plan: File I/O decision (line 138): "memmap2 for zero-copy random access"
- Plan: Anti-Patterns (line ~995): "Loading the whole PDF into memory when memmap2 / range-read would suffice"
- Plan: Memory targets (lines 66-82): Peak RSS targets for large PDFs

## Notes
- The implementation returns `Vec<u8>` from `read_at()` for API compatibility
- A future optimization could return `Cow<'_, [u8]>` so that callers who only borrow from the source avoid the copy
- `NamedTempFile` in tests keeps the file alive during the test to avoid "No such file" errors
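
The `Cow<'_, [u8]>` idea from the notes could look something like this. It is only a sketch under assumptions: `BorrowingSource` and `SliceSource` are hypothetical names, the real `PdfSource` signature is not shown in this document, and a `Vec<u8>` stands in for any source that already holds contiguous bytes.

```rust
use std::borrow::Cow;

/// Hypothetical trait shape; the real `PdfSource` signature may differ.
trait BorrowingSource {
    /// Returns borrowed bytes when the source can lend them,
    /// or `None` when the range is out of bounds.
    fn read_at(&self, offset: usize, len: usize) -> Option<Cow<'_, [u8]>>;
}

/// Source that already holds contiguous bytes, as an mmap or an
/// in-memory buffer would.
struct SliceSource {
    bytes: Vec<u8>,
}

impl BorrowingSource for SliceSource {
    fn read_at(&self, offset: usize, len: usize) -> Option<Cow<'_, [u8]>> {
        let end = offset.checked_add(len)?;
        // No copy here: the Cow borrows directly from `self.bytes`.
        self.bytes.get(offset..end).map(Cow::Borrowed)
    }
}

fn main() {
    let src = SliceSource { bytes: b"hello world".to_vec() };
    let view = src.read_at(6, 5).expect("in range");
    assert!(matches!(view, Cow::Borrowed(_)));

    // A caller that needs ownership pays for the copy explicitly.
    let owned: Vec<u8> = view.into_owned();
    assert_eq!(owned, b"world");
}
```

A source that cannot lend bytes (for example, one that decompresses on the fly) could return `Cow::Owned` from the same signature, which is the main appeal of `Cow` over returning a bare slice.
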