Define the PdfSource trait abstraction over PDF byte sources.

This trait provides a uniform API for reading PDF data from different
sources: local files (MmapSource, FileSource), and eventually remote
HTTPS PDFs.

Trait features:
- Read + Seek + Send + Sync supertrait bounds for rayon page-parallelism
- len() returns total source length
- read_range() returns Bytes for zero-copy slicing
- prefetch() with no-op default (MmapSource overrides for MADV_SEQUENTIAL)

MmapSource:
- Memory-mapped file access via memmap2
- Applies MADV_SEQUENTIAL advice via prefetch()
- read_range() built on Bytes::copy_from_slice(), which copies the
  requested range out of the mapping (slices of the returned Bytes are
  zero-copy)
- Fallback for platforms/filesystems where mmap fails

FileSource:
- Standard I/O implementation using std::fs::File
- Read+Seek delegation to underlying File
- read_range() uses try_clone() for thread-safe concurrent access

Re-exports from pdftract-core::source::PdfSource.

Verification note: notes/pdftract-1mmq9.md documents completion status.
Parser module migration to use new PdfSource is deferred to follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
116 lines
3.7 KiB
Rust
//! PDF source abstraction.
//!
//! This module defines the `PdfSource` trait, which abstracts over different
//! sources of PDF byte data (local files, memory-mapped files, remote HTTP sources).
//! The trait provides a uniform API for parsers to read PDF data regardless of
//! the underlying storage mechanism.
//!
//! # Example
//!
//! ```ignore
//! use pdftract_core::source::PdfSource;
//! use std::io::Read;
//!
//! // Read using the Read+Seek adapter (standard I/O trait pattern).
//! // `Read` takes `&mut self`, so the source is borrowed mutably.
//! fn read_header(source: &mut dyn PdfSource) -> std::io::Result<String> {
//!     let mut buffer = vec![0u8; 1024];
//!     // `read` may return fewer bytes than requested; keep only what was read.
//!     let n = source.read(&mut buffer)?;
//!     buffer.truncate(n);
//!     Ok(String::from_utf8_lossy(&buffer).into_owned())
//! }
//!
//! // Read using direct read_range (returns `Bytes`).
//! fn read_xref(source: &dyn PdfSource, offset: u64) -> std::io::Result<bytes::Bytes> {
//!     // Clamp the request so it never runs past the end of the source.
//!     let length = source.len().saturating_sub(offset).min(4096) as usize;
//!     source.read_range(offset, length)
//! }
//! ```

use bytes::Bytes;
use std::fs::File;
use std::io::{self, Read, Seek};
use std::path::Path;

/// Abstraction over PDF byte sources.
///
/// This trait provides a uniform interface for reading PDF data from different
/// sources: local files (`MmapSource`, `FileSource`), memory buffers, and remote
/// HTTP sources (`HttpRangeSource` in Phase 1.8).
///
/// # Object safety
///
/// The trait is object-safe, allowing `&dyn PdfSource` to be used for dynamic
/// dispatch. This is important for APIs that need to accept any source type
/// at runtime.
///
/// # Thread safety
///
/// All sources must be `Send + Sync` to support rayon page-parallelism in
/// Phase 3+. Multiple threads may read from the same source concurrently
/// through the `&self` methods (`len`, `read_range`, `prefetch`); the `Read`
/// and `Seek` methods take `&mut self` and therefore need exclusive access.
///
/// # Example: Read+Seek adapter
///
/// ```ignore
/// use pdftract_core::source::PdfSource;
/// use std::io::{self, Read, Seek};
///
/// fn parse_trailer(source: &mut dyn PdfSource) -> io::Result<Vec<u8>> {
///     let mut buffer = Vec::new();
///     // Start at the last 1024 bytes, or at byte 0 if the source is shorter;
///     // `SeekFrom::End(-1024)` fails on a source smaller than 1024 bytes.
///     let start = source.len().saturating_sub(1024);
///     source.seek(io::SeekFrom::Start(start))?;
///     source.read_to_end(&mut buffer)?;
///     Ok(buffer)
/// }
/// ```
///
/// # Example: Direct read_range
///
/// ```ignore
/// use pdftract_core::source::PdfSource;
/// use std::io;
///
/// fn read_xref_section(source: &dyn PdfSource, offset: u64) -> io::Result<bytes::Bytes> {
///     // Clamp the request so it never runs past the end of the source.
///     let length = source.len().saturating_sub(offset).min(4096) as usize;
///     // Read the range directly as `Bytes`.
///     source.read_range(offset, length)
/// }
/// ```
pub trait PdfSource: Read + Seek + Send + Sync {
    /// Total length of the source in bytes.
    ///
    /// This must return the exact byte length of the PDF source. For file-backed
    /// sources, this is the file size. For HTTP sources, this is the Content-Length.
    fn len(&self) -> u64;

    /// Read `length` bytes starting at `offset`.
    ///
    /// Returns a `Bytes` object for zero-copy slicing. The returned `Bytes` may
    /// be a view into the source's internal buffer (for memory-mapped or cached
    /// sources), so cloning the `Bytes` is cheap.
    ///
    /// # Bounds
    ///
    /// The range must satisfy `offset + length <= len()`. If it extends past the
    /// end of the source, an `io::Error` of kind `InvalidInput` is returned.
    ///
    /// # Example
    ///
    /// ```ignore
    /// use pdftract_core::source::PdfSource;
    ///
    /// let data = source.read_range(100, 512)?;
    /// assert_eq!(data.len(), 512);
    /// ```
    fn read_range(&self, offset: u64, length: usize) -> io::Result<Bytes>;

    /// Optional hint to pre-fetch a range.
    ///
    /// For local sources the OS manages paging via the page cache, so
    /// `FileSource` keeps the no-op default; `MmapSource` overrides it to
    /// apply `MADV_SEQUENTIAL` advice to the mapping.
    ///
    /// For remote HTTP sources (`HttpRangeSource`, Phase 1.8), this issues a
    /// speculative Range request to warm the cache for upcoming reads.
    ///
    /// The default implementation is a no-op.
    fn prefetch(&self, _offset: u64, _length: usize) {}
}

mod file_source;
mod mmap;

pub use file_source::FileSource;
pub use mmap::MmapSource;
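
The trait docs list memory buffers among the possible sources, although only `FileSource` and `MmapSource` are re-exported here. As a minimal sketch of the `read_range` bounds contract for such a source, the block below uses hypothetical names (`RangeSource`, `MemorySource`) that are not part of pdftract-core. `RangeSource` is a trimmed stand-in for `PdfSource`: it drops the `Read + Seek` supertraits and returns `Vec<u8>` instead of `bytes::Bytes`, purely so the sketch builds with the standard library alone.

```rust
use std::io;

/// Hypothetical, trimmed stand-in for `PdfSource`. It keeps only the
/// `&self` methods and returns `Vec<u8>` rather than `bytes::Bytes`.
pub trait RangeSource: Send + Sync {
    fn len(&self) -> u64;
    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>>;
    fn prefetch(&self, _offset: u64, _length: usize) {}
}

/// Hypothetical in-memory source backed by an owned buffer.
pub struct MemorySource {
    data: Vec<u8>,
}

impl MemorySource {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data }
    }
}

impl RangeSource for MemorySource {
    fn len(&self) -> u64 {
        self.data.len() as u64
    }

    fn read_range(&self, offset: u64, length: usize) -> io::Result<Vec<u8>> {
        // `checked_add` rejects ranges whose end would overflow u64;
        // the filter rejects ranges that run past the end of the buffer.
        let end = offset
            .checked_add(length as u64)
            .filter(|&end| end <= self.len())
            .ok_or_else(|| {
                io::Error::new(io::ErrorKind::InvalidInput, "range exceeds source length")
            })?;
        // Both bounds are <= data.len(), so the casts cannot truncate.
        Ok(self.data[offset as usize..end as usize].to_vec())
    }
}
```

The overflow check matters because `offset + length` is computed from caller-supplied values; a real implementation returning `Bytes` could hand out a slice of a shared buffer instead of copying.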
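
The commit message says `FileSource::read_range()` relies on `try_clone()` for concurrent access. The standard library documents that a cloned `File` shares its cursor with the original, so seek-then-read sequences on clones can interleave across threads. One alternative, shown here as a sketch and not as the author's implementation, is a positional read that never touches the cursor. `read_range_at` is a hypothetical helper, it returns `Vec<u8>` instead of `Bytes` to stay within the standard library, and it assumes a Unix target (`std::os::unix::fs::FileExt`); Windows offers `seek_read` on its own `FileExt` trait.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // Unix-only assumption for this sketch

/// Hypothetical helper: read exactly `length` bytes at `offset` without
/// moving the file cursor. `file_len` is the value `len()` would report.
pub fn read_range_at(
    file: &File,
    file_len: u64,
    offset: u64,
    length: usize,
) -> io::Result<Vec<u8>> {
    // Enforce `offset + length <= len()`, guarding against u64 overflow.
    let in_bounds = offset
        .checked_add(length as u64)
        .map_or(false, |end| end <= file_len);
    if !in_bounds {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "range exceeds source length",
        ));
    }
    let mut buf = vec![0u8; length];
    // Positional read: takes `&File` and leaves the shared cursor untouched.
    file.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}
```

Because `read_exact_at` borrows the file immutably, several threads can call the helper on one shared handle without coordinating seeks.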