The native PyO3 module returns raw dicts via pythonize, but the Python SDK API
expects typed dataclass objects (Document, Page, Metadata, etc.) to be
consistent with the subprocess fallback and test expectations.

Updated wrapper functions in __init__.py to convert native results:
- extract(): wraps dict in Document.from_dict()
- extract_stream(): wraps yielded page dicts in Page.from_dict()
- get_metadata(): wraps dict in Metadata()
- hash(): wraps string in Fingerprint.from_string()
- classify(): wraps dict in Classification()
- search(): wraps yielded match dicts in Match

The native PyO3 entry points (extract, extract_text, extract_stream) were
already implemented with:
- extract: uses extract_pdf + pythonize for PyDict conversion
- extract_text: uses extract_text for plain String return
- extract_stream: uses extract_pdf_streaming with custom StreamIterator

All kwargs parsing with strict validation (unknown kwargs raise TypeError) was
already in place.

Acceptance criteria:
- pdftract.extract() returns Document object with pages/metadata
- pdftract.extract_text() returns plain text string
- pdftract.extract_stream() yields Page objects
- Unknown kwarg raises TypeError
44 lines
1.5 KiB
Rust
//! Example: Stream PDF extraction as NDJSON.
//!
//! Demonstrates memory-efficient streaming extraction using
//! `extract_pdf_ndjson`, which writes each page as a newline-delimited
//! JSON object immediately after extraction. This keeps memory usage
//! bounded regardless of document size.
//!
//! Usage:
//!   cargo run --example extract_stream -- tests/fixtures/sample.pdf

use anyhow::Result;
use pdftract_core::{extract_pdf_ndjson, ExtractionOptions};
use std::env;
use std::io::{self, BufWriter};
use std::path::Path;

fn main() -> Result<()> {
    // Get PDF path from command line, or use a default
    let args: Vec<String> = env::args().collect();
    let pdf_path = args.get(1).map(|s| s.as_str()).unwrap_or("tests/fixtures/sample.pdf");

    // Extract with default options, streaming to stdout
    let options = ExtractionOptions::default();
    let stdout = BufWriter::new(io::stdout());
    let metadata = extract_pdf_ndjson(Path::new(pdf_path), &options, stdout)?;

    // Print summary to stderr (so it doesn't mix with NDJSON output)
    eprintln!("Extraction complete:");
    eprintln!(" Pages: {}", metadata.page_count);
    eprintln!(" Spans: {}", metadata.span_count);
    eprintln!(" Blocks: {}", metadata.block_count);
    eprintln!(" Errors: {}", metadata.error_count);

    if let Some(algo) = metadata.reading_order_algorithm {
        eprintln!(" Reading order: {}", algo);
    }

    // Print diagnostics if any
    for diag in &metadata.diagnostics {
        eprintln!(" Diagnostic: {}", diag);
    }

    Ok(())
}
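
The doc comment above says each page is written as a newline-delimited JSON object right after extraction, which keeps memory bounded. The sketch below illustrates that pattern using only the standard library. It is not the `pdftract_core` implementation: `Page`, `Summary`, `json_escape`, and `write_ndjson` are hypothetical names introduced for illustration, and the JSON is hand-formatted to avoid assuming any serialization crate.

```rust
use std::io::{self, Write};

/// Hypothetical page record; the real `pdftract_core` page type is not shown here.
struct Page {
    number: usize,
    text: String,
}

/// Hypothetical summary returned once the page stream is exhausted.
#[derive(Debug, PartialEq)]
struct Summary {
    page_count: usize,
}

/// Escape a string so it can be embedded in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Write each page as one JSON line as soon as the iterator yields it,
/// so only a single page needs to be held in memory at a time.
fn write_ndjson<I, W>(pages: I, mut out: W) -> io::Result<Summary>
where
    I: IntoIterator<Item = Page>,
    W: Write,
{
    let mut page_count = 0;
    for page in pages {
        writeln!(
            out,
            "{{\"page\":{},\"text\":\"{}\"}}",
            page.number,
            json_escape(&page.text)
        )?;
        page_count += 1;
    }
    out.flush()?;
    Ok(Summary { page_count })
}

fn main() -> io::Result<()> {
    // Pages are produced lazily; nothing is collected into a Vec first.
    let pages = (1..=3).map(|n| Page {
        number: n,
        text: format!("page {n} \"text\""),
    });
    let summary = write_ndjson(pages, io::BufWriter::new(io::stdout()))?;
    // Summary goes to stderr so it does not mix with the NDJSON on stdout.
    eprintln!("Pages: {}", summary.page_count);
    Ok(())
}
```

Taking the writer as a generic `W: Write` lets the same function stream to stdout, a file, or an in-memory buffer in tests, which matches how the example above passes a `BufWriter` around stdout.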