pdftract

jedarden 1195216fe8 feat(pdftract-43sg2): implement single-pass per-file parse pipeline for grep	2026-05-26 20:15:39 -04:00

Implement the worker_run() function that processes a single FileWorkItem into MatchEvents via Phase 1 (lexer/object/xref) + Phase 3 (content streams) + Phase 4 span builder (skipping Phase 4.5 reading-order detection).

Key changes:
- Add ProgressEvent enum with FileStart, FileProgress, FileDone, FileSkipped variants
- Create worker.rs with worker_run() function for single-pass PDF parsing
- Implement extract_spans_from_page() using process_with_mode() for Phase 3
- Implement group_glyphs_into_spans() for span building without reading order
- Add compute_fingerprint_for_grep() for document fingerprinting
- Handle encrypted PDFs with diagnostic emission
- Support --invert-match with synthetic event emission for zero-match spans
- Fix encryption module compilation issues (rc4/aes_256 imports, RC4 implementation)
- Add crossbeam-channel dependency for event channels

The worker skips reading-order detection (Phase 4.5) since grep doesn't need it, cutting per-file CPU by ~30-40% on typical pages.

Closes: pdftract-43sg2
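
The event flow described in this commit could look roughly like the sketch below. Only the ProgressEvent variant names come from the commit message. The variant fields, the worker_run_sketch() and collect_events() helpers, and the use of pre-extracted page text in place of real PDF parsing are all assumptions made for illustration. The sketch also uses std::sync::mpsc instead of crossbeam-channel so that it builds without external crates.

```rust
use std::sync::mpsc;
use std::thread;

/// Progress notifications a worker might emit. The variant names are taken
/// from the commit message; every field here is an assumption.
#[derive(Debug, Clone, PartialEq)]
enum ProgressEvent {
    FileStart { path: String },
    FileProgress { path: String, pages_done: usize, pages_total: usize },
    FileDone { path: String, matches: usize },
    FileSkipped { path: String, reason: String },
}

/// Hypothetical stand-in for worker_run(): `pages` holds already-extracted
/// page text, and `None` models a file that cannot be read (for example an
/// encrypted PDF). The real function's signature is not shown in the source.
fn worker_run_sketch(
    path: &str,
    pages: Option<&[&str]>,
    needle: &str,
    tx: &mpsc::Sender<ProgressEvent>,
) {
    let Some(pages) = pages else {
        // Emit a diagnostic-style event instead of failing the whole run.
        let _ = tx.send(ProgressEvent::FileSkipped {
            path: path.to_string(),
            reason: "encrypted".to_string(),
        });
        return;
    };

    let _ = tx.send(ProgressEvent::FileStart { path: path.to_string() });

    let mut matches = 0;
    for (i, page) in pages.iter().enumerate() {
        // Single pass over each page: count non-overlapping occurrences.
        matches += page.matches(needle).count();
        let _ = tx.send(ProgressEvent::FileProgress {
            path: path.to_string(),
            pages_done: i + 1,
            pages_total: pages.len(),
        });
    }

    let _ = tx.send(ProgressEvent::FileDone { path: path.to_string(), matches });
}

/// Runs the sketch worker on a background thread and gathers every event.
fn collect_events() -> Vec<ProgressEvent> {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let pages: &[&str] = &["foo bar", "bar bar"];
        worker_run_sketch("a.pdf", Some(pages), "bar", &tx);
        worker_run_sketch("locked.pdf", None, "bar", &tx);
        // `tx` is dropped here, which closes the channel for the receiver.
    });
    handle.join().unwrap();
    rx.iter().collect()
}
```

Calling collect_events() yields the events in send order: one FileStart, one FileProgress per page, a FileDone for a.pdf, then a FileSkipped for locked.pdf. Keeping progress on a channel lets a front end render status without the worker knowing how it is displayed.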
..
pdftract-cer-diff	docs(pdftract-aawrz): add LICENSE-MIT and LICENSE-APACHE files	2026-05-23 10:36:28 -04:00
pdftract-cli	feat(pdftract-43sg2): implement single-pass per-file parse pipeline for grep	2026-05-26 20:15:39 -04:00
pdftract-core	feat(pdftract-43sg2): implement single-pass per-file parse pipeline for grep	2026-05-26 20:15:39 -04:00
pdftract-libpdftract	feat(pdftract-3s2i): implement Phase 5.5.2 validation filter	2026-05-24 04:57:17 -04:00
pdftract-py	feat(pdftract-3h9xo): implement threads JSON output + schema integration	2026-05-25 13:40:15 -04:00