Commit graph

5 commits

Author SHA1 Message Date
jedarden
e6bf3dd290 feat(pdftract-3s2i): implement Phase 5.5.2 validation filter
Implement per-word validation filter for assisted-OCR BrokenVector path.

Changes:
- Add SpanSource::OcrAssisted variant to hybrid.rs
- Add Span::ocr_assisted() helper method
- Implement validate_ocr_with_position_hints() in ocr.rs
  - 5pt distance threshold for position validation
  - 0.4 confidence cap for rejected words
  - Linear scan for nearest-neighbor lookup
- Add unit tests for validation filter

Closes: pdftract-3s2i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 04:57:17 -04:00
jedarden
9b5fbc9b5e feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction
- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()

This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 12:30:26 -04:00
jedarden
71872aaf73 feat(pdftract-1eaxm): implement libpdftract C FFI library
Implement the libpdftract native FFI library as a cdylib + staticlib
with cbindgen-generated headers and full extern "C" API.

Components:
- crates/pdftract-libpdftract/ with cdylib + staticlib targets
- All 9 contract methods + utility functions as extern "C"
- cbindgen config and generated pdftract.h header
- pkg-config template (pdftract.pc.in)
- Homebrew formula template (distribution/homebrew/)
- vcpkg port template (distribution/vcpkg/)
- C conformance test (tests/conformance.c)

API features:
- Owned JSON strings returned via CString::into_raw()
- Caller frees with pdftract_free() (not libc free())
- Thread-local error storage (pdftract_last_error)
- Thread-safe and reentrant (no global mutable state)
- ABI version function for compatibility checking

Verification:
- cargo build produces libpdftract.so and libpdftract.a
- Conformance test compiles and runs successfully
- Thread safety verified with 4 concurrent threads

References:
- Plan line 3477: SDK Architecture / The Ten SDKs
- Bead: pdftract-1eaxm

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 08:55:12 -04:00
jedarden
9c7f9d3e37 test(pdftract-5ya9x): update memory roundtrip test to 10,000 iterations
- Updated test_api_null.c to run 10,000 alloc/free cycles (was 100)
- Updated verification note to mark memory roundtrip as PASS
- Improved stream_next implementation to use reference-based approach
  instead of Box::from_raw/leak dance for cleaner memory handling

All acceptance criteria for pdftract-5ya9x now PASS:
- 12 exported symbols verified via nm -D
- C client tests (test_api.c, test_api_null.c)
- C++ client test (test_extract.cpp)
- Null pointer safety
- Panic safety (catch_unwind on all entry points)
- Memory roundtrip (10,000 iterations)
- Thread safety (8 pthreads)

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-23 08:13:31 -04:00
jedarden
f26f9e3c0f feat(pdftract-uyhq7): scaffold libpdftract cdylib+staticlib crate
Add pdftract-libpdftract as 4th workspace member with dual crate-type
configuration (cdylib + staticlib) for C/C++ SDK flexibility.

Changes:
- Create crates/pdftract-libpdftract/Cargo.toml with cdylib+staticlib
- Create crates/pdftract-libpdftract/src/lib.rs scaffold
- Update root Cargo.toml workspace.members
- Configure [lib] name="pdftract" for correct artifact naming

Artifacts produced:
- target/debug/libpdftract.so (shared, cdylib)
- target/debug/libpdftract.a (static, staticlib)

Acceptance criteria:
- PASS: cargo build -p pdftract-libpdftract produces libpdftract.so/.a
- PASS: Workspace cargo build builds all 4 crates without regression
- PASS: cargo metadata shows pdftract-libpdftract in workspace members
- PASS: nm -D shows no exported symbols (empty API scaffold)

References: pdftract-uyhq7, Phase SDK epic

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 07:29:47 -04:00