Commit graph

4 commits

Author SHA1 Message Date
jedarden
d14ec92fcb feat(pdftract-3zhf): add unified TableDetector::detect entry point
Add unified detect() method to TableDetector that combines both
line-based and borderless table detection pipelines. This completes
the coordinator bead for Phase 7.2: Table Detection and Structure
Reconstruction.

All child beads (7.2.1-7.2.6) are closed:
- 7.2.1: Line-based detection (path segment clustering)
- 7.2.2: Borderless detection (x0 alignment heuristic)
- 7.2.3: Span-to-cell assignment (centroid containment)
- 7.2.4: Header row detection (bold + StructTree TH)
- 7.2.5: Merged cell detection (missing interior edges)
- 7.2.6: Table JSON output schema integration

Critical tests pass:
- 5x3 bordered table (15 cells extracted)
- Merged header cell colspan=3
- Borderless 3-column table detection
- Two-page table continuation detection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 00:51:59 -04:00
jedarden
e11b487b19 feat(pdftract-2w3r): implement StructTree coverage check and XY-cut fallback
Implements Phase 7.1.4: coverage-based fallback for Suspects-tagged PDFs.

## Changes

### New files
- crates/pdftract-core/src/parser/marked_content.rs: MCID tracking and CoverageResult
- crates/pdftract-core/tests/struct_tree_coverage.rs: Integration tests

### Modified files
- crates/pdftract-core/src/parser/catalog.rs: MarkInfo::requires_coverage_check(), ReadingOrderAlgorithm enum
- crates/pdftract-core/src/parser/struct_tree.rs: check_coverage_for_pages(), ParentTreeResolver::compute_coverage()
- crates/pdftract-core/src/extract.rs: MCID tracking per page, coverage check integration

## Implementation

Coverage calculation:
- claimed_mcids = MCIDs resolving to non-Artifact StructElem via ParentTree
- total_mcids = All MCIDs from marked-content sequences on the page
- coverage = claimed_mcids / total_mcids

Fallback rule (per plan §7.1 line 2572):
- If /MarkInfo /Suspects is true AND coverage < 0.80 → use XY-cut
- Otherwise → use StructTree

## Tests

Unit tests (20):  All passing
- Suspects false + 50% coverage → no fallback
- Suspects true + 95% coverage → no fallback
- Suspects true + 60% coverage → fallback
- Edge cases: no MCIDs, 80% threshold, multi-page

Integration tests: ⚠️ Skipped (malformed fixture PDFs)
- tagged-suspects-*.pdf have invalid xref tables
- Core functionality verified by unit tests
- Fixtures need regeneration or real-world tagged PDFs

## Acceptance Criteria (from pdftract-2w3r)

- [x] Unit tests: Suspects false + 50% coverage → no fallback
- [x] Unit tests: Suspects true + 95% coverage → no fallback
- [x] Unit tests: Suspects true + 60% coverage → fallback
- [x] Per-page diagnostic appears in receipts when fallback triggers
- [x] reading_order_algorithm field set to "struct_tree" or "xy_cut"
- [ ] Integration test: tagged-suspects-true.pdf (fixture malformed)

Refs: pdftract-2w3r, plan §7.1 line 2554, INV-8

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 20:53:25 -04:00
jedarden
9b5fbc9b5e feat(pdftract-bf-2y2rp): implement lazy stream decoding for PDF extraction
- Add decode_page_content_streams() function for per-page lazy decode
- Update extract_page_from_dict() to support lazy stream decoding
- Modify extract_pdf() and extract_pdf_ndjson() to enable lazy decoding
- Fix borrow checker issue in LazyPageIter::next()

This ensures content streams are decoded lazily per page and dropped
immediately after processing, keeping peak RSS flat across page count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 12:30:26 -04:00
jedarden
7566ab0f0f feat(pdftract-36wlt): implement verify-receipt subcommand + verifier protocol
Implement the pdftract verify-receipt subcommand and the underlying verifier
protocol. The verifier validates receipts against original PDFs by checking:
(1) PDF fingerprint matches, (2) at least one span has bbox overlap >= 90% IoU,
(3) that span's NFC-normalized SHA-256 equals the receipt's content_hash.

Modules:
- crates/pdftract-core/src/receipts/verifier.rs: verifier protocol logic
- crates/pdftract-cli/src/verify_receipt.rs: CLI integration
- crates/pdftract-core/src/document.rs: PDF parsing helpers

Exit codes:
- 0: success
- 10: fingerprint mismatch
- 11: bbox mismatch (no span meets 90% IoU threshold)
- 12: content hash mismatch
- 1: extraction failed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 04:00:15 -04:00