jedarden/pdftract

Author	SHA1	Message	Date
jedarden	1c6f26ecaa	fix(bf-4mkhv): clean up unused imports in hash.rs The bead description mentioned compile errors in hash.rs from API drift, but those errors were either already fixed or misattributed. The API usage was already correct: - compute_fingerprint already takes 3 arguments with source - len() already propagates Result with ? - read_at method already used correctly - Catalog fields accessed via trailer correctly Only cleanup: removed unused std::fs::File and std::io imports. Verification: notes/bf-4mkhv.md	2026-06-01 09:43:48 -04:00
jedarden	76f28edc99	docs(pdftract-2rc4): regenerate JSON schema with updated descriptions - Add missing descriptions for AnnotationSpecificJson fields - Schema generated via: cargo run --manifest-path=xtask/Cargo.toml --bin gen_schema - All JSON schema tests pass (6/6)	2026-06-01 07:26:35 -04:00
jedarden	432514d350	wip: AcroForm improvements, debug tooling, test corpus, and fixture updates Collects in-progress work across forms (Ch/Tx field handling, value_text edge cases), layout corrections, stream parser fixes, conformance test expansion, security audit test (TH-08), stream-decoder bomb fixture, debug examples reorganization under examples/debug/, sdk module scaffold, xtask CLI enhancements, and provenance entries for new fixtures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-30 09:48:14 -04:00
jedarden	bb7146cffe	fix(pdftract-2uk9z): wrap native module results in typed Python objects The native PyO3 module returns raw dicts via pythonize, but the Python SDK API expects typed dataclass objects (Document, Page, Metadata, etc.) to be consistent with the subprocess fallback and test expectations. Updated wrapper functions in __init__.py to convert native results: - extract(): wraps dict in Document.from_dict() - extract_stream(): wraps yielded page dicts in Page.from_dict() - get_metadata(): wraps dict in Metadata() - hash(): wraps string in Fingerprint.from_string() - classify(): wraps dict in Classification() - search(): wraps yielded match dicts in Match The native PyO3 entry points (extract, extract_text, extract_stream) were already implemented with: - extract: uses extract_pdf + pythonize for PyDict conversion - extract_text: uses extract_text for plain String return - extract_stream: uses extract_pdf_streaming with custom StreamIterator All kwargs parsing with strict validation (unknown kwargs raise TypeError) was already in place. Acceptance criteria: - pdftract.extract() returns Document object with pages/metadata - pdftract.extract_text() returns plain text string - pdftract.extract_stream() yields Page objects - Unknown kwarg raises TypeError	2026-05-28 21:18:38 -04:00
jedarden	f85e5149dd	feat(pdftract-91e1i): HTTP fetch sequence implementation Implement orchestration layer connecting HttpRangeSource to Phase 1.3 xref resolver and Phase 1.4 document model for remote PDF access: - Document::open_remote() public API for remote PDF loading - Progressive tail fetch (16 KB → 1 MB) for startxref location - Xref forward-scan disabled for remote sources (via is_remote check) - Page-by-page on-demand fetch via HttpRangeSource caching - Resource lazy load through XrefResolver cache - HEAD probe with 405 fallback, no Content-Length handling Acceptance criteria: ✅ open_remote(url) returns Document with correct page count ✅ HEAD failure modes (405, no Content-Length, 401) handled ✅ xref forward-scan disabled for remote (is_remote check) ✅ Page-by-page on-demand fetch (HttpRangeSource LRU cache) ✅ INV-8 maintained (all errors return Result) Files modified: - crates/pdftract-core/src/document.rs (Document::open_remote, from_source) - crates/pdftract-core/src/remote.rs (progressive tail fetch) - crates/pdftract-core/src/lib.rs (re-exports) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 13:17:00 -04:00
jedarden	06079a16b2	feat(pdftract-4bylb): implement Docstrum fallback for reading order Implement O'Gorman 1993 Docstrum algorithm for reading order detection on irregular layouts (magazines with sidebars) where XY-cut produces fragmented regions. Implementation: - k=5 nearest neighbors per block (Docstrum standard) - Euclidean center-to-center distance in PDF user space - Angle constraints: ±30° from horizontal (within-line) and vertical (between-line) - Root detection: nodes with no incoming edges from blocks above - Root sorting by (column ASC, y DESC) - DFS traversal per component in y-then-x order Acceptance criteria PASS: - Magazine main+sidebar: 2 components; main first, sidebar second - Pathological scattered: each a root, visited (column, y desc) - All-one-line horizontal: 1 component, left-to-right - All-one-column vertical: 1 component, top-to-bottom Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 04:16:24 -04:00
jedarden	38b7496c70	feat(pdftract-1ofnz): implement detect_line_direction with unicode-bidi - Add detect_line_direction() function using unicode_bidi::bidi_class - Count L (LTR) vs R/AL (RTL) characters, return dominant direction - Default to Ltr for empty/neutral-only strings (per bead acceptance criteria) - Return Mixed only when LTR and RTL counts are tied (both > 0) - Add comprehensive tests for Latin, Arabic, Hebrew, Cyrillic, and edge cases - Fix header_footer test: remove nonexistent reading_order_rank field Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:33:49 -04:00
jedarden	8cfbe70ab7	fix(pdftract-1jkme): add missing Arc import to correction.rs test module The test module was using Arc::from("Helvetica") but Arc was not in scope. Added `use std::sync::Arc;` to fix compilation errors. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:21:46 -04:00
jedarden	8a5d9e9ff5	test(pdftract-1q4ku): add acceptance criteria tests for score_span_readability The score_span_readability function was already fully implemented in readability.rs. This commit adds comprehensive tests for the acceptance criteria of bead pdftract-1q4ku: - AC1: All-printable English high coverage -> > 0.9 - AC2: All-U+FFFD -> significantly reduced (< 0.7) - AC3: All-whitespace -> whitespace_score=0 (binary penalty) - AC4: Low confidence -> scaled by confidence_floor - AC5: Non-English -> dict_coverage forced to 1.0 - AC6: Ligature split -> integrity 0 lowers score Also adds tests verifying: - Empty span returns 0.0 - Confidence threshold (0.6 -> 1.0) - Whitespace bounds [0.05, 0.40] - Printable fraction calculation - Dict coverage enabled/disabled behavior - Non-English lang tag handling (en, en-US, zh, None) All tests pass. The implementation correctly computes: - 0.35 * printable_fraction - 0.30 * dict_coverage (disabled for non-English) - 0.15 * whitespace_score (binary in/out bounds) - 0.10 * ligature_integrity (binary split detection) - 0.10 * confidence_floor (min(1.0, conf/0.6)) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:21:46 -04:00
jedarden	98964e06fe	fix(pdftract-2j4zl): fix header/footer duplicate counting bug The detect_headers_and_footers function was incrementing classified_count every time a block was classified, even if it was already classified from a previous sliding window iteration. With 10 pages and identical headers, blocks on pages 1-9 would be reclassified multiple times (31 classifications instead of 10). Fixed by checking if block is already "header" or "footer" before incrementing the counter. All 25 header_footer tests now pass. Refs: pdftract-2j4zl Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:04:13 -04:00
jedarden	c19f02c783	fix(pdftract-3jekw): fix watermark_formula test type annotations Fixed compilation errors in watermark_formula.rs tests by: - Using Block<()> as the concrete type for generic Block<S> - Creating a make_test_block() helper to avoid repetition - Removing unused TestBlock struct The stub functions classify_watermark and classify_formula were already correctly implemented and always return false (Phase 4 stubs). Acceptance criteria: - BlockKind::Watermark variant exists: PASS - BlockKind::Formula variant exists: PASS - classify_watermark always false: PASS - classify_formula always false: PASS - No v0.1.0 block has kind=Watermark or Formula: PASS References: plan.md Phase 4.4 (line 1709) + 4.6 watermark note (line 1752) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:37:15 -04:00
jedarden	336e48a7dd	feat(pdftract-3jekw): implement watermark and formula detection stubs Add Phase 4 stub classifiers for Watermark and Formula block kinds. Full detection deferred to Phase 7 per plan section 4.4 (line 1709) and 4.6 watermark note (line 1752). Changes: - Create crates/pdftract-core/src/layout/watermark_formula.rs with classify_watermark() and classify_formula() stubs returning false - Update crates/pdftract-core/src/layout/mod.rs to export the stubs - Add comprehensive module documentation linking to Phase 7 research Acceptance criteria: - BlockKind::Watermark and BlockKind::Formula variants exist (pre-existing) - classify_watermark always false - classify_formula always false - No v0.1.0 block has kind=Watermark or Formula Refs: pdftract-3jekw	2026-05-27 23:32:22 -04:00
jedarden	fda17d4d77	feat(pdftract-2rkc1): implement column confirmation with >= 3 line threshold Implement confirm_columns function that partitions page into candidate columns (regions between consecutive gaps + before-first + after-last), counts unique lines whose first span's x0 falls within each candidate's x-range, and promotes candidates with line_count >= 3 to confirmed columns. Supporting code: - ColumnGap struct with lo/hi bounds, width(), midpoint() - detect_column_gaps function for zero-coverage region detection - HasFirstSpan trait for first span bbox access - CandidateColumn struct for tracking x_range and line_count All 49 column tests pass, including all acceptance criteria. Bead: pdftract-2rkc1 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:09:01 -04:00
jedarden	ccd13f1bfa	feat(pdftract-1vrxg): implement word-break normalization Implement `normalize_word_breaks(span: &mut Span, script_hint: Option<Script>) -> u32` that strips zero-width formatting characters based on script requirements. - U+200B (zero-width space) and U+FEFF (BOM): ALWAYS stripped (never content) - U+200C (ZWNJ) and U+200D (ZWJ): stripped unless script requires them - Preserved for Arabic, Hebrew, Devanagari, Bengali, Indic, Thai, Lao, Tibetan, Myanmar, Khmer, Sinhala (orthographic in complex scripts) - Stripped for Latin and Unknown scripts (noise in extracted text) - `detect_script()` function identifies dominant script from Unicode codepoint ranges (threshold: >=3 matching characters) - `Script` enum with `preserves_joiners()` method determines ZWNJ/ZWJ handling - Returns count of stripped characters (bytes) Acceptance criteria: - "auto\u{200B}mation" (Latin) -> "automation" ✓ - Arabic ZWNJ/ZWJ with script_hint=Arabic -> preserved ✓ - Arabic ZWNJ/ZWJ with script_hint=None -> stripped ✓ - "\u{FEFF}hello" -> "hello" (BOM always stripped) ✓ - Devanagari ZWJ with script_hint=Devanagari -> preserved ✓ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:55:57 -04:00
jedarden	f1ac77281b	feat(pdftract-4md5z): implement XY-cut recursive reading order algorithm Phase 4.5 XY-cut reading order determination for block-level layout analysis. Implementation: - xy_cut() function with recursive widest-whitespace split - Vertical split first (columns dominate), then horizontal split - Single column detection via gap analysis (blocks on both sides of gap) - Projection histogram for robust gap detection (1-point bins) - MAX_DEPTH=20 to prevent stack overflow - XYCutResult with order, region_count, small_region_count, algorithm Acceptance criteria (PASS): - 2-column page: all left-column blocks before all right-column blocks - 3-column page: col0, col1, col2 order preserved - Single column: top-to-bottom order (y descending) - Full-width heading + 2 columns: heading first, then columns - Small region count signals Docstrum trigger (>10 regions with <3 blocks) - All unit tests pass Module: crates/pdftract-core/src/layout/reading_order.rs Tests: 16 tests covering basic cases, edge cases, split detection Closes: pdftract-4md5z	2026-05-26 18:37:31 -04:00
jedarden	6a05f7e247	fix(pdftract-tuky): fix color clamping test and verify Phase 3.1 coordinator Fixes: - Corrected test_color_device_rgb_clamped expected value from "#ff8080" to "#ff0080" (G value -0.5 should clamp to 0.0, not 0.5) - Fixed lifetime annotation in readability.rs (Cow<str> -> Cow<'_, str>) - Fixed unused_must_use warning in page_class.rs test Verification (notes/pdftract-tuky.md): - All 8 children of Phase 3.1 coordinator are closed - q/Q 64-level depth limit verified (test_64_nested_q_calls_succeed) - Td chain accumulation verified (test_td_chain) - Tm/Td ordering correct per ISO 32000 spec - /Rotate normalization implemented in child pdftract-1jlpy - All 6 color operators tracked (72 graphics_state tests pass) Closes: pdftract-tuky	2026-05-26 16:36:01 -04:00
jedarden	8bc63ac8b3	feat(pdftract-56vwd): implement build_x0_histogram for column detection - Add build_x0_histogram() function for 1pt-resolution x0 histogram - Add HasBBox trait for generic bbox access - Implement for [f32; 4] and [f64; 4] types - Clamp out-of-bounds x0 values with diagnostics - Add 7 tests covering single/multiple spans, clamping, rounding, A4 pages Acceptance criteria PASS: - Single span at x0=100: hist[100] == 1 - Multiple spans: hist[100]==2, hist[200]==2, hist[300]==1 - Negative x0 clamped to hist[0] with diagnostic - Empty spans returns zero Vec Closes: pdftract-56vwd	2026-05-25 11:59:27 -04:00
jedarden	6000c654ce	fix: resolve compilation errors across codebase - Fixed missing fields in BlockJson, SpanJson, ExtractionOptions initializations - Added feature gates to ocr_integration tests for conditional compilation - Fixed McpServerState::new calls to include audit writer argument - Fixed CCITTFaxDecoder::decode calls to use instance method - Fixed type casts for ObjRef::new calls - Fixed serde_json::Value method calls (is_some -> !is_null) - Fixed ProfileType test feature gates - Worked around lifetime issues in schema roundtrip tests These changes fix numerous compilation errors that were blocking the codebase from building. The main library and tests now compile successfully. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 08:38:04 -04:00
jedarden	aebe37ca84	feat(pdftract-5o6hx): implement hyphenation repair Implement repair_hyphenation() that detects and repairs end-of-line hyphenation within blocks. Joins hyphenated words across line breaks when the hyphen is at the column right edge and the continuation starts with a lowercase letter. Key features: - Detects hyphens: -, ‐ (U+2010), ‑ (U+2011), soft hyphen (U+00AD) - Right-edge detection: span bbox.x1 within 5% of column width - Lowercase continuation check to avoid joining sentences - Column-aware: only joins spans in same column - Cleans up empty spans/lines after repair Adds HasBBox and HyphenableSpan traits for flexible span types. Includes 9 comprehensive tests covering all acceptance criteria. Fixes pre-existing test cases in schema module (missing column field). Closes: pdftract-5o6hx Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 17:24:48 -04:00
jedarden	d84f8da3a4	feat(pdftract-5qj50): implement mojibake detection and repair via encoding_rs Implements Phase 4.7 Correction Pipeline step 3: mojibake detection and repair for UTF-8 bytes misinterpreted as Latin-1 (Windows-1252). Changes: - Add layout::correction module with detect_and_repair_mojibake function - Implement CorrectableText trait for mutable text access - Add trait implementations for hybrid::Span and schema::SpanJson - Make encoding_rs a non-optional dependency (was cjk-gated) - Detection heuristic: 2+ occurrences of telltale sequences (Ã©, Ã¨, â€™, etc.) - Re-decode via encoding_rs::WINDOWS_1252 when detected - Accept repair only if readability score improves by >0.05 epsilon - Fast-path pass-through for ASCII-only and clean UTF-8 text Closes: pdftract-5qj50 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 17:01:53 -04:00
jedarden	cce26bb6b6	feat(pdftract-64j83): implement column label assignment to Span.column + Line.column - Add column: Option<u32> field to Span in hybrid.rs - Create layout/columns.rs module with: - Column struct (index + x_range) - assign_columns_to_spans() - assign by x_range containing bbox[0] - assign_columns_to_lines() - propagate via mode (>50% dominance) - HasBBoxAndColumn and HasSpansWithColumn traits - Update layout/mod.rs to export column types - Fix test fixtures in inspect/render (add column: None) Acceptance criteria: - 2-column page span at x0=50 -> Some(0), x0=350 -> Some(1) - Full-width heading line -> None (mixed spans) - Single-column page -> all spans Some(0) - Inter-column gap -> None Closes: pdftract-64j83 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 14:45:19 -04:00
jedarden	a14787794c	feat(pdftract-6bwq4): implement baseline clustering algorithm Implement cluster_spans_into_lines for Phase 4.2 line formation. Groups spans into lines by baseline proximity using threshold 0.5 * median_font_size. - Add HasFontSize trait for types with font_size - Implement cluster_spans_into_lines function - Compute baseline for each span - Sort by baseline ASC - Sweep and cluster within threshold - Emit Line per cluster - Sort spans by x0 within each line - Add finalize_line_cluster helper - Export new items from layout module Tests: All 11 acceptance criteria tests pass - Spans baselines 100, 100.5, 105 with median 12: one line - Spans baselines 100, 110 with median 12: two lines - Superscript stays on same line as base text - Empty input produces empty output - Threshold is 0.5 * median_font_size (not hardcoded) Closes: pdftract-6bwq4	2026-05-24 10:39:01 -04:00
jedarden	d3c4ecd268	feat(pdftract-8n270): implement code block detection Implement Phase 4.4 code block classification for detecting indented monospace code blocks. Features: - is_monospace_font_name: Check font name for monospace indicators (mono, courier, code, fixed, console - case-insensitive) - is_fixed_pitch_flag: Check FontDescriptor bit 0 (FixedPitch) - classify_code: Classify block as code if all spans monospace AND indented ≥ 2em from column baseline - classify_page_code_blocks: Post-processing pass to upgrade paragraph blocks to code kind Acceptance criteria: - All-Courier, indented 24pt, font_size 12pt (2em=24): Code ✓ - All-monospace, not indented: NOT Code ✓ - Mixed serif+monospace: NOT Code ✓ - One serif span at end: NOT Code ✓ - FixedPitch flag set, no "Mono" in name: STILL Code ✓ Closes: pdftract-8n270 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 10:04:22 -04:00
jedarden	b96c3bfd37	feat(pdftract-9wevc): implement 20k English wordlist for readability scoring Implement compile-time phf::Set of 20,000 common English words for dictionary coverage scoring in readability analysis (Phase 4.7). Key changes: - Added wordlist-en-20k.txt (20k frequency-sorted English words) - Extended build.rs to generate phf::Set from wordlist - Added layout/wordlist.rs module with is_english_word() API - Added wordlist benchmarks (< 100 ns lookup achieved) Test results: - All 9 unit tests pass - Benchmarks: 13-62 ns per lookup (well under 100 ns requirement) - Binary size: Estimated ~200-220 KB (within 250 KB limit) Closes: pdftract-9wevc Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 09:29:13 -04:00
jedarden	508ca5d0bb	feat(pdftract-fy89c): implement line-to-block heuristic detector with 5 ordered triggers Implement Phase 4.4 block formation with 5 ordered heuristics for grouping lines into semantic blocks (paragraphs, headings, etc.): 1. Vertical gap > 1.5 * line_height → new block 2. Indent change > 0.03 * column_width → new block 3. Font size change > 1pt → new block 4. Rendering mode change → new block 5. Column boundary → MANDATORY block break Changes: - Extended Line<S> with median_font_size, rendering_mode, column fields - Added LineMetadata trait for abstracting line representations - Added Block<S> and BlockInput<L> structs for block representation - Implemented group_lines_into_blocks() with column-aware sorting All acceptance criteria tests pass (21/21). Closes: pdftract-fy89c	2026-05-24 06:14:43 -04:00
jedarden	e6bf3dd290	feat(pdftract-3s2i): implement Phase 5.5.2 validation filter Implement per-word validation filter for assisted-OCR BrokenVector path. Changes: - Add SpanSource::OcrAssisted variant to hybrid.rs - Add Span::ocr_assisted() helper method - Implement validate_ocr_with_position_hints() in ocr.rs - 5pt distance threshold for position validation - 0.4 confidence cap for rejected words - Linear scan for nearest-neighbor lookup - Add unit tests for validation filter Closes: pdftract-3s2i Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 04:57:17 -04:00
jedarden	99709354f5	feat(pdftract-oh30a): implement per-page readability aggregation Implement char-weighted median aggregation of per-span readability scores into a page-level score stored in extraction_quality.readability. Algorithm: - Collect (score, char_count) pairs from spans - Sort by score ascending - Walk sorted list accumulating character counts - Return score at half-total-char position Acceptance criteria: - Single span: returns its score - Multiple spans: char-weighted median (longer spans count more) - Empty page: returns 0.0 - All-perfect: returns 1.0 Closes: pdftract-oh30a Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 03:28:41 -04:00
jedarden	2cf02c6b2b	feat(pdftract-sdx9z): implement Line struct and baseline computation - Add layout::line module with Line<S> struct for Phase 4.2 line formation - Implement compute_baseline() using plan formula: y0 + height * 0.2 - Add LineDirection enum with serde support (Ltr, Rtl, Mixed) - Add union_bboxes() helper for computing span bbox unions - Add HasBBox trait for generic span type support Acceptance criteria: - compute_baseline([0,100,50,110]) returns 102.0 (height 10) - compute_baseline([0,100,50,100]) returns 100.0 (zero height) - LineDirection serde roundtrips to "ltr"/"rtl"/"mixed" - All 11 unit tests pass Closes: pdftract-sdx9z	2026-05-24 02:54:00 -04:00
jedarden	597f536b19	feat(pdftract-xzfkt): implement caption block classifier Add Phase 4 caption classification for detecting figure captions. Implements classify_caption() which identifies blocks as captions when: - Small font size (median < page body median) - Follows Figure block within 2 line heights - Same column as Figure Module: crates/pdftract-core/src/layout/caption.rs Acceptance criteria: - Block immediately below Figure, small font, same column → kind: Caption - Block 5 lines below Figure → NOT Caption (gap too large) - Block with body-size font below Figure → NOT Caption (font not smaller) - Block in different column from Figure → NOT Caption Tests: 9/9 passed covering all acceptance criteria plus edge cases. Closes: pdftract-xzfkt Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:56:34 -04:00

29 commits
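
Commits 2cf02c6b2b and a14787794c describe the Phase 4.2 line formation: a baseline of `y0 + height * 0.2` and clustering by baseline proximity with a threshold of `0.5 * median_font_size`. A minimal sketch of those two steps might look like the following. The function names `compute_baseline` and `cluster_spans_into_lines` come from the commit messages, but the signatures here are assumptions, and the `Span` struct, `median_font_size`, and `finalize_line` are hypothetical stand-ins for the crate's generic trait-based types. The sketch also assumes that a span joins the current cluster when its baseline is within the threshold of the cluster's first baseline; the commits do not say which reference point the real sweep uses.

```rust
/// Bounding box as [x0, y0, x1, y1] in PDF user space (assumed layout).
type BBox = [f32; 4];

/// Hypothetical minimal span; the real span type is generic and richer.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    bbox: BBox,
    font_size: f32,
}

/// Baseline estimate using the formula stated in the commit: y0 + height * 0.2.
fn compute_baseline(bbox: BBox) -> f32 {
    let height = bbox[3] - bbox[1];
    bbox[1] + height * 0.2
}

/// Median font size (assumption: mean of the two middle values for even counts).
fn median_font_size(spans: &[Span]) -> f32 {
    let mut sizes: Vec<f32> = spans.iter().map(|s| s.font_size).collect();
    sizes.sort_by(|a, b| a.total_cmp(b));
    let n = sizes.len();
    if n == 0 {
        0.0
    } else if n % 2 == 1 {
        sizes[n / 2]
    } else {
        (sizes[n / 2 - 1] + sizes[n / 2]) / 2.0
    }
}

/// Sort the spans of one cluster left to right by x0.
fn finalize_line(mut cluster: Vec<Span>) -> Vec<Span> {
    cluster.sort_by(|a, b| a.bbox[0].total_cmp(&b.bbox[0]));
    cluster
}

/// Group spans into lines: sort by baseline ascending, then sweep and start a
/// new line whenever a baseline is more than the threshold past the anchor.
fn cluster_spans_into_lines(spans: &[Span]) -> Vec<Vec<Span>> {
    if spans.is_empty() {
        return Vec::new();
    }
    let threshold = 0.5 * median_font_size(spans);

    let mut keyed: Vec<(f32, Span)> = spans
        .iter()
        .map(|s| (compute_baseline(s.bbox), s.clone()))
        .collect();
    keyed.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut lines: Vec<Vec<Span>> = Vec::new();
    let mut current: Vec<Span> = Vec::new();
    let mut anchor = keyed[0].0; // baseline of the first span in the current cluster

    for (baseline, span) in keyed {
        if !current.is_empty() && baseline - anchor > threshold {
            lines.push(finalize_line(std::mem::take(&mut current)));
        }
        if current.is_empty() {
            anchor = baseline;
        }
        current.push(span);
    }
    lines.push(finalize_line(current));
    lines
}
```

With a median font size of 12 the threshold is 6, so zero-height spans at y = 100, 100.5, and 105 fall into one line while spans at y = 100 and 110 split into two, which matches the acceptance criteria listed in a14787794c. Anchoring on the cluster's first baseline keeps a line from drifting arbitrarily far through a chain of small steps; comparing against the previous span's baseline instead would be an equally plausible reading of the commit.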