pdftract/crates/pdftract-core/src/layout/mod.rs
jedarden aebe37ca84 feat(pdftract-5o6hx): implement hyphenation repair
Implement repair_hyphenation() that detects and repairs end-of-line
hyphenation within blocks. Joins hyphenated words across line breaks
when the hyphen is at the column right edge and the continuation
starts with a lowercase letter.

Key features:
- Detects hyphens: -, ‐ (U+2010), ‑ (U+2011), soft hyphen (U+00AD)
- Right-edge detection: span bbox.x1 lies within 5% of the column width from the column's right edge
- Lowercase continuation check to avoid joining sentences
- Column-aware: only joins spans in same column
- Cleans up empty spans/lines after repair
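
The checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code from correction.rs: `SketchSpan`, `SketchColumn`, `should_join`, and `repair_hyphenation_sketch` are hypothetical names, the real implementation works through the HasBBox and HyphenableSpan traits, and moving only the first word of the continuation line is an assumption about how the join is done.

```rust
/// Hypothetical span type for illustration only (not the crate's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct SketchSpan {
    pub text: String,
    /// Right edge of the span's bounding box.
    pub x1: f32,
    /// Index into the column slice passed to the repair function.
    pub column: usize,
}

/// Hypothetical column geometry: left and right edges.
#[derive(Debug, Clone, Copy)]
pub struct SketchColumn {
    pub x0: f32,
    pub x1: f32,
}

/// Hyphen characters named in the commit message: -, U+2010, U+2011, U+00AD.
fn is_hyphen(c: char) -> bool {
    matches!(c, '-' | '\u{2010}' | '\u{2011}' | '\u{00AD}')
}

/// True when the span's right edge is within 5% of the column width
/// of the column's right edge.
fn at_right_edge(span: &SketchSpan, col: SketchColumn) -> bool {
    let width = col.x1 - col.x0;
    width > 0.0 && (col.x1 - span.x1).abs() <= 0.05 * width
}

/// All four conditions: trailing hyphen, lowercase continuation,
/// same column, hyphen at the right edge.
fn should_join(last: &SketchSpan, next: &SketchSpan, col: SketchColumn) -> bool {
    let ends_with_hyphen = last.text.chars().last().map_or(false, is_hyphen);
    let lowercase_next = next.text.chars().next().map_or(false, |c| c.is_lowercase());
    ends_with_hyphen && lowercase_next && last.column == next.column && at_right_edge(last, col)
}

/// Joins hyphenated words across consecutive lines of one block, then
/// drops spans and lines that became empty.
pub fn repair_hyphenation_sketch(lines: &mut Vec<Vec<SketchSpan>>, columns: &[SketchColumn]) {
    let mut i = 0;
    while i + 1 < lines.len() {
        let (head, tail) = lines.split_at_mut(i + 1);
        let (cur, next) = (&mut head[i], &mut tail[0]);
        let joinable = match (cur.last(), next.first()) {
            (Some(last), Some(first)) => columns
                .get(last.column)
                .map_or(false, |col| should_join(last, first, *col)),
            _ => false,
        };
        if joinable {
            let last = cur.last_mut().unwrap();
            let first = next.first_mut().unwrap();
            last.text.pop(); // drop the trailing hyphen
            // Move the first word of the continuation onto the previous line.
            let split = first.text.find(char::is_whitespace).unwrap_or(first.text.len());
            last.text.push_str(&first.text[..split]);
            first.text = first.text[split..].trim_start().to_string();
            if first.text.is_empty() {
                next.remove(0);
            }
        }
        // Remove the following line if it is empty; otherwise advance.
        if lines[i + 1].is_empty() {
            lines.remove(i + 1);
        } else {
            i += 1;
        }
    }
}
```

Under these assumptions, a line ending in "hyphen-" at x1 = 98 in a column spanning 0 to 100, followed by a line starting "ation repair", would become "hyphenation" and "repair". A continuation starting with an uppercase letter, or a hyphen well short of the right edge, is left untouched.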

Adds HasBBox and HyphenableSpan traits for flexible span types.
Includes 9 comprehensive tests covering all acceptance criteria.

Fixes pre-existing test cases in schema module (missing column field).

Closes: pdftract-5o6hx

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 17:24:48 -04:00


//! Layout analysis for Phase 4.
//!
//! This module implements block-level layout analysis including:
//! - Caption classification (caption.rs)
//! - Code block classification (code.rs)
//! - Column label assignment (columns.rs)
//! - Line formation (line.rs)
//! - Readability aggregation (readability.rs)
//! - English wordlist for dict coverage scoring (wordlist.rs)
//! - Text correction pipeline (correction.rs)
//!
//! Phase 4 organizes extracted text into semantic blocks (paragraphs,
//! headings, figures, captions, etc.) based on spatial and font metrics.

pub mod caption;
pub mod code;
pub mod columns;
pub mod correction;
pub mod line;
pub mod readability;
pub mod wordlist;

pub use caption::{classify_caption, classify_page_captions, Block, PageContext};
pub use code::{
    classify_code, classify_page_code_blocks, is_fixed_pitch_flag, is_monospace_font_name,
    is_monospace_span, MonospaceSpan,
};
pub use columns::{assign_columns_to_lines, assign_columns_to_spans, Column};
pub use correction::{detect_and_repair_mojibake, repair_hyphenation, HyphenableSpan};
pub use line::{
    cluster_spans_into_lines, compute_baseline, group_lines_into_blocks, union_bboxes, BlockInput,
    HasBBox, HasFontSize, Line, LineDirection, LineMetadata,
};
pub use readability::{aggregate_page_readability, ScoredSpan};
pub use wordlist::is_english_word;