Implement repair_hyphenation() that detects and repairs end-of-line hyphenation within blocks.

Joins hyphenated words across line breaks when the hyphen is at the column right edge and the continuation starts with a lowercase letter.

Key features:
- Detects hyphens: -, ‐ (U+2010), ‑ (U+2011), soft hyphen (U+00AD)
- Right-edge detection: span bbox.x1 within 5% of column width
- Lowercase continuation check to avoid joining sentences
- Column-aware: only joins spans in same column
- Cleans up empty spans/lines after repair

Adds HasBBox and HyphenableSpan traits for flexible span types.
Includes 9 comprehensive tests covering all acceptance criteria.
Fixes pre-existing test cases in schema module (missing column field).

Closes: pdftract-5o6hx
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
35 lines
1.3 KiB
Rust
//! Layout analysis for Phase 4.
//!
//! This module implements block-level layout analysis including:
//! - Caption classification (caption.rs)
//! - Code block classification (code.rs)
//! - Column label assignment (columns.rs)
//! - Line formation (line.rs)
//! - Readability aggregation (readability.rs)
//! - English wordlist for dict coverage scoring (wordlist.rs)
//! - Text correction pipeline (correction.rs)
//!
//! Phase 4 organizes extracted text into semantic blocks (paragraphs,
//! headings, figures, captions, etc.) based on spatial and font metrics.

pub mod caption;
pub mod code;
pub mod columns;
pub mod correction;
pub mod line;
pub mod readability;
pub mod wordlist;

pub use caption::{classify_caption, classify_page_captions, Block, PageContext};
pub use code::{
    classify_code, classify_page_code_blocks, is_fixed_pitch_flag, is_monospace_font_name,
    is_monospace_span, MonospaceSpan,
};
pub use columns::{assign_columns_to_lines, assign_columns_to_spans, Column};
pub use correction::{detect_and_repair_mojibake, repair_hyphenation, HyphenableSpan};
pub use line::{
    cluster_spans_into_lines, compute_baseline, group_lines_into_blocks, union_bboxes, BlockInput,
    HasBBox, HasFontSize, Line, LineDirection, LineMetadata,
};
pub use readability::{aggregate_page_readability, ScoredSpan};
pub use wordlist::is_english_word;
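
The body of correction.rs is not shown on this page, so the following is only a rough sketch of the repair rules listed in the commit message (hyphen set, 5% right-edge tolerance, lowercase continuation, same column, cleanup of emptied spans and lines). The `Span` and `ColumnBounds` types and the name `repair_hyphenation_sketch` are hypothetical stand-ins for illustration. The real `repair_hyphenation` is generic over the `HasBBox` and `HyphenableSpan` traits, and its signature may differ.

```rust
/// Hypothetical span type; the real code works through traits instead.
#[derive(Debug, Clone, PartialEq)]
pub struct Span {
    pub text: String,
    pub x1: f32,       // right edge of the span's bounding box
    pub column: usize, // column label assigned earlier in layout analysis
}

/// Hypothetical column geometry: left and right edge of a column.
#[derive(Debug, Clone, Copy)]
pub struct ColumnBounds {
    pub x0: f32,
    pub x1: f32,
}

/// Hyphen characters named in the commit message.
const HYPHENS: [char; 4] = ['-', '\u{2010}', '\u{2011}', '\u{00AD}'];

/// True when the span's right edge lies within 5% of the column width
/// of the column's right edge.
fn ends_at_right_edge(span: &Span, col: ColumnBounds) -> bool {
    let width = col.x1 - col.x0;
    width > 0.0 && (col.x1 - span.x1).abs() <= 0.05 * width
}

/// Joins a word split across consecutive lines. `lines` holds the spans of
/// each line in reading order; `columns` is indexed by `Span::column`.
/// Bounding boxes are left untouched in this sketch.
pub fn repair_hyphenation_sketch(lines: &mut Vec<Vec<Span>>, columns: &[ColumnBounds]) {
    for i in 0..lines.len().saturating_sub(1) {
        let (head, tail) = lines.split_at_mut(i + 1);
        let (Some(last), Some(first)) = (head[i].last_mut(), tail[0].first_mut()) else {
            continue;
        };
        // The line must end in one of the recognized hyphen characters.
        let Some(hyphen) = last.text.chars().last().filter(|c| HYPHENS.contains(c)) else {
            continue;
        };
        let Some(bounds) = columns.get(last.column) else {
            continue;
        };
        let lowercase_next = first.text.chars().next().map_or(false, |c| c.is_lowercase());
        if last.column != first.column || !lowercase_next || !ends_at_right_edge(last, *bounds) {
            continue;
        }
        // Drop the hyphen, then move the first word of the next line up.
        last.text.truncate(last.text.len() - hyphen.len_utf8());
        let split = first.text.find(char::is_whitespace).unwrap_or(first.text.len());
        let fragment: String = first.text.drain(..split).collect();
        last.text.push_str(&fragment);
        first.text = first.text.trim_start().to_string();
    }
    // Remove spans and lines that the repair emptied.
    for line in lines.iter_mut() {
        line.retain(|s| !s.text.is_empty());
    }
    lines.retain(|l| !l.is_empty());
}
```

Under these assumptions, a line ending in `imple-` at the column edge followed by a line starting with `mentation` becomes `implementation`, and the second line keeps only its remaining words. A heuristic of this shape also joins genuine compounds that happen to break at the hyphen (for example `well-` / `known`); whether the real implementation guards against that is not visible here.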