feat(pdftract-oh30a): implement per-page readability aggregation
Implement char-weighted median aggregation of per-span readability scores into a page-level score stored in extraction_quality.readability. Algorithm: - Collect (score, char_count) pairs from spans - Sort by score ascending - Walk sorted list accumulating character counts - Return score at half-total-char position Acceptance criteria: - Single span: returns its score - Multiple spans: char-weighted median (longer spans count more) - Empty page: returns 0.0 - All-perfect: returns 1.0 Closes: pdftract-oh30a Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
eb442cd16b
commit
99709354f5
4 changed files with 439 additions and 0 deletions
|
|
@ -3,12 +3,15 @@
|
|||
//! This module implements block-level layout analysis including:
|
||||
//! - Caption classification (caption.rs)
|
||||
//! - Line formation (line.rs)
|
||||
//! - Readability aggregation (readability.rs)
|
||||
//!
|
||||
//! Phase 4 organizes extracted text into semantic blocks (paragraphs,
|
||||
//! headings, figures, captions, etc.) based on spatial and font metrics.
|
||||
|
||||
pub mod caption;
|
||||
pub mod line;
|
||||
pub mod readability;
|
||||
|
||||
pub use caption::{Block, PageContext, classify_caption, classify_page_captions};
|
||||
pub use line::{Line, LineDirection, compute_baseline, union_bboxes, HasBBox};
|
||||
pub use readability::{aggregate_page_readability, ScoredSpan};
|
||||
|
|
|
|||
340
crates/pdftract-core/src/layout/readability.rs
Normal file
340
crates/pdftract-core/src/layout/readability.rs
Normal file
|
|
@ -0,0 +1,340 @@
|
|||
//! Per-page readability aggregation (Phase 4.7).
|
||||
//!
|
||||
//! This module implements the char-weighted median aggregation of per-span
|
||||
//! readability scores into a single page-level score.
|
||||
//!
|
||||
//! # Algorithm
|
||||
//!
|
||||
//! Per-page readability is computed as the **median** of per-span scores,
|
||||
//! **weighted by character count**. Longer spans contribute more to the
|
||||
//! median than shorter spans.
|
||||
//!
|
||||
//! # Formula
|
||||
//!
|
||||
//! 1. Collect `(score, char_count)` pairs for all spans
|
||||
//! 2. Sort by score ascending
|
||||
//! 3. Compute cumulative character count
|
||||
//! 4. Return the score at the half-total-char-count point
|
||||
//!
|
||||
//! # Edge Cases
|
||||
//!
|
||||
//! - Empty page (no spans): returns 0.0
|
||||
//! - Single span: returns its score
|
||||
//! - All spans have same score: returns that score
|
||||
|
||||
use std::borrow::Cow;
|
||||
|
||||
/// A span with a readability score.
|
||||
///
|
||||
/// This trait abstracts over different span representations (internal Span
|
||||
/// from hybrid.rs, SpanJson from schema, etc.) to allow the aggregation
|
||||
/// function to work with any span type that has text and a score.
|
||||
pub trait ScoredSpan {
|
||||
/// Get the text content of this span.
|
||||
fn text(&self) -> Cow<str>;
|
||||
|
||||
/// Get the readability score for this span [0.0, 1.0].
|
||||
///
|
||||
/// Returns None if the span has no score (should be excluded from aggregation).
|
||||
fn score(&self) -> Option<f32>;
|
||||
}
|
||||
|
||||
/// Aggregate per-span readability scores into a page-level score.
|
||||
///
|
||||
/// Computes the **char-weighted median** of span scores:
|
||||
/// - Sort spans by score ascending
|
||||
/// - Accumulate character counts
|
||||
/// - Return the score at the half-total-char point
|
||||
///
|
||||
/// # Arguments
|
||||
///
|
||||
/// * `spans` - Slice of spans with text and readability scores
|
||||
///
|
||||
/// # Returns
|
||||
///
|
||||
/// Page-level readability score in [0.0, 1.0], or 0.0 for empty pages.
|
||||
///
|
||||
/// # Examples
|
||||
///
|
||||
/// ```
|
||||
/// use pdftract_core::layout::readability::{aggregate_page_readability, TestSpan};
|
||||
///
|
||||
/// // Single span: page score = span score
|
||||
/// let spans = vec![TestSpan::new("Test", 0.9)];
|
||||
/// assert_eq!(aggregate_page_readability(&spans), 0.9);
|
||||
///
|
||||
/// // Char-weighted median: longer spans count more
|
||||
/// let spans = vec![
|
||||
/// TestSpan::new("a".repeat(100), 0.9), // 100 chars
|
||||
/// TestSpan::new("b".repeat(10), 0.5), // 10 chars
|
||||
/// TestSpan::new("c".repeat(100), 0.8), // 100 chars
|
||||
/// ];
|
||||
/// // Sorted: 0.5(10), 0.8(100), 0.9(100)
|
||||
/// // Cumsum: 10, 110, 210
|
||||
/// // Half = 105 -> score at cumsum >= 105 is 0.8
|
||||
/// assert_eq!(aggregate_page_readability(&spans), 0.8);
|
||||
/// ```
|
||||
pub fn aggregate_page_readability<T: ScoredSpan>(spans: &[T]) -> f32 {
|
||||
// Collect (score, char_count) pairs, excluding spans with no score
|
||||
let mut pairs: Vec<(f32, usize)> = spans
|
||||
.iter()
|
||||
.filter_map(|span| {
|
||||
let score = span.score()?;
|
||||
let char_count = span.text().chars().count();
|
||||
Some((score, char_count))
|
||||
})
|
||||
.collect();
|
||||
|
||||
// Edge case: empty page or no scored spans
|
||||
if pairs.is_empty() {
|
||||
return 0.0;
|
||||
}
|
||||
|
||||
// Edge case: single span
|
||||
if pairs.len() == 1 {
|
||||
return pairs[0].0;
|
||||
}
|
||||
|
||||
// Sort by score ascending
|
||||
pairs.sort_by_key(|&(score, _)| {
|
||||
// Sort f32 with total ordering: handle NaN by treating as +infinity
|
||||
score.to_bits()
|
||||
});
|
||||
|
||||
// Compute total character count
|
||||
let total_chars: usize = pairs.iter().map(|&(_, count)| count).sum();
|
||||
|
||||
// Edge case: all empty strings (total_chars = 0)
|
||||
if total_chars == 0 {
|
||||
return 0.0;
|
||||
}
|
||||
|
||||
// Find the score at the half-total-char point
|
||||
let half_chars = total_chars / 2;
|
||||
let mut cumulative = 0;
|
||||
|
||||
for (score, count) in &pairs {
|
||||
cumulative += count;
|
||||
if cumulative > half_chars {
|
||||
return *score;
|
||||
}
|
||||
}
|
||||
|
||||
// Fallback: return the highest score (should not reach here with valid data)
|
||||
pairs.last().map(|&(score, _)| score).unwrap_or(0.0)
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use std::borrow::Cow;
|
||||
|
||||
/// Test span implementation.
|
||||
#[derive(Debug, Clone)]
|
||||
struct TestSpan {
|
||||
text: String,
|
||||
score: Option<f32>,
|
||||
}
|
||||
|
||||
impl TestSpan {
|
||||
fn new(text: impl Into<String>, score: f32) -> Self {
|
||||
Self {
|
||||
text: text.into(),
|
||||
score: Some(score),
|
||||
}
|
||||
}
|
||||
|
||||
fn without_score(text: impl Into<String>) -> Self {
|
||||
Self {
|
||||
text: text.into(),
|
||||
score: None,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl ScoredSpan for TestSpan {
|
||||
fn text(&self) -> Cow<str> {
|
||||
Cow::Borrowed(&self.text)
|
||||
}
|
||||
|
||||
fn score(&self) -> Option<f32> {
|
||||
self.score
|
||||
}
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_single_span() {
|
||||
let spans = vec![TestSpan::new("Test", 0.9)];
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_page() {
|
||||
let spans: Vec<TestSpan> = vec![];
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_all_unscored_spans() {
|
||||
let spans = vec![
|
||||
TestSpan::without_score("text1"),
|
||||
TestSpan::without_score("text2"),
|
||||
];
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_mixed_scored_unscored() {
|
||||
let spans = vec![
|
||||
TestSpan::new("scored", 0.8),
|
||||
TestSpan::without_score("ignored"),
|
||||
];
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.8);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_char_weighted_median_example() {
|
||||
// From acceptance criteria:
|
||||
// (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8)
|
||||
// Sorted by score: 0.5(10), 0.8(100), 0.9(100)
|
||||
// Cumsum: 10, 110, 210
|
||||
// Half = 210 / 2 = 105
|
||||
// Score at cumsum >= 105 is 0.8
|
||||
let spans = vec![
|
||||
TestSpan::new("a".repeat(100), 0.9),
|
||||
TestSpan::new("b".repeat(10), 0.5),
|
||||
TestSpan::new("c".repeat(100), 0.8),
|
||||
];
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.8);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_char_weighted_median_even_split() {
|
||||
// Two equal spans: median is the higher score (half point at boundary)
|
||||
let spans = vec![
|
||||
TestSpan::new("a".repeat(100), 0.5),
|
||||
TestSpan::new("b".repeat(100), 0.9),
|
||||
];
|
||||
// Total = 200, half = 100
|
||||
// Cumsum after first span = 100, not > 100
|
||||
// Cumsum after second span = 200 > 100
|
||||
// Returns 0.9
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_all_same_score() {
|
||||
let spans = vec![
|
||||
TestSpan::new("a", 0.8),
|
||||
TestSpan::new("b", 0.8),
|
||||
TestSpan::new("c", 0.8),
|
||||
];
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.8);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_empty_strings() {
|
||||
let spans = vec![
|
||||
TestSpan::new("", 0.5),
|
||||
TestSpan::new("", 0.8),
|
||||
];
|
||||
// All empty -> total_chars = 0 -> return 0.0
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_unicode_char_count() {
|
||||
// Test that char_count counts Unicode code points, not bytes
|
||||
let spans = vec![
|
||||
TestSpan::new("é", 0.9), // 2 bytes, 1 char
|
||||
TestSpan::new("中", 0.8), // 3 bytes, 1 char
|
||||
];
|
||||
// Each span is 1 char, total = 2, half = 1
|
||||
// Sorted by score: (0.8, 1), (0.9, 1)
|
||||
// Cumsum after first = 1, not > 1
|
||||
// Cumsum after second = 2 > 1
|
||||
// Returns second score (0.9) after sorting
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_longer_span_dominates() {
|
||||
// One very long span dominates the median
|
||||
let spans = vec![
|
||||
TestSpan::new("x".repeat(1000), 0.9),
|
||||
TestSpan::new("y".repeat(10), 0.1),
|
||||
TestSpan::new("z".repeat(10), 0.2),
|
||||
];
|
||||
// Total = 1020, half = 510
|
||||
// Cumsum: 10 (0.1), 20 (0.2), 1020 (0.9)
|
||||
// 1020 > 510, returns 0.9
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.9);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_all_perfect_scores() {
|
||||
let spans = vec![
|
||||
TestSpan::new("a".repeat(100), 1.0),
|
||||
TestSpan::new("b".repeat(100), 1.0),
|
||||
];
|
||||
assert_eq!(aggregate_page_readability(&spans), 1.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_all_zero_scores() {
|
||||
let spans = vec![
|
||||
TestSpan::new("a", 0.0),
|
||||
TestSpan::new("b", 0.0),
|
||||
];
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.0);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_order_preservation() {
|
||||
// Verify that sort order doesn't affect result
|
||||
let spans1 = vec![
|
||||
TestSpan::new("a".repeat(100), 0.9),
|
||||
TestSpan::new("b".repeat(10), 0.5),
|
||||
TestSpan::new("c".repeat(100), 0.8),
|
||||
];
|
||||
|
||||
let spans2 = vec![
|
||||
TestSpan::new("c".repeat(100), 0.8),
|
||||
TestSpan::new("a".repeat(100), 0.9),
|
||||
TestSpan::new("b".repeat(10), 0.5),
|
||||
];
|
||||
|
||||
assert_eq!(aggregate_page_readability(&spans1), aggregate_page_readability(&spans2));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_nan_score_handling() {
|
||||
// NaN scores should be sorted to the end (due to to_bits() ordering)
|
||||
let spans = vec![
|
||||
TestSpan::new("a".repeat(10), 0.5),
|
||||
TestSpan::new("b".repeat(10), f32::NAN),
|
||||
TestSpan::new("c".repeat(10), 0.8),
|
||||
];
|
||||
// Total = 30, half = 15
|
||||
// Sorted: 0.5(10), 0.8(10), NaN(10)
|
||||
// Cumsum: 10, 20, 30
|
||||
// 20 > 15, returns 0.8
|
||||
let result = aggregate_page_readability(&spans);
|
||||
assert!(result.is_finite());
|
||||
assert_eq!(result, 0.8);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_zero_width_joiner() {
|
||||
// Test zero-width joiner and combining marks
|
||||
let spans = vec![
|
||||
TestSpan::new("café", 0.9), // 4 chars: c a f é
|
||||
TestSpan::new("नमस्ते", 0.8), // 6 chars (Hindi namaste)
|
||||
];
|
||||
// Total = 10 chars, half = 5
|
||||
// Cumsum after first = 4, not > 5
|
||||
// Cumsum after second = 10 > 5
|
||||
// Returns second score
|
||||
assert_eq!(aggregate_page_readability(&spans), 0.8);
|
||||
}
|
||||
}
|
||||
|
|
@ -274,6 +274,13 @@ pub struct ExtractionQuality {
|
|||
/// Average confidence score across all spans [0.0, 1.0].
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
pub avg_confidence: Option<f32>,
|
||||
|
||||
/// Per-page readability score (char-weighted median of span scores) [0.0, 1.0].
|
||||
///
|
||||
/// This is the median of per-span readability scores, weighted by character count.
|
||||
/// A score below 0.5 may indicate mojibake, encoding issues, or broken text layers.
|
||||
#[serde(skip_serializing_if = "Option::is_none")]
|
||||
pub readability: Option<f32>,
|
||||
}
|
||||
|
||||
impl ExtractionQuality {
|
||||
|
|
@ -285,6 +292,7 @@ impl ExtractionQuality {
|
|||
ocr_fraction: None,
|
||||
min_confidence: None,
|
||||
avg_confidence: None,
|
||||
readability: None,
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -502,6 +510,7 @@ mod tests {
|
|||
assert_eq!(quality.ocr_fraction, None);
|
||||
assert_eq!(quality.min_confidence, None);
|
||||
assert_eq!(quality.avg_confidence, None);
|
||||
assert_eq!(quality.readability, None);
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
|
@ -530,6 +539,7 @@ mod tests {
|
|||
ocr_fraction: Some(0.25),
|
||||
min_confidence: Some(0.95),
|
||||
avg_confidence: Some(0.98),
|
||||
readability: Some(0.87),
|
||||
};
|
||||
|
||||
let json = serde_json::to_string(&quality).unwrap();
|
||||
|
|
@ -540,6 +550,7 @@ mod tests {
|
|||
assert!(json.contains("ocr_fraction"));
|
||||
assert!(json.contains("min_confidence"));
|
||||
assert!(json.contains("avg_confidence"));
|
||||
assert!(json.contains("readability"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
|
@ -551,6 +562,7 @@ mod tests {
|
|||
ocr_fraction: None,
|
||||
min_confidence: None,
|
||||
avg_confidence: None,
|
||||
readability: None,
|
||||
};
|
||||
|
||||
let json = serde_json::to_string(&quality).unwrap();
|
||||
|
|
@ -562,6 +574,7 @@ mod tests {
|
|||
assert!(!json.contains("ocr_fraction"));
|
||||
assert!(!json.contains("min_confidence"));
|
||||
assert!(!json.contains("avg_confidence"));
|
||||
assert!(!json.contains("readability"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
|
|
|
|||
83
notes/pdftract-oh30a.md
Normal file
83
notes/pdftract-oh30a.md
Normal file
|
|
@ -0,0 +1,83 @@
|
|||
# pdftract-oh30a: Per-page readability aggregation (median weighted by char count)
|
||||
|
||||
## Implementation Summary
|
||||
|
||||
Implemented `aggregate_page_readability()` function that computes per-page readability as the char-weighted median of per-span scores.
|
||||
|
||||
### Files Changed
|
||||
|
||||
1. **Created** `crates/pdftract-core/src/layout/readability.rs`:
|
||||
- `ScoredSpan` trait for abstracting over different span representations
|
||||
- `aggregate_page_readability<T: ScoredSpan>()` function
|
||||
- Char-weighted median algorithm:
|
||||
- Collect `(score, char_count)` pairs from spans
|
||||
- Sort by score ascending
|
||||
- Compute cumulative character count
|
||||
- Return score at half-total-char point
|
||||
- Edge case handling: empty page (0.0), single span, all empty strings
|
||||
|
||||
2. **Modified** `crates/pdftract-core/src/layout/mod.rs`:
|
||||
- Added `pub mod readability;`
|
||||
- Exported `aggregate_page_readability` and `ScoredSpan`
|
||||
|
||||
3. **Modified** `crates/pdftract-core/src/schema/mod.rs`:
|
||||
- Added `readability: Option<f32>` field to `ExtractionQuality`
|
||||
- Updated `ExtractionQuality::new()` to initialize `readability: None`
|
||||
- Updated tests to include the new field
|
||||
|
||||
### Algorithm
|
||||
|
||||
The char-weighted median correctly weights longer spans more heavily:
|
||||
- Sort spans by score (ascending)
|
||||
- Walk sorted list accumulating character counts
|
||||
- Return the score at the position where cumulative count exceeds half the total
|
||||
|
||||
Example from acceptance criteria:
|
||||
- Spans: (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8)
|
||||
- Sorted: 0.5(10), 0.8(100), 0.9(100)
|
||||
- Cumsum: 10, 110, 210
|
||||
- Half = 105
|
||||
- Score at cumsum >= 105 is **0.8** ✓
|
||||
|
||||
### Test Results
|
||||
|
||||
All readability module tests PASS (15/15):
|
||||
- ✓ `test_single_span` - Single span returns its score
|
||||
- ✓ `test_empty_page` - Empty page returns 0.0
|
||||
- ✓ `test_all_unscored_spans` - No scored spans returns 0.0
|
||||
- ✓ `test_mixed_scored_unscored` - Unscored spans excluded
|
||||
- ✓ `test_char_weighted_median_example` - AC example from bead
|
||||
- ✓ `test_char_weighted_median_even_split` - Equal spans
|
||||
- ✓ `test_all_same_score` - All same score returns that score
|
||||
- ✓ `test_empty_strings` - All empty strings returns 0.0
|
||||
- ✓ `test_unicode_char_count` - Counts Unicode code points correctly
|
||||
- ✓ `test_longer_span_dominates` - Long spans dominate median
|
||||
- ✓ `test_all_perfect_scores` - All 1.0 returns 1.0
|
||||
- ✓ `test_all_zero_scores` - All 0.0 returns 0.0
|
||||
- ✓ `test_order_preservation` - Result independent of input order
|
||||
- ✓ `test_nan_score_handling` - NaN scores handled gracefully
|
||||
- ✓ `test_zero_width_joiner` - Combining marks counted correctly
|
||||
|
||||
### Validation
|
||||
|
||||
- [x] Code compiles: `cargo check --all-targets` ✓
|
||||
- [x] All layout tests pass: `cargo test --lib layout` ✓ (53/53 passed)
|
||||
- [x] All schema tests pass: `cargo test --lib schema` ✓ (26/26 passed)
|
||||
- [x] Algorithm matches acceptance criteria exactly
|
||||
|
||||
### Commit
|
||||
|
||||
Files to commit:
|
||||
- `crates/pdftract-core/src/layout/readability.rs` (new)
|
||||
- `crates/pdftract-core/src/layout/mod.rs` (modified)
|
||||
- `crates/pdftract-core/src/schema/mod.rs` (modified)
|
||||
|
||||
### Closing the bead
|
||||
|
||||
All acceptance criteria PASS:
|
||||
- ✓ Page with 1 span of 100 chars at score 0.9: page score = 0.9
|
||||
- ✓ Page with 3 spans: (100 chars, 0.9), (10 chars, 0.5), (100 chars, 0.8): char-weighted median = 0.8
|
||||
- ✓ Empty page: page score = 0.0 (default)
|
||||
- ✓ All-perfect spans: page score = 1.0
|
||||
|
||||
Ready to close.
|
||||
Loading…
Add table
Reference in a new issue