All 7 sub-phases (4.1-4.7) are now fully implemented:

- 4.1 Glyph to Span Merging: grouping consecutive glyphs into spans
- 4.2 Line Formation: baseline clustering and direction detection
- 4.3 Column Detection: histogram-based gap analysis
- 4.4 Block Formation: paragraph/heading/list/table/caption/figure/code classification
- 4.5 Reading Order: XY-cut algorithm with Docstrum fallback
- 4.6 Output Serialization: plain text projection with configurable filters
- 4.7 Text Readability: composite scoring and correction pipeline

Closes pdftract-4k1x4.
Verification: notes/pdftract-4k1x4.md.

Changes:

- extract.rs: integrate Phase 4 modules into main pipeline
- layout/correction.rs: expand correction pipeline with 2048 lines of tests
- layout/readability.rs: five-signal scoring with char-weighted median
- text.rs: plain text serialization with page breaks and filters
- span/mod.rs: Span struct with flags and confidence tracking
- layout/columns.rs: column assignment to lines and spans

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
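
For readers unfamiliar with the "char-weighted median" named in the layout/readability.rs entry, the following is a minimal sketch of one plausible reading: each span contributes a score weighted by its character count, and the result is the lowest score at which the cumulative weight reaches half of the total. The function name `char_weighted_median`, the `(score, char_count)` tuple layout, and the lower-median tie rule are assumptions made for illustration; none of them is taken from the actual module.

```rust
/// Hypothetical helper, not the code in layout/readability.rs.
/// Returns the lowest score at which the cumulative character count
/// reaches at least half of the total character count.
fn char_weighted_median(samples: &[(f64, usize)]) -> Option<f64> {
    // Ignore empty spans and non-finite scores (an assumption of this sketch).
    let mut items: Vec<(f64, usize)> = samples
        .iter()
        .copied()
        .filter(|(score, chars)| *chars > 0 && score.is_finite())
        .collect();
    if items.is_empty() {
        return None;
    }

    // Sort by score; total_cmp gives a total order over f64 values.
    items.sort_by(|a, b| a.0.total_cmp(&b.0));

    let total: usize = items.iter().map(|(_, chars)| *chars).sum();
    let mut cumulative: usize = 0;
    for (score, chars) in &items {
        cumulative += *chars;
        // Compare doubled cumulative weight to avoid fractional halves.
        if cumulative * 2 >= total {
            return Some(*score);
        }
    }

    // Not reached for non-empty input; kept as a defensive fallback.
    items.last().map(|(score, _)| *score)
}
```

With the input `[(0.9, 120), (0.2, 3), (0.8, 80), (0.1, 2)]` the function returns `Some(0.9)`, because the 120-character span alone carries more than half of the 205 total characters. An unweighted median of the same scores would fall between 0.2 and 0.8, which shows why weighting by character count keeps a few tiny noisy spans from dominating a page-level score.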
19 lines
451 B
JSON
{
  "attachments": [],
  "fingerprint": "pdftract-v1:ab24a95f44ceca5d2aed4b6d056adddd8539f44c6cd6ca506534e830c82ea8a8",
  "form_fields": [],
  "javascript_actions": [],
  "links": [],
  "metadata": {
    "block_count": 0,
    "cache_age_seconds": null,
    "cache_status": "skipped",
    "page_count": 0,
    "reading_order_algorithm": "xy_cut",
    "span_count": 0
  },
  "pages": [],
  "schema_version": "1.0",
  "signatures": [],
  "threads": []
}