Add four research documents on text quality and document-type handling

- text-readability-validation: character/word/entropy/perplexity checks,
symbol font detection, remediation decision tree, span quality metadata
- post-ocr-text-correction: error taxonomy, confusable tables, noisy channel
n-gram model, regex patterns, hyphenation, layout-based correction pipeline
- presentation-and-spreadsheet-pdfs: detection heuristics, slide structure,
bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries,
cell type inference, Rust output schema
- semantic-text-reconstruction: beam search n-gram reconstruction, NER
validation, domain lexicons, cross-span consistency, abbreviation expansion,
citation repair, coherence scoring, ReconstructedSpan output schema

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>