Add four research documents focused on readable text production

- type3-font-extraction: CharProcs stream parsing, TeX/dvips naming
conventions, dHash shape fingerprinting, nested font stacks, OCR fallback
- watermark-and-background-separation: five PDF watermark mechanisms,
transparency tracking, cross-page repetition, WCAG contrast detection,
raster inpainting, diagonal watermark removal pipeline
- historical-and-degraded-document-extraction: eight degradation categories,
bleed-through removal, illumination correction, Sauvola binarization,
stroke reconstruction, Fraktur/long-s handling, confidence-gated output
- complex-layout-reading-order: baseline clustering, XY-cut, Docstrum,
RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering,
perplexity-based confidence with natural_order fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>