pdftract

jedarden f805e52fa3	2026-05-16 15:13:10 -04:00
Add four research documents focused on readable text production
- type3-font-extraction: CharProcs stream parsing, TeX/dvips naming conventions, dHash shape fingerprinting, nested font stacks, OCR fallback
- watermark-and-background-separation: five PDF watermark mechanisms, transparency tracking, cross-page repetition, WCAG contrast detection, raster inpainting, diagonal watermark removal pipeline
- historical-and-degraded-document-extraction: eight degradation categories, bleed-through removal, illumination correction, Sauvola binarization, stroke reconstruction, Fraktur/long-s handling, confidence-gated output
- complex-layout-reading-order: baseline clustering, XY-cut, Docstrum, RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering, perplexity-based confidence with natural_order fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
notes	Add SDK architecture notes covering top 10 languages	2026-05-16 14:51:25 -04:00
plan	Initial repo scaffold with README and docs structure	2026-05-16 14:26:16 -04:00
research	Add four research documents focused on readable text production	2026-05-16 15:13:10 -04:00