pdftract/docs
jedarden e3b72efc83 Add research: Southeast Asian scripts, OpenType MATH formula extraction
Two new research documents covering Southeast Asian script extraction
(Thai/Khmer/Myanmar/Lao/Tibetan/Ethiopic — cluster structure, no-space
word boundary policy for Thai/Lao, Zawgyi vs Unicode detection for
Myanmar, USE shaping, Tesseract fallback) and OpenType MATH table
exploitation for formula extraction (MathConstants for fraction/
subscript/radical layout, TeX OML/OMS/OMX encoding tables, MathML
output generation, GlyphAssembly reconstruction, alternative text
and MathJax XMP source recovery).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 16:21:48 -04:00
notes     Add SDK architecture notes covering top 10 languages                     2026-05-16 14:51:25 -04:00
plan      Add research: span merging, Unicode normalization, implementation plan   2026-05-16 16:15:14 -04:00
research  Add research: Southeast Asian scripts, OpenType MATH formula extraction  2026-05-16 16:21:48 -04:00