pdftract

jedarden b805593973	2026-05-16 14:56:25 -04:00

Add six research documents covering output-side extraction topics

- table-structure-reconstruction: line detection, gap analysis, Hough transform, graph-based cell reconstruction, merged cells, multi-page tables
- mathematical-expression-handling: five encoding cases, OpenType MATH table, symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers
- language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi, CJK vertical text, ligature normalization, whatlang/lingua integration
- document-classification-and-zone-labeling: margin heuristics, font clustering, cross-page recurrence, footnote/caption/sidebar detection
- post-extraction-normalization: hyphen handling, ligature expansion, paragraph reconstruction, Unicode normalization, pipeline ordering
- chunking-for-llm-consumption: semantic snapping, heading hierarchy, sliding window overlap, table chunking strategies, token budget, late chunking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

notes	Add SDK architecture notes covering top 10 languages	2026-05-16 14:51:25 -04:00
plan	Initial repo scaffold with README and docs structure	2026-05-16 14:26:16 -04:00
research	Add six research documents covering output-side extraction topics	2026-05-16 14:56:25 -04:00