pdftract/docs/research
jedarden b805593973 Add six research documents covering output-side extraction topics
- table-structure-reconstruction: line detection, gap analysis, Hough
  transform, graph-based cell reconstruction, merged cells, multi-page tables
- mathematical-expression-handling: five encoding cases, OpenType MATH table,
  symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers
- language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi,
  CJK vertical text, ligature normalization, whatlang/lingua integration
- document-classification-and-zone-labeling: margin heuristics, font
  clustering, cross-page recurrence, footnote/caption/sidebar detection
- post-extraction-normalization: hyphen handling, ligature expansion,
  paragraph reconstruction, Unicode normalization, pipeline ordering
- chunking-for-llm-consumption: semantic snapping, heading hierarchy,
  sliding window overlap, table chunking strategies, token budget, late chunking

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:56:25 -04:00
.gitkeep Initial repo scaffold with README and docs structure 2026-05-16 14:26:16 -04:00
chunking-for-llm-consumption.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
document-classification-and-zone-labeling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
glyph-recognition-and-unicode-recovery.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
language-detection-and-script-handling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
mathematical-expression-handling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
pdf-fonts-and-encoding.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
pdf-specification.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
post-extraction-normalization.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
table-structure-reconstruction.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
tagged-pdf-structure-and-reading-order.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00