pdftract/docs/research
jedarden f805e52fa3 Add four research documents focused on readable text production
- type3-font-extraction: CharProcs stream parsing, TeX/dvips naming
  conventions, dHash shape fingerprinting, nested font stacks, OCR fallback
- watermark-and-background-separation: five PDF watermark mechanisms,
  transparency tracking, cross-page repetition, WCAG contrast detection,
  raster inpainting, diagonal watermark removal pipeline
- historical-and-degraded-document-extraction: eight degradation categories,
  bleed-through removal, illumination correction, Sauvola binarization,
  stroke reconstruction, Fraktur/long-s handling, confidence-gated output
- complex-layout-reading-order: baseline clustering, XY-cut, Docstrum,
  RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering,
  perplexity-based confidence with natural_order fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:13:10 -04:00
.gitkeep Initial repo scaffold with README and docs structure 2026-05-16 14:26:16 -04:00
benchmark-and-test-methodology.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
chunking-for-llm-consumption.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
complex-layout-reading-order.md Add four research documents focused on readable text production 2026-05-16 15:13:10 -04:00
document-classification-and-zone-labeling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
embedded-files-and-portfolios.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
form-fields-and-annotations.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
glyph-recognition-and-unicode-recovery.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
historical-and-degraded-document-extraction.md Add four research documents focused on readable text production 2026-05-16 15:13:10 -04:00
image-and-figure-extraction.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
invisible-and-hidden-text.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
language-detection-and-script-handling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
malformed-pdf-repair-and-recovery.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
mathematical-expression-handling.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
optional-content-groups.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
page-geometry-and-document-structure.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
pdf-encryption-and-security.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
pdf-fonts-and-encoding.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
pdf-specification.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
performance-and-streaming-architecture.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
post-extraction-normalization.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
post-ocr-text-correction.md Add four research documents on text quality and document-type handling 2026-05-16 15:07:30 -04:00
presentation-and-spreadsheet-pdfs.md Add four research documents on text quality and document-type handling 2026-05-16 15:07:30 -04:00
raster-ocr-pipeline.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
semantic-text-reconstruction.md Add four research documents on text quality and document-type handling 2026-05-16 15:07:30 -04:00
table-structure-reconstruction.md Add six research documents covering output-side extraction topics 2026-05-16 14:56:25 -04:00
tagged-pdf-structure-and-reading-order.md Add research docs and SDK invocation notes 2026-05-16 14:33:34 -04:00
text-readability-validation.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00
type3-font-extraction.md Add four research documents focused on readable text production 2026-05-16 15:13:10 -04:00
watermark-and-background-separation.md Add four research documents focused on readable text production 2026-05-16 15:13:10 -04:00
xmp-and-document-metadata.md Add 12 research documents covering full PDF extraction surface 2026-05-16 15:05:42 -04:00