jedarden/pdftract

Fork 0

Commit graph

Author	SHA1	Message	Date
jedarden	116db89c95	Add three research documents on routing and text reconstruction - word-boundary-reconstruction: expected position formula with Tc/Tw/Tz, TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold strategies including adaptive histogram, multi-column gap discrimination - scanned-vs-vector-page-classification: four-category taxonomy, fast pre-checks, image coverage AABB computation, character density ratio, validity rate, glyph bbox plausibility, region routing map, confidence scoring with cost-aware OCR threshold - pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP pdfaid detection, Level B/U/A guarantee implications for extraction, font embedding requirements, artifact tagging, PDF/A-3 embedded files, PdfaLevel enum with per-level fast-path branching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:22:08 -04:00

Author

SHA1

Message

Date

jedarden

116db89c95

Add three research documents on routing and text reconstruction

- word-boundary-reconstruction: expected position formula with Tc/Tw/Tz,
  TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold
  strategies including adaptive histogram, multi-column gap discrimination
- scanned-vs-vector-page-classification: four-category taxonomy, fast
  pre-checks, image coverage AABB computation, character density ratio,
  validity rate, glyph bbox plausibility, region routing map, confidence
  scoring with cost-aware OCR threshold
- pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP
  pdfaid detection, Level B/U/A guarantee implications for extraction,
  font embedding requirements, artifact tagging, PDF/A-3 embedded files,
  PdfaLevel enum with per-level fast-path branching

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 15:22:08 -04:00

1 commit