pdftract

History

jedarden 04b60a1cf7 Add three research documents: CJK encoding, pipeline synthesis, linearization - cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0 composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1, Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via Adobe CID tables, full-width normalization, vertical text detection - extraction-pipeline-overview: end-to-end 9-stage synthesis referencing all 36 research documents; stages: file open, metadata, page classification, content extraction (4 sub-paths), font pipeline, span assembly, normalization and quality, supplementary content, output serialization; ASCII data-flow diagram - linearized-pdf-and-streaming: linearization dict keys, hint stream bitfield tables, first-page xref lazy parsing, HTTP range request pattern, staleness validation, incremental update interaction, NDJSON streaming, partial file extraction, lazy PageIter API with rayon par_bridge Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>		2026-05-16 15:26:36 -04:00
..
notes	Add SDK architecture notes covering top 10 languages	2026-05-16 14:51:25 -04:00
plan	Initial repo scaffold with README and docs structure	2026-05-16 14:26:16 -04:00
research	Add three research documents: CJK encoding, pipeline synthesis, linearization	2026-05-16 15:26:36 -04:00