Commit graph

1 commit

Author SHA1 Message Date
jedarden
04b60a1cf7 Add three research documents: CJK encoding, pipeline synthesis, linearization
- cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0
  composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1,
  Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via
  Adobe CID tables, full-width normalization, vertical text detection
- extraction-pipeline-overview: end-to-end 9-stage synthesis referencing
  all 36 research documents; stages: file open, metadata, page classification,
  content extraction (4 sub-paths), font pipeline, span assembly, normalization
  and quality, supplementary content, output serialization; ASCII data-flow
  diagram
- linearized-pdf-and-streaming: linearization dict keys, hint stream
  bitfield tables, first-page xref lazy parsing, HTTP range request pattern,
  staleness validation, incremental update interaction, NDJSON streaming,
  partial file extraction, lazy PageIter API with rayon par_bridge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:26:36 -04:00