Add three research documents: CJK encoding, pipeline synthesis, linearization

- cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0
composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1,
Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via
Adobe CID tables, full-width normalization, vertical text detection
- extraction-pipeline-overview: end-to-end 9-stage synthesis referencing
all 36 research documents; stages: file open, metadata, page classification,
content extraction (4 sub-paths), font pipeline, span assembly, normalization
and quality, supplementary content, output serialization; ASCII data-flow
diagram
- linearized-pdf-and-streaming: linearization dict keys, hint stream
bitfield tables, first-page xref lazy parsing, HTTP range request pattern,
staleness validation, incremental update interaction, NDJSON streaming,
partial file extraction, lazy PageIter API with rayon par_bridge

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>