From 92e6196ac53373ab5f4c64ef9cea1bec2c9e865f Mon Sep 17 00:00:00 2001 From: jedarden Date: Sat, 16 May 2026 16:24:21 -0400 Subject: [PATCH] Add research: Ruby/furigana typography, PDF/VT variable printing Two new research documents covering Japanese Ruby text and East Asian typography (tagged/untagged furigana extraction, Kinsoku Shori spacing, full-width normalization, tate-chu-yoko, CJK/Latin boundary detection, ruby_text output field) and PDF/VT variable and transactional printing (DPart hierarchy traversal, per-record extraction model, DPM metadata, variable vs. static content classification, postal address extraction, records array output schema). Co-Authored-By: Claude Sonnet 4.6 --- .../pdfvt-variable-transactional-printing.md | 74 ++++++++++++ .../ruby-text-and-east-asian-typography.md | 114 ++++++++++++++++++ 2 files changed, 188 insertions(+) create mode 100644 docs/research/pdfvt-variable-transactional-printing.md create mode 100644 docs/research/ruby-text-and-east-asian-typography.md diff --git a/docs/research/pdfvt-variable-transactional-printing.md b/docs/research/pdfvt-variable-transactional-printing.md new file mode 100644 index 0000000..bd50596 --- /dev/null +++ b/docs/research/pdfvt-variable-transactional-printing.md @@ -0,0 +1,74 @@ +# PDF/VT Variable and Transactional Printing Document Extraction + +## Overview + +PDF/VT is an ISO standard (ISO 16612-2) designed specifically for variable and transactional printing workflows. It exists in two conformance levels: PDF/VT-1, which is a single self-contained file based on PDF/X-4, and PDF/VT-2, which supports a file set where page content may reference external files via Reference XObjects. The standard targets high-volume personalized output: direct mail campaigns, monthly billing statements, investment account summaries, insurance policy documents, and utility invoices. 
A single PDF/VT file may contain thousands of recipient records, each spanning one or more pages, all packed into one bytestream. This structure imposes extraction challenges that flat-page models cannot address adequately. + +Where a standard PDF represents a single coherent document, a PDF/VT file is better understood as a batch container. Each record within it is a logically independent document addressed to one recipient. The pages of one record are not meaningfully related to the pages of the next. Extracting the file as a flat sequence of pages and concatenating text produces a result that is structurally meaningless for downstream use: addresses interleave with unrelated balances, and transaction rows from different accounts merge into a single stream. pdftract must treat PDF/VT as a record-oriented format and surface an extraction model that matches its intended semantics. + +## DPart Hierarchy and Record Enumeration + +The structural backbone of PDF/VT is the Document Part (DPart) tree. The document catalog contains a `/DPartRoot` entry; the dictionary it points to names the root node of the tree through its `/DPartRootNode` entry. Each node in the tree is a DPart dictionary; leaf nodes represent individual recipient records. Interior nodes can group records (by region, product line, or processing batch), but the extractable data lives at the leaves. + +Each leaf DPart dictionary carries a `/Start` entry and, when the part spans more than one page, an `/End` entry. Both are indirect references to page objects (the first and last page of the part) rather than page numbers, so pdftract must resolve them to page indices through the page tree. To enumerate records, pdftract must walk the DPart tree from the root, recursively following the `/DParts` entries at interior nodes (each an array of arrays of child references), and collect all leaf nodes in document order. The page range `[Start, End]` for each leaf defines exactly which pages belong to that recipient's document. In a conforming PDF/VT file these ranges are non-overlapping and together cover all pages in the file. + +The traversal logic cannot assume a fixed tree depth. 
A billing run may use a two-level tree (root → records), while a more complex campaign may insert grouping levels (root → region → record-batch → records). pdftract's DPart walker must handle arbitrary depth and treat any node with no `/DParts` array as a leaf regardless of depth. + +## Document Part Metadata + +Each DPart node may carry a `/DPM` (Document Part Metadata) dictionary. At leaf nodes, this dictionary is the primary source of structured per-record data. The `/DPM` dictionary is not arbitrary — it follows an XMP-based schema convention. For transactional documents, it commonly encodes account number, recipient name, mailing address, statement period, amount due, and any segmentation variables used during composition. These fields are present as XMP property paths within an embedded metadata stream. + +pdftract should extract the DPM at each leaf DPart and surface it as structured metadata alongside the text content of that record's pages. Because XMP is XML-based, the extraction path is: locate the `/DPM` dict entry in the leaf DPart, retrieve the associated metadata stream, parse the XMP XML, and flatten the relevant namespaces into key-value pairs. The exact namespaces are document-specific — PDF/VT does not mandate a universal schema — so pdftract should emit the raw namespace-prefixed keys and let callers filter for what they need. + +This metadata is authoritative for fields like account number and recipient ID. It was written by the composition system before printing and is more reliable than text extracted from the rendered page, which may be subject to font substitution, encoding issues, or layout-driven truncation. + +## Variable vs. Static Content + +PDF/VT separates variable from static content through two mechanisms: Form XObjects and Reference XObjects. A Form XObject is a self-contained content stream stored once in the file and rendered by reference. 
A page content stream for one record's page may invoke the `Do` operator to draw the company letterhead, legal footer, or column headers, all stored as Form XObjects that are shared across every record in the file. The variable portion (recipient name, account balance, transaction rows) appears directly in the page content stream or in record-specific Form XObjects referenced only from that record's pages. + +For text extraction, this distinction matters because text within a shared Form XObject is static (identical for every recipient), while text in the page's own content stream or in record-local XObjects is the variable payload. pdftract should track XObject usage during extraction and annotate text spans with a `source` field indicating whether the text originates from a shared Form XObject, a record-specific Form XObject, or directly from the page stream. This allows downstream consumers to suppress boilerplate and focus on variable content. + +Identifying shared Form XObjects requires tracking which XObjects are referenced from more than one DPart's page set. pdftract can build a reference map during a first pass: for each Form XObject in the file, record the set of pages that invoke it via `Do`. After DPart enumeration, XObjects invoked from pages belonging to multiple distinct records are static. XObjects invoked exclusively from pages within a single record are record-specific. + +## Reference XObjects in PDF/VT-2 + +PDF/VT-2 allows page content to reference content stored in external files via Reference XObjects. A Reference XObject is a Form XObject (`/Subtype /Form`) whose dictionary carries a `/Ref` entry; the reference dictionary under `/Ref` holds an `/F` entry pointing to an external file specification and a `/Page` entry indicating which page of that file to use. This enables large static assets (template forms, product images, legal blocks) to live in separate files shared across many PDF/VT-2 print jobs without being duplicated. 
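
The detection step can be sketched as follows. This is a minimal illustration, not pdftract's implementation. It assumes the ISO 32000 layout, in which a Reference XObject is a Form XObject carrying a `/Ref` dictionary with `/F` and `/Page`. It also assumes XObject dictionaries have already been parsed into plain Python dicts keyed by PDF name strings. The names `ExternalReference` and `find_external_reference` are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class ExternalReference:
    """Describes one Reference XObject target (hypothetical output shape)."""
    file_spec: str          # file specification string from /F
    page: Union[int, str]   # page index or page label from /Page
    resolved: bool          # True when the target file exists beside the input


def find_external_reference(xobject: dict, base_dir: Path) -> Optional[ExternalReference]:
    """Return the external target of a Reference XObject, or None.

    `xobject` is an already-parsed XObject dictionary (an assumption of this
    sketch). A missing target file is reported, never raised as an error.
    """
    if xobject.get("/Subtype") != "/Form":
        return None
    ref = xobject.get("/Ref")
    if not isinstance(ref, dict) or "/Page" not in ref:
        return None

    file_spec = ref.get("/F")
    # A file specification may be a plain string or a dictionary holding one.
    if isinstance(file_spec, dict):
        file_spec = file_spec.get("/UF") or file_spec.get("/F")
    if not isinstance(file_spec, str):
        return None

    return ExternalReference(
        file_spec=file_spec,
        page=ref["/Page"],
        resolved=(base_dir / file_spec).is_file(),
    )
```

Returning a record with `resolved=False` instead of raising lets the caller preserve the file specification and page in the output and carry on with the rest of the record.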
+ +pdftract operating in single-file mode — its primary mode for PDF/VT-1 — will not encounter Reference XObjects with external targets. When processing a PDF/VT-2 file, the external files may or may not be present alongside the primary file. pdftract should detect Reference XObjects during content stream parsing. When the referenced file is accessible, pdftract can resolve and inline the referenced content for extraction purposes. When the file is not present, pdftract should record the reference in the output (file specification string, page number) and continue rather than failing. The text contribution of an unresolved Reference XObject is noted as absent with the external reference identifier preserved. + +## Postal Address Block Extraction + +The first page of each recipient record in a transactional PDF/VT document typically contains a postal address block positioned within a specific bounding region — usually the upper-right quadrant for window-envelope compatibility, or upper-left depending on envelope format. This block contains the recipient's name and mailing address formatted for postal processing. + +pdftract should implement position-aware address extraction at the record level. Rather than relying on semantic parsing of free-form text, the extraction should identify the canonical address region by position heuristic: text runs appearing within the upper portion of page one of each record, horizontally offset to the windowed position, and typeset in a distinct font size from surrounding body text. The extracted lines within this bounding box are assembled in top-to-bottom order to form the address block. This region can be configured per document or inferred from DPM metadata if the composition system embeds the address coordinates there. + +The address block can be further parsed into structured fields (recipient name, street, city, state, postal code, country) using a lightweight address grammar. 
For US domestic addresses the USPS-standard structure is reliable; for international addresses, pdftract should emit the raw lines and a `country` hint derived from the last line or from DPM metadata. + +## Text That Appears Identical to Static Content + +Some PDF/VT composition engines do not use Form XObjects to separate variable from static text. Instead, they generate each page's content stream in full, repeating the static layout text alongside the variable text. In this case, the page content stream for record 47 and the stream for record 48 both contain the full text of the legal footer, column headers, and section titles — copied verbatim — and differ only in the variable fields. + +pdftract cannot rely on XObject structure to identify variable content in such files. The DPart tree remains the authoritative guide: text on pages within one DPart leaf belongs to one record, and that is the unit of extraction. For downstream deduplication of static text, pdftract can optionally compute text fingerprints per text run and flag runs that appear identically across more than a configurable threshold of records. These high-frequency runs are likely static template content. This analysis is a post-extraction hint, not a primary extraction feature. + +## Extraction Model and Output Schema + +The output schema for PDF/VT documents must reflect the record-oriented nature of the format. When pdftract detects a `/DPartRoot` in the document catalog, it switches to record extraction mode. The top-level output is a JSON object with a `records` array. 
Each element in the array corresponds to one leaf DPart and contains: + +- `record_index`: zero-based position in DPart traversal order +- `page_range`: `{ "start": N, "end": M }` using one-based page numbers matching PDF convention +- `dpm_metadata`: key-value pairs extracted from the DPM XMP stream, or null if no DPM is present +- `pages`: array of per-page extraction objects (text spans with position and font metadata, identical in structure to pdftract's standard page output) + +The document-level object also carries `dpart_depth`, the maximum depth of the DPart tree, and `record_count`, the total number of leaf DParts. If a `/DPartRoot` is absent, pdftract falls back to flat extraction mode and produces the standard single-document output without a `records` array. This fallback must always be available: not all PDF/VT generators correctly set `/DPartRoot`, and callers processing mixed batches should not require pre-classification of input files. + +## Statement and Invoice Documents + +The canonical PDF/VT use case — the monthly billing statement — illustrates all of these extraction requirements together. The static frame includes the company name and logo area (text or Form XObject), column headers for the transaction table, legal disclosure text in a reduced font size, and the payment stub layout at the bottom. The variable payload includes the account holder name and address block, account number, statement period dates, each transaction row (date, description, amount), running balance, total due, minimum payment, and payment due date. + +For pdftract, the statement extraction goal is to produce per-record JSON objects where the DPM metadata carries the authoritative account number and recipient identity, the address block extraction produces a structured postal address, and the page text spans include the transaction rows tagged with their position data so that tabular structure reconstruction can group them into rows. 
The transaction table is the highest-value extractable element in a statement PDF — it is the data that downstream reconciliation, audit, and analytics systems need. Correct extraction requires that table rows are associated with the correct record, not bled across a record boundary at a page seam. + +pdftract's page boundary handling in record mode must never split a record's pages when assembling text. The page sequence `[Start, End]` from the DPart leaf defines a closed interval; text from page `End` of one record and page `Start` of the next must remain in separate record objects even when those pages are physically adjacent in the PDF page tree. + +## Implementation Priority + +The foundational requirement is correct DPart tree walking and page-range assignment before any text extraction begins. All subsequent extraction — DPM metadata, address block detection, XObject classification — depends on accurate record segmentation. A PDF/VT file processed without DPart awareness produces output that is technically complete but semantically incorrect for any use case involving per-recipient data. DPart support is not an optional enhancement for pdftract; it is the minimum viable feature for correct PDF/VT handling. diff --git a/docs/research/ruby-text-and-east-asian-typography.md b/docs/research/ruby-text-and-east-asian-typography.md new file mode 100644 index 0000000..d0dd7c7 --- /dev/null +++ b/docs/research/ruby-text-and-east-asian-typography.md @@ -0,0 +1,114 @@ +# Ruby Text and East Asian Typography + +## Overview + +Japanese and broader East Asian PDFs present a distinct set of extraction challenges that go beyond the concerns of Latin-script documents. Ruby annotations (furigana), vertical writing modes, full-width character normalization, CJK punctuation conventions, and mixed-script line composition all require dedicated handling. This document specifies what pdftract must implement to extract readable, semantically accurate text from these documents. 
+ +--- + +## 1. Ruby Text in Tagged PDFs + +The PDF specification defines a Ruby structure type for encoding phonetic annotations alongside base text. A Ruby element contains three child types: `RB` (ruby base), `RT` (ruby text, the phonetic gloss), and optionally `RP` (ruby punctuation, used as fallback delimiters in non-ruby-aware renderers). + +When pdftract processes a tagged PDF, the structure tree is traversed before any geometric analysis. On encountering a `Ruby` structure element, the extractor must: + +1. Collect all `RB` children and concatenate their marked-content spans to form the base text string. +2. Collect all `RT` children and concatenate their spans to form the phonetic annotation string. +3. Discard `RP` children from the output text; they are presentational fallbacks and should not appear in the extracted result. + +The output span for a tagged Ruby element carries two distinct fields: `text` holds the base characters (e.g., the kanji), and `ruby_text` holds the phonetic annotation (e.g., the hiragana reading). These are never merged. Merging them would produce a string that interleaves kanji and kana in an order that misrepresents the document's content and breaks downstream NLP pipelines. + +--- + +## 2. Untagged Ruby: Geometric Detection + +Most Japanese PDFs in the wild are not tagged. Furigana in untagged documents appears as a cluster of small-font glyphs positioned above (or, in vertical text, beside) the corresponding kanji. Detecting and associating these annotations requires geometric reasoning. + +The primary signal is font size ratio. Furigana glyphs are conventionally set at half the size of the base text, so a ruby-to-base size ratio below 0.6 is a practical cutoff that leaves some margin above the nominal 0.5. pdftract computes this ratio per span by comparing the font size of candidate small glyphs against the dominant font size on the line. 
Any span with a size ratio below 0.6 whose baseline sits above the baseline of the base text (a higher y value in default PDF user space, where y increases upward) is flagged as a ruby candidate. + +Association proceeds by horizontal overlap. For each ruby candidate span, pdftract finds the base text span or spans whose x-extent overlaps the candidate's x-extent. Where a single ruby candidate overlaps multiple base characters, the annotation is associated with the full overlapping base segment as a unit, not with individual glyphs. This mirrors the tagged structure, where the ruby text annotates the ruby base as a whole, not character by character. + +Once association is confirmed, the small-font spans are removed from the main text stream and placed into `ruby_text` fields on the corresponding base spans. This prevents furigana from being emitted inline, which would corrupt word boundaries and reading order. + +--- + +## 3. Japanese Justification (Kinsoku Shori) + +Japanese typesetting enforces line-boundary constraints called kinsoku shori. Certain characters (closing brackets, closing parentheses, the ideographic period 。, the ideographic comma 、) must not appear at the start of a line. Others (opening brackets, opening parentheses) must not appear at the end. PDF generators enforce these constraints by distributing extra inter-character spacing across the line, rather than by inserting visible gaps between words. + +pdftract's word-boundary detection must not misread kinsoku-adjusted spacing as word gaps. The correct approach is to apply a CJK-aware gap threshold: for lines where the dominant script is CJK, word boundaries are detected only when inter-glyph spacing exceeds a substantially higher fraction of the em width than would apply to Latin text. In practice, adjacent CJK characters with spacing up to roughly 0.5 em should be treated as part of the same word. Spacing introduced by kinsoku shori is generally far smaller than this threshold. 
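
The script-dependent threshold can be sketched as follows. This is an illustrative outline under stated assumptions, not pdftract's code. The `Glyph` type, the `segment_line` function, the simple code-point ranges used to decide whether a line is CJK-dominant, and the 0.25 em Latin threshold are all hypothetical choices made for the example. Only the roughly 0.5 em CJK figure comes from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x0: float  # left edge, in the same units as font_size
    x1: float  # right edge


def is_cjk(ch: str) -> bool:
    """Rough CJK test over a few common blocks (an assumption of this sketch)."""
    cp = ord(ch)
    return (
        0x3000 <= cp <= 0x30FF      # CJK punctuation, hiragana, katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK unified ideographs
        or 0xFF00 <= cp <= 0xFFEF   # halfwidth and fullwidth forms
    )


def segment_line(glyphs: List[Glyph], font_size: float,
                 cjk_gap_em: float = 0.5, latin_gap_em: float = 0.25) -> List[str]:
    """Split one line into tokens using a script-dependent gap threshold."""
    if not glyphs:
        return []
    # A line counts as CJK-dominant when more than half of its glyphs are CJK.
    cjk_dominant = sum(is_cjk(g.char) for g in glyphs) * 2 > len(glyphs)
    threshold = (cjk_gap_em if cjk_dominant else latin_gap_em) * font_size

    tokens = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x0 - prev.x1
        if gap > threshold:
            tokens.append(cur.char)   # gap is wide enough to start a new token
        else:
            tokens[-1] += cur.char    # justification spacing: same token
    return tokens
```

With a 10-unit font, a 3-unit gap splits a Latin line under these defaults but leaves a CJK-dominant line intact, which is the behaviour the kinsoku discussion calls for.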
+ +Additionally, pdftract should detect lines where the last character is a kinsoku-prohibited opening bracket or the first character is a kinsoku-prohibited closing mark, and flag these as potential justification artifacts rather than paragraph breaks. + +--- + +## 4. Full-Width and Half-Width Character Normalization + +East Asian PDFs frequently mix full-width forms of Latin characters and digits (U+FF01–U+FF5E range) with their ASCII equivalents. Full-width Latin letters and digits are semantically identical to their half-width counterparts and should be normalized to ASCII for interoperability with downstream text processing. + +pdftract exposes this normalization as a configurable option, enabled by default. When active, full-width Latin letters (Ａ–Ｚ, ａ–ｚ) and full-width digits (０–９) are mapped to their ASCII equivalents (A–Z, a–z, 0–9). Full-width punctuation, including the ideographic space (U+3000), fullwidth comma (U+FF0C), and fullwidth full stop (U+FF0E), is preserved by default, because full-width punctuation carries typographic meaning in CJK contexts and its conversion would alter the visual and semantic representation of the source document. + +Half-width katakana (U+FF65–U+FF9F) is left unconverted: normalizing it to full-width katakana changes glyph identity in ways that are not always desirable, and the correct mapping is context-dependent. Applications requiring half-to-full katakana normalization should apply Unicode NFKC normalization independently. + +--- + +## 5. Tate-Chu-Yoko and Vertical Mode Punctuation Rotation + +Vertical writing mode (tategumi) is covered in the multilingual document extraction research. Two Japanese-specific complications are addressed here. + +Tate-chu-yoko (縦中横) is the convention of typesetting a short horizontal sequence (typically a two- or three-digit number, a Latin abbreviation, or a year) horizontally within a vertical text column. 
In PDFs this appears as a text run whose individual glyphs are rotated 90 degrees relative to the surrounding vertical text, positioned so that the horizontal sequence reads left-to-right as a unit within the downward flow. + +pdftract detects tate-chu-yoko by identifying short horizontal runs (two to four glyphs) embedded in a vertical writing-mode column whose glyph transform matrices indicate a 90-degree rotation relative to the enclosing text direction. These runs are extracted as a single token and inserted into the vertical reading order at the correct position. + +Punctuation rotation is a related issue. In vertical mode, certain punctuation glyphs (commas, periods, brackets) are rotated or replaced with alternate glyph forms whose center point sits at the glyph center rather than the baseline. pdftract must account for this when computing bounding boxes and reading order for vertical text runs, using the glyph's advance vector in the text matrix rather than assuming a fixed baseline. + +--- + +## 6. CJK Punctuation Preservation + +The ideographic period (。 U+3002), corner bracket open (「 U+300C), corner bracket close (」 U+300D), ideographic comma (、 U+3001), and katakana middle dot (・ U+30FB) each carry specific semantic and typographic roles in CJK text. pdftract preserves these characters exactly as encoded. They must not be stripped, replaced with their Latin near-equivalents, or normalized by Unicode composition. + +Spacing around CJK punctuation in extraction output follows the source glyphs: if the PDF places no space before an ideographic period, none is emitted. This differs from Latin punctuation normalization, where pdftract may insert or collapse spaces around sentence-terminal marks. + +--- + +## 7. Proportional vs. Monospaced CJK Glyphs + +Traditional CJK typesetting assigns every character an advance width of exactly one em, producing a monospaced grid. 
Modern OpenType CJK fonts — particularly those used in web-origin or office-suite PDFs — may assign proportional advance widths, so a narrow character like ー (katakana long vowel mark) may have an advance width less than 1 em. + +This matters for word-boundary detection. pdftract's gap threshold for CJK is expressed as a fraction of the observed advance width, not a fixed em fraction. When advance widths are uniform (monospaced), the threshold is a fraction of that width. When advance widths vary, pdftract computes the gap as actual inter-glyph white space (the distance between one glyph's right edge and the next glyph's left edge) and applies the threshold to that value. This prevents proportionally-spaced CJK from generating false word breaks at every narrow character. + +--- + +## 8. Mixed CJK/Latin Line Composition + +A single line in a Japanese document may contain Latin words, numerals, or identifiers interspersed with CJK characters. The gap-detection logic must treat CJK-to-Latin and Latin-to-CJK transitions differently from Latin-to-Latin transitions. + +Between adjacent CJK characters, no gap is expected and none is inserted in the output. Between a CJK character and an adjacent Latin character (or vice versa), a thin space is conventionally added by the PDF generator. pdftract detects this thin space — typically around 0.25 em — and does not treat it as a word boundary. A full word boundary between a CJK and Latin token requires a gap substantially larger than this conventional thin space, typically exceeding 0.5 em. Latin-to-Latin word boundaries use the standard Latin gap threshold independently of the surrounding CJK context. + +--- + +## 9. Chinese Traditional vs. Simplified Detection + +The most authoritative source for distinguishing Traditional Chinese (zh-TW, zh-HK) from Simplified Chinese (zh-CN) is the `/Lang` attribute on the document catalog, the page dictionary, or individual structure elements. 
pdftract reads this attribute at the lowest available level — structure element first, then page, then catalog — and tags extracted spans with the resolved language code. + +When no `/Lang` attribute is present, pdftract falls back to character-set heuristics. A set of characters exists that appear only in Simplified Chinese orthography (simplified-only characters with no traditional equivalent in common use) and a complementary set for Traditional Chinese. pdftract maintains a compact lookup table of high-frequency discriminating characters. If a page's character content contains more than a configurable threshold of simplified-only characters and zero traditional-only characters, the page is labeled `zh-CN`. The reverse yields `zh-TW`. Mixed or ambiguous pages receive no language label from this heuristic; the caller is expected to treat unlabeled pages as undetermined. + +--- + +## 10. Output Representation for Ruby + +Whether ruby is detected via tagged structure or geometric inference, the output schema is uniform. Each span that carries a phonetic annotation emits: + +- `text`: the base text (kanji or base characters), used as the primary readable content. +- `ruby_text`: the phonetic annotation (hiragana or katakana reading), stored as a sibling field on the span. + +Applications that need plain text concatenate only `text` fields, producing natural-reading prose without interleaved annotations. Applications that need both readings — dictionary tools, accessibility pipelines, OCR training data generators — read `ruby_text` from the same span without requiring a separate parse pass. + +Base and ruby text are never concatenated, interleaved by character, or emitted as parenthetical inline annotations by default. An optional rendering mode for accessibility output may emit the ruby parenthesis form `base(ruby)`, but this must be an explicit opt-in and must not be the default behavior. 
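
The span shape and the two rendering modes described in this section can be sketched as follows. This is a minimal illustration that assumes a span reduced to just the two fields discussed here; the `Span` class and both function names are hypothetical, and real spans would also carry position and font metadata.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    text: str                        # base text (kanji or other base characters)
    ruby_text: Optional[str] = None  # phonetic annotation, if any


def plain_text(spans: List[Span]) -> str:
    """Default rendering: base text only, annotations never interleaved."""
    return "".join(span.text for span in spans)


def annotated_text(spans: List[Span]) -> str:
    """Opt-in rendering that emits the `base(ruby)` parenthesis form."""
    return "".join(
        f"{span.text}({span.ruby_text})" if span.ruby_text else span.text
        for span in spans
    )


spans = [Span("漢字", "かんじ"), Span("を読む")]
print(plain_text(spans))      # 漢字を読む
print(annotated_text(spans))  # 漢字(かんじ)を読む
```

Keeping `ruby_text` as a sibling field means both renderings are derived from the same span list, so a caller can switch modes without re-running extraction.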
+ +--- + +## Implementation Priority + +Ruby detection (both tagged and geometric) and kinsoku-aware gap thresholds are the highest-priority items, as they directly determine whether Japanese PDF text is readable after extraction. Full-width normalization and CJK punctuation preservation are low-risk, high-correctness improvements that can be implemented early. Tate-chu-yoko detection and proportional CJK handling are lower frequency but must be addressed before pdftract is considered production-ready for Japanese documents.