Infrastructure and parsing:
- raster-ocr-pipeline: trigger detection, preprocessing, Tesseract integration, assisted OCR, HOCR alignment, multi-language, performance
- image-and-figure-extraction: XObjects, inline images, filter decoding, color spaces, geometry, form XObjects, transparency, figure detection
- form-fields-and-annotations: AcroForm types, XFA, widget appearance streams, rich text, annotation text, output schema
- pdf-encryption-and-security: R2-R6 key derivation, object-level decryption, permission flags, RustCrypto implementation approach
- page-geometry-and-document-structure: page tree, all five page boxes, rotation, coordinate inversion, page labels, outlines, named destinations
- optional-content-groups: OCG/OCMD visibility, usage dictionary, default state resolution, content stream marking, multilingual layer patterns
- invisible-and-hidden-text: all 8 Tr modes, PDF/A invisible layer pattern, white-on-white, zero-opacity, clipped text, color tracking
- malformed-pdf-repair-and-recovery: xref recovery, stream length repair, syntax tolerance, partial extraction, structured warnings

Quality and metadata:
- xmp-and-document-metadata: /Info vs XMP, all namespaces, RDF/XML parsing, conflict resolution, encrypted metadata, thumbnails
- embedded-files-and-portfolios: EmbeddedFile streams, Filespec, AF relationships, Portfolio detection, ZUGFeRD/Factur-X, security
- performance-and-streaming-architecture: mmap, lazy loading, NDJSON streaming, rayon parallelism, font caching, axum HTTP server
- benchmark-and-test-methodology: CER/WER/TEDS metrics, corpus categories, reading order scoring, regression CI, public datasets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
265 lines
13 KiB
Markdown
# Optional Content Groups (Layers) in PDF Extraction
## 1. OCG Overview

Optional Content Groups (OCGs) are named, independently togglable layers defined in ISO 32000-2 §8.11. Each OCG represents a logical grouping of PDF content (text, graphics, images, or annotations) that can be rendered or suppressed as a unit. Layers were introduced to support use cases such as technical drawings (where construction lines appear on a separate layer), multilingual documents (one layer per locale), and print-vs-screen variants.

From an extraction standpoint, OCGs are critical because content on an off layer **must not** be treated as visible text. A PDF viewer filters out off-layer content before rendering; an extraction library that ignores OCG state will silently include invisible text (watermarks, alternate-language duplicates, or suppressed annotations) mixed with the visible body text.

OCGs are registered in the document catalog under `/OCProperties` (ISO 32000-2 §8.11.4). The structure is:

```
/OCProperties <<
  /OCGs [ ref1 ref2 ref3 ]       % all OCGs in the document
  /D << ... >>                   % default configuration dictionary
  /Configs [ << ... >> ]         % optional additional named configurations
>>
```

`/OCGs` lists an indirect reference to every OCG in the document. A conforming reader must process this array to build the initial visibility table before parsing any content stream.

---

## 2. OCG Dictionary

Each OCG is an indirect object of the form:

```
<< /Type /OCG
   /Name (English Text)
   /Intent /View
   /Usage << ... >>
>>
```

- `/Type /OCG`: required; marks the object as an OCG.
- `/Name`: required; a UTF-16BE or PDFDocEncoding string giving a human-readable layer name (e.g., `"Background"`, `"English"`, `"HeaderFooter"`). Used for display in layer panels and, in pdftract, as the `ocg_name` tag on extracted spans.
- `/Intent`: optional; a name or array of names (`/View`, `/Design`, or application-defined). `/View` means the OCG governs visibility for screen rendering and, by convention, for extraction. `/Design` means it governs visibility in design tools. If absent, treat as `/View`.
- `/Usage`: optional dictionary; machine-readable context hints that drive automatic state computation from the `/AS` (auto-state) rules in the default configuration.

OCGs are referenced from content streams via the property list mechanism (see §6), from XObject dictionaries via `/OC`, and from annotation dictionaries via `/OC`.

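Because `/Name` may arrive in either encoding, the layer name has to be decoded before it can be used as `ocg_name`. The following is a minimal sketch, not pdftract's actual decoder: `decode_ocg_name` is a hypothetical helper, and it approximates PDFDocEncoding with Latin-1, which is only correct for bytes where the two encodings agree.

```rust
/// Decode the raw bytes of an OCG `/Name` string (hypothetical helper).
///
/// A leading FE FF byte-order mark signals UTF-16BE. Anything else is treated
/// as PDFDocEncoding, approximated here by Latin-1. That is an assumption of
/// this sketch: the two encodings differ for bytes 0x18-0x1F and 0x80-0x9F,
/// which a real decoder must map through the PDFDocEncoding table.
pub fn decode_ocg_name(raw: &[u8]) -> String {
    if raw.starts_with(&[0xFE, 0xFF]) {
        // Pair up the bytes after the BOM into big-endian UTF-16 code units.
        let units: Vec<u16> = raw[2..]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        // Lossy decoding replaces unpaired surrogates instead of failing.
        String::from_utf16_lossy(&units)
    } else {
        // Each byte maps to the Unicode code point with the same value.
        raw.iter().map(|&b| b as char).collect()
    }
}
```

A trailing odd byte in a UTF-16BE name is silently dropped by `chunks_exact`; a stricter decoder might report that as a warning instead.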

---

## 3. Usage Dictionary

The `/Usage` dictionary on an OCG supplies structured metadata for automatic visibility determination. The relevant subkeys are:

- **`CreatorInfo`**: `<< /Creator (ApplicationName) /Subtype /Technical >>`. Informational; identifies the originating application and the kind of content on the layer.
- **`Language`**: `<< /Lang (fr-CA) /Preferred /ON >>`. The `/Lang` value is a BCP 47 language tag. `/Preferred` specifies `ON` or `OFF`: whether this is the preferred language layer when automatic language selection is active. Critical for multilingual PDFs: an auto-state rule with event `/View` applied to `/Language` will turn on only the preferred-language layers and turn off others.
- **`Export`**: `<< /ExportState /ON >>`. Recommended layer state when the document is exported to a format that does not support optional content (for example, a raster image). Values: `/ON` or `/OFF`.
- **`Zoom`**: `<< /min 0.5 /max 2.0 >>`. The layer is on when the magnification is at least `min` and below `max`. For extraction, zoom is conventionally treated as 1.0 unless the caller specifies otherwise.
- **`Print`**: `<< /Subtype /Watermark /PrintState /ON >>`. Governs layer state when printing. `/Subtype` names the kind of content (`/Trapping`, `/PrintersMarks`, or `/Watermark`). Watermark layers visible only on print should be excluded from extraction by default.
- **`View`**: `<< /ViewState /ON >>`. Explicitly sets layer state for on-screen viewing. This is the primary signal for extraction; a `/ViewState /OFF` layer is invisible on screen and should be excluded.
- **`User`**: `<< /Type /Ind /Name [(Alice)] >>`. User-based visibility; `/Type` is `/Ind` (individual), `/Ttl` (title), or `/Org` (organisation). Rarely relevant for extraction.
- **`PageElement`**: `<< /Subtype /HF >>`. Marks the layer as containing page elements of a specific functional type. The defined values are `/HF` (header/footer), `/FG` (foreground), `/BG` (background), and `/L` (logo). Extraction policy for HF layers is covered in §9.

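For extraction, the `View` and `Zoom` entries are the two most likely to change what is visible. The sketch below shows one way to turn them into a recommended state. The `Usage` struct and `usage_view_state` function are hypothetical, the zoom range is treated as half-open (at least `min`, below `max`), and combining the two categories conjunctively is an assumption of this sketch rather than a rule taken from the text above.

```rust
/// Hypothetical, simplified view of an OCG's `/Usage` dictionary. Only the
/// two entries consulted below are modelled.
#[derive(Debug, Default, Clone, Copy)]
pub struct Usage {
    /// `/View /ViewState`: Some(true) for /ON, Some(false) for /OFF.
    pub view_state: Option<bool>,
    /// `/Zoom` as (min, max); the parser is assumed to have filled a missing
    /// bound with 0.0 or f64::INFINITY.
    pub zoom: Option<(f64, f64)>,
}

/// State recommended by the usage entries for on-screen viewing at `zoom`.
/// Returns None when neither entry is present, meaning "leave the state alone".
pub fn usage_view_state(usage: &Usage, zoom: f64) -> Option<bool> {
    let by_view = usage.view_state;
    // Half-open range: on when min <= zoom < max.
    let by_zoom = usage.zoom.map(|(min, max)| zoom >= min && zoom < max);
    match (by_view, by_zoom) {
        (None, None) => None,
        // Assumption: when both entries exist, both must recommend "on".
        (v, z) => Some(v.unwrap_or(true) && z.unwrap_or(true)),
    }
}
```

Calling `usage_view_state(&usage, 1.0)` matches the convention of treating zoom as 1.0 during extraction.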

---

## 4. Optional Content Membership Dictionary (OCMD)

An OCMD expresses a boolean combination of OCG states. It is not an OCG itself but acts as a computed visibility gate:

```
<< /Type /OCMD
   /OCGs [ ref1 ref2 ]
   /P /AnyOn
>>
```

- `/OCGs`: a single OCG reference or an array of them. With a single reference and the default policy (`/AnyOn`), the OCMD behaves like a direct OCG reference.
- `/P`: the policy applied to the `/OCGs` set (default `/AnyOn`):
  - `/AllOn`: visible iff every listed OCG is on.
  - `/AllOff`: visible iff every listed OCG is off.
  - `/AnyOn`: visible iff at least one listed OCG is on.
  - `/AnyOff`: visible iff at least one listed OCG is off.
- `/VE`: optional visibility expression array (introduced in PDF 1.6; ISO 32000-2 §8.11.2.2); a recursive boolean expression using the `/And`, `/Or`, and `/Not` operators over OCG references. Implement `/VE` evaluation as a tree walk; fall back to `/P`+`/OCGs` if `/VE` is absent.

Resolving OCMD state from `/P` and `/OCGs`:

```rust
use std::collections::HashMap;

// `Ocmd`, `Policy`, and `ObjId` are the crate's own types (not shown here).
fn resolve_ocmd(ocmd: &Ocmd, states: &HashMap<ObjId, bool>) -> bool {
    // An OCMD that lists no OCGs has no effect on visibility.
    if ocmd.ocgs.is_empty() {
        return true;
    }
    // OCGs missing from the state table default to on.
    let ocg_states: Vec<bool> = ocmd.ocgs.iter()
        .map(|id| *states.get(id).unwrap_or(&true))
        .collect();
    match ocmd.policy {
        Policy::AllOn  => ocg_states.iter().all(|&s| s),
        Policy::AllOff => ocg_states.iter().all(|&s| !s),
        Policy::AnyOn  => ocg_states.iter().any(|&s| s),
        Policy::AnyOff => ocg_states.iter().any(|&s| !s),
    }
}
```

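The `/VE` tree walk mentioned in the list above can be sketched in the same spirit. This is an illustration under stated assumptions, not the library's implementation: `VisExpr` and `eval_ve` are hypothetical names, object ids are plain `u32` values, and OCGs missing from the state table are treated as on.

```rust
use std::collections::HashMap;

/// Parsed form of a `/VE` array (hypothetical type). `u32` stands in for an
/// object id.
pub enum VisExpr {
    /// A leaf: the state of one OCG.
    Ocg(u32),
    /// `[/And e1 e2 ...]`
    And(Vec<VisExpr>),
    /// `[/Or e1 e2 ...]`
    Or(Vec<VisExpr>),
    /// `[/Not e]`
    Not(Box<VisExpr>),
}

/// Recursively evaluate a visibility expression against the OCG state table.
pub fn eval_ve(expr: &VisExpr, states: &HashMap<u32, bool>) -> bool {
    match expr {
        // Assumption: an OCG absent from the table counts as on.
        VisExpr::Ocg(id) => states.get(id).copied().unwrap_or(true),
        VisExpr::And(operands) => operands.iter().all(|e| eval_ve(e, states)),
        VisExpr::Or(operands) => operands.iter().any(|e| eval_ve(e, states)),
        VisExpr::Not(operand) => !eval_ve(operand, states),
    }
}
```

Recursion depth follows the nesting depth of the array, so a parser that builds `VisExpr` from untrusted input may want to cap nesting before evaluation.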

---

## 5. Default Viewing State

The `/D` entry of `/OCProperties` is the default configuration dictionary. It establishes the initial OCG visibility table:

```
/D <<
  /Name (Default)
  /BaseState /OFF
  /ON  [ ref1 ref2 ]
  /OFF [ ref3 ]
  /AS  [ << /Event /View /OCGs [ ref1 ref2 ] /Category [ /View /Zoom ] >> ]
  /Order [ ... ]
  /RBGroups [ ... ]
  /Locked [ ... ]
>>
```

**Computing the initial visible set:**

1. Set all OCGs to the `/BaseState` value (`ON`, `OFF`, or `Unchanged`; for the `/D` entry, `Unchanged` is equivalent to `ON`).
2. Apply the `/ON` array: set each listed OCG to on.
3. Apply the `/OFF` array: set each listed OCG to off. `/ON` and `/OFF` take explicit precedence over `/BaseState`.
4. Process `/AS` (auto-state) entries. Each entry specifies an event (e.g., `/View`), a set of OCGs, and usage categories. For each category, read the corresponding key from the OCG's `/Usage` dictionary and apply the state. For extraction, process only entries with `/Event /View`.

`/RBGroups` defines radio-button groups: at most one OCG in each group may be on at a time. Honor this constraint when applying `/AS` overrides.

`/Locked` lists OCGs whose state may not be changed by the user; treat locked OCGs as fixed at their computed state.

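The four steps above can be sketched as a single function. Everything here is illustrative: `DefaultConfig`, `BaseState`, and `initial_states` are hypothetical names, object ids are plain `u32` values, and step 4 is reduced to the `View` category only (`/RBGroups` and the other categories are left out to keep the sketch short).

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
pub enum BaseState { On, Off, Unchanged }

/// Hypothetical parsed form of the `/D` dictionary.
pub struct DefaultConfig {
    pub base_state: BaseState,
    pub on: Vec<u32>,
    pub off: Vec<u32>,
    /// OCGs named by `/AS` entries whose `/Event` is `/View`.
    pub auto_view: Vec<u32>,
}

/// Build the initial state table. `view_state` maps an OCG id to its
/// `/Usage /View /ViewState` value, for OCGs that have one.
pub fn initial_states(
    all_ocgs: &[u32],
    cfg: &DefaultConfig,
    view_state: &HashMap<u32, bool>,
) -> HashMap<u32, bool> {
    // Step 1: seed every OCG from /BaseState (Unchanged behaves like ON for /D).
    let base = cfg.base_state != BaseState::Off;
    let mut states: HashMap<u32, bool> =
        all_ocgs.iter().map(|&id| (id, base)).collect();

    // Steps 2 and 3: the explicit arrays override the base state.
    for id in &cfg.on {
        if let Some(s) = states.get_mut(id) { *s = true; }
    }
    for id in &cfg.off {
        if let Some(s) = states.get_mut(id) { *s = false; }
    }

    // Step 4 (View category only): apply the usage-recommended state.
    for id in &cfg.auto_view {
        if let (Some(s), Some(&v)) = (states.get_mut(id), view_state.get(id)) {
            *s = v;
        }
    }
    states
}
```

References in `/ON`, `/OFF`, or `/AS` that do not appear in `/OCGs` are ignored here; whether to warn about them is a policy choice for the caller.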

---

## 6. Content Stream Marking

OCGs gate content in content streams through the **Marked Content** mechanism (ISO 32000-2 §14.6). The operator pair is `BDC` / `EMC`. When an OCG or OCMD governs a content region, the marking takes the form:

```
/OC /Lyr1 BDC
  ... text operators ...
EMC
```

where `/Lyr1` is a name that resolves via the page's `/Resources /Properties` dictionary to an OCG or OCMD indirect reference:

```
/Resources <<
  /Properties <<
    /Lyr1 ref_to_ocg_or_ocmd
  >>
>>
```

Alternatively, the OCG dictionary can be inlined directly in the `BDC` property list. This is rare in well-formed PDFs: an inline dictionary is not an indirect object, so it cannot be matched against the `/OCGs` array.

```
/OC << /Type /OCG /Name (English) >> BDC
```

**Nesting.** `BDC`/`EMC` pairs can be nested. A content stream may have an outer OCG marking a section and inner OCG markings for subsections. The visibility rule is conjunctive: a span is visible only if **all** enclosing OCG contexts are visible. Implement this as a stack:

```rust
/// One entry per open marked-content sequence; `true` means the sequence
/// does not hide its content.
struct OcgStack(Vec<bool>);

impl OcgStack {
    fn push(&mut self, visible: bool) { self.0.push(visible); }
    fn pop(&mut self) { self.0.pop(); }
    /// Conjunctive rule: every enclosing entry must be visible.
    fn is_visible(&self) -> bool { self.0.iter().all(|&v| v) }
}
```

On every `BMC` or `BDC`, push one entry: the resolved OCG or OCMD boolean when the tag is `/OC`, and `true` for any other tag. On `EMC`, pop. Pushing for non-`/OC` tags keeps the stack aligned, because `EMC` closes whichever marked-content sequence is innermost, not only `/OC` ones. Text shown while `is_visible()` returns `false` is discarded, but text-state and graphics-state changes made inside the hidden region still take effect.

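To make the push and pop discipline concrete, here is a small self-contained walk over a simplified event list. The `Event` enum and `visible_text` function are illustrative stand-ins for whatever the content-stream interpreter produces; they are not part of any existing API.

```rust
/// Simplified content-stream events (illustrative names only).
pub enum Event<'a> {
    /// `BMC` or `BDC`. `Some(v)` carries the resolved `/OC` visibility;
    /// `None` means the tag was something other than `/OC`.
    Begin(Option<bool>),
    /// `EMC`
    End,
    /// A text-showing operator with its decoded text.
    Text(&'a str),
}

/// Return the text that survives the conjunctive visibility rule.
pub fn visible_text<'a>(events: &[Event<'a>]) -> Vec<&'a str> {
    let mut stack: Vec<bool> = Vec::new();
    let mut out = Vec::new();
    for ev in events {
        match ev {
            // Non-/OC tags push `true` so that every EMC pops exactly one entry.
            Event::Begin(oc) => stack.push(oc.unwrap_or(true)),
            Event::End => {
                stack.pop();
            }
            Event::Text(s) => {
                if stack.iter().all(|&v| v) {
                    out.push(*s);
                }
            }
        }
    }
    out
}
```

An unbalanced `EMC` on an empty stack is ignored here because `Vec::pop` returns `None`; a stricter interpreter could record a warning at that point.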

---

## 7. XObject and Annotation OCG References

**Form XObjects.** A Form XObject (stream with `/Subtype /Form`) may carry an `/OC` entry:

```
<< /Type /XObject /Subtype /Form /OC ref_to_ocg ... >>
```

Before descending into the XObject's content stream to extract text, resolve the `/OC` entry. If the referenced OCG or OCMD is off, skip the entire XObject. This check is independent of any `BDC`/`EMC` marking inside the XObject itself; both must be satisfied for content to be visible.

**Annotations.** Annotation dictionaries also support `/OC`:

```
<< /Type /Annot /Subtype /Widget /OC ref_to_ocg ... >>
```

For annotations with appearance streams (`/AP`), the appearance stream text is visible only if the annotation's `/OC` resolves to on. Text from invisible annotation appearances must be excluded.

---

## 8. Multilingual Layer Pattern

A common authoring pattern places translations of the same document on separate OCGs, one per locale, with identical layout geometry. The Usage dictionary's `/Language` subkey carries the BCP 47 tag:

```
OCG_EN: /Usage << /Language << /Lang (en) /Preferred /ON  >> >>
OCG_FR: /Usage << /Language << /Lang (fr) /Preferred /OFF >> >>
OCG_ES: /Usage << /Language << /Lang (es) /Preferred /OFF >> >>
```

The `/AS` entry in the default configuration fires on `/Event /View` with `/Category [/Language]`, turning on the preferred language layer and turning off others.

For pdftract, extraction policy options:

- **Default locale extraction**: compute the visible set from `/D` (including `/AS` processing); only extract text from the resulting on-layers. The caller gets clean, single-language output.
- **Target locale extraction**: caller specifies a BCP 47 tag; the library overrides the state table to enable only OCGs whose `/Usage/Language/Lang` matches (exact or prefix match, as in the basic filtering scheme of RFC 4647 §3.3.1, the matching half of BCP 47) and disables others before extraction.
- **All-layers extraction**: extract all layers regardless of state; tag each span's `ocg_name` with the layer's `/Name` value. The caller can then filter by locale post-extraction.

When all-layers extraction is active, spans at the same position from different language layers will have overlapping bounding boxes. The caller must deduplicate or select by `ocg_name`.

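The exact-or-prefix match used for target locale extraction can be sketched as follows. `lang_matches` is a hypothetical helper; it compares case-insensitively and only accepts a prefix that ends at a subtag boundary, so `fr` matches `fr-CA` but not `fra`.

```rust
/// Basic language-range filtering: `range` matches `tag` when the two are
/// equal, or when `range` is a prefix of `tag` that ends at a `-` boundary.
/// Comparison is ASCII case-insensitive, and `*` matches every tag.
pub fn lang_matches(range: &str, tag: &str) -> bool {
    if range == "*" {
        return true;
    }
    let range = range.to_ascii_lowercase();
    let tag = tag.to_ascii_lowercase();
    tag == range
        || (tag.starts_with(&range)
            // The byte right after the prefix must be a subtag separator.
            && tag.as_bytes().get(range.len()) == Some(&b'-'))
}
```

With the three OCGs above, a caller asking for `fr` would enable only `OCG_FR`, and a caller asking for `fr-CA` would enable none of them, since the range is more specific than any layer's tag. Whether to fall back to the shorter tag in that case is left to the caller.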

---

## 9. PageElement HF Layers

The `PageElement` usage subtype `/HF` explicitly designates a layer as containing headers and/or footers. This is the PDF specification's own semantic label for running header/footer content.

```
/Usage << /PageElement << /Subtype /HF >> >>
```

Extraction policy for HF layers:

- **Default:** exclude HF-layer content from the primary body text stream; emit it in a separate `headers_footers` bucket or label spans with `zone: HeaderFooter`.
- **Explicit inclusion:** caller opts in via an extraction flag to include HF content merged into body text (rarely useful but sometimes required for form extraction).
- **Detection fallback:** if a layer has no `PageElement` usage entry but its `/Name` matches heuristics like `"Header"`, `"Footer"`, `"Running Head"`, log a warning rather than auto-excluding; only the Usage dictionary is normative.

---

## 10. Extraction Policy

### Default behavior

Extract only content on layers that are **on** in the default viewing state (computed per §5). This matches what a conforming viewer displays. No `ocg_name` metadata is emitted on spans; OCG structure is transparent to the caller.

### Extraction modes

| Mode | Description | `ocg_name` on span |
|---|---|---|
| `DefaultVisible` | Only on-layers per `/D` | absent |
| `TargetLayer(name)` | Only the OCG whose `/Name` matches | absent |
| `TargetLocale(lang)` | Only OCGs matching BCP 47 tag in `/Language` | absent |
| `AllLayers` | All layers regardless of state | present |
| `AllLayersVisible` | Only on-layers, but tagged | present |

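One way to carry the table into code is an enum with small predicate methods. This is a sketch of a possible shape, not the library's actual API: the variant names come from the table, while `tags_spans` and `honors_default_state` are hypothetical method names.

```rust
/// Extraction modes from the table above.
#[derive(Debug, Clone, PartialEq)]
pub enum ExtractionMode {
    DefaultVisible,
    TargetLayer(String),
    TargetLocale(String),
    AllLayers,
    AllLayersVisible,
}

impl ExtractionMode {
    /// Third column of the table: whether spans carry `ocg_name`.
    pub fn tags_spans(&self) -> bool {
        matches!(self, Self::AllLayers | Self::AllLayersVisible)
    }

    /// Whether the state table computed from `/D` decides what is extracted.
    /// The target modes override the table and `AllLayers` ignores it.
    pub fn honors_default_state(&self) -> bool {
        matches!(self, Self::DefaultVisible | Self::AllLayersVisible)
    }
}
```

Keeping the two questions as separate methods means a new mode only needs to answer each one once, instead of being special-cased throughout the extractor.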
### Span metadata

When `ocg_name` tagging is active, each span carries:

```rust
pub struct Span {
    pub text: String,
    pub bbox: Rect,
    pub ocg_name: Option<String>,  // None if not inside any OCG marking
    // ... other fields
}
```

`ocg_name` reflects the **innermost** named OCG in the `BDC` stack at the point the span was extracted. If a span is covered by multiple nested OCG markings, the innermost `/Name` is used; all enclosing states must be on for the span to be included in non-`AllLayers` modes.

### Implementation notes

- Build the OCG state table once per document from `/OCProperties/D`; cache it.
- Reuse the same table for all pages: OCG state is document-scoped, not page-scoped.
- The `/Configs` array provides alternative named configurations (e.g., "Print", "Screen"). Expose these to callers who need to extract against a non-default configuration.
- When `/OCProperties` is absent, treat all content as unconditionally visible (the document has no layers).
- Log unresolvable `/OC` references (dangling indirect refs) as warnings; do not silently discard the content. Treat it as visible to avoid false negatives.

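The last note, on dangling `/OC` references, might look like this in practice. The function name, the `u32` object ids, and the plain `Vec<String>` warning sink are all assumptions made for the sketch; a real implementation would presumably use the library's own structured warning type.

```rust
use std::collections::HashMap;

/// Resolve an `/OC` reference against the cached state table (hypothetical
/// helper). A reference that is not in the table is treated as visible and
/// recorded as a warning, so content is never dropped silently.
pub fn resolve_oc(
    states: &HashMap<u32, bool>,
    id: u32,
    warnings: &mut Vec<String>,
) -> bool {
    match states.get(&id) {
        Some(&on) => on,
        None => {
            warnings.push(format!(
                "unresolvable /OC reference to object {}; treating content as visible",
                id
            ));
            true
        }
    }
}
```

Because the table is built once per document, the same `states` map can be passed to every page's interpreter while `warnings` accumulates across pages.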