pdftract/crates/pdftract-core/src/parser/mod.rs
jedarden 0882962861 feat(pdftract-2ork): implement element-type to block-kind mapping table
Implements Phase 7.1.2: StandardType -> BlockKind mapping for converting
walked StructElem nodes into the BlockKind taxonomy used by Phase 4 output.

Changes:
- Add BlockKind enum with all output block kinds (paragraph, heading with
  level, table, list, list_item, figure, caption, code, block_quote, toc,
  formula, reference, note, form_field_struct, inline, structural_container,
  artifact, unknown)
- Add MappingResult struct bundling block_kind, is_emitted flag, and optional
  diagnostic
- Add structure_type_to_block_kind() function for pure type mapping
- Add map_element_to_block() function as primary mapping API
- Add is_artifact() placeholder for Phase 3.4 marked-content integration
- Add 32 comprehensive unit tests covering all mapping paths

Key features:
- Complete type mapping for all 40+ PDF standard structure types
- Heading level extraction: H->level 1, H1..H6->level 1..6
- Inline elements (Span, Quote) map to Inline (not emitted as blocks)
- Structural containers (Document, Part, Sect, Div, etc.) map to
  StructuralContainer (descend without emitting)
- Unknown types emit diagnostic and fall back to paragraph
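
The mapping rules listed above can be sketched roughly as follows. This is a minimal illustration, not the code in struct_tree: `SketchBlockKind`, `SketchMapping`, and `sketch_map` are hypothetical stand-ins, the input is a plain string rather than the crate's structure-type enum, and only a handful of the standard types are covered.

```rust
// Illustrative sketch only: these types are simplified, hypothetical
// stand-ins and are NOT the definitions in pdftract-core's struct_tree module.

#[derive(Debug, Clone, PartialEq, Eq)]
enum SketchBlockKind {
    Paragraph,
    Heading { level: u8 },
    Table,
    List,
    ListItem,
    Figure,
    Inline,
    StructuralContainer,
}

/// Bundles the mapped kind, whether a block is emitted, and an optional
/// diagnostic message (mirroring the three fields named in the commit message).
#[derive(Debug, Clone, PartialEq, Eq)]
struct SketchMapping {
    block_kind: SketchBlockKind,
    is_emitted: bool,
    diagnostic: Option<String>,
}

/// Maps a structure type name to a block kind using the rules listed above.
fn sketch_map(type_name: &str) -> SketchMapping {
    use SketchBlockKind::*;
    let (block_kind, is_emitted, diagnostic) = match type_name {
        "P" => (Paragraph, true, None),
        // A bare H is treated as a level-1 heading.
        "H" => (Heading { level: 1 }, true, None),
        // H1..H6 carry their level in the second byte of the name.
        "H1" | "H2" | "H3" | "H4" | "H5" | "H6" => {
            let level = type_name.as_bytes()[1] - b'0';
            (Heading { level }, true, None)
        }
        "Table" => (Table, true, None),
        "L" => (List, true, None),
        "LI" => (ListItem, true, None),
        "Figure" => (Figure, true, None),
        // Inline elements are mapped but not emitted as blocks.
        "Span" | "Quote" => (Inline, false, None),
        // Containers are descended into without emitting a block.
        "Document" | "Part" | "Sect" | "Div" => (StructuralContainer, false, None),
        // Unknown types fall back to paragraph and carry a diagnostic.
        other => (
            Paragraph,
            true,
            Some(format!(
                "unknown structure type `{other}`; falling back to paragraph"
            )),
        ),
    };
    SketchMapping { block_kind, is_emitted, diagnostic }
}
```

Keeping the emitted flag next to the kind lets a tree walker decide in one place whether to emit a block or just descend, which is the distinction the Inline and StructuralContainer rules rely on.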

Acceptance criteria:
- Every Standard structure type has a mapping decision
- Critical test: H1/H2 -> heading level 1/2
- Unit tests: list nesting, table grouping, span passthrough
- Unknown-type fallback path emits a diagnostic line

Refs: Plan section 7.1 lines 2552-2553
2026-05-23 17:24:00 -04:00

//! PDF parsing primitives.
//!
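//! This module provides the lexer and object parser for reading PDF documents,
//! along with submodules for cross-reference resolution, the catalog, the page
//! tree, resources, optional content groups, stream decoding, and the structure
//! tree.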
pub mod diagnostic;
pub mod lexer;
pub mod object;
pub mod objstm;
pub mod xref;
pub mod catalog;
pub mod stream;
pub mod secrets;
pub mod pages;
pub mod outline;
pub mod resources;
pub mod ocg;
pub mod struct_tree;
// Re-export from the unified diagnostics module (Phase 1.6)
pub use crate::diagnostics::{Diagnostic, Severity, DiagCode, ObjRef};
pub use object::PdfObject;
pub use objstm::{ObjectStmParser, ObjStmCacheEntry, ObjStmResult, ObjStmError};
pub use xref::{
    XrefResolver, XrefEntry, ResolveError, ResolveResult, XrefSection,
    parse_traditional_xref, parse_xref_stream, merge_hybrid, is_hybrid_trailer,
    LinearizationInfo, detect_linearization, load_xref_linearized, merge_linearized_xrefs,
    load_xref_with_prev_chain,
};
pub use catalog::{Catalog, MarkInfo, PageLabel, PageLabelsTree, PageLabelStyle, parse_catalog};
pub use ocg::{OcProperties, OcGroup, Ocmd, OcmdPolicy, BaseState, parse_oc_properties};
pub use resources::{ResourceDict, merge_resources, extract_resources};
pub use pages::{PageDict, flatten_page_tree, DEFAULT_MEDIABOX};
pub use struct_tree::{
    StructureType, StructElemNode, StructTreeRoot, RoleMap, Kid,
    BlockKind, MappingResult,
    parse_struct_tree, structure_type_to_block_kind, map_element_to_block, is_artifact,
};
pub use stream::{
    StreamDecoder, FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, CryptDecoder, PassthroughDecoder,
    normalize_filter_name, get_decoder, FilterError, DEFAULT_MAX_DECOMPRESS_BYTES,
};