Implements Phase 7.1.2: StandardType -> BlockKind mapping for converting walked StructElem nodes into the BlockKind taxonomy used by Phase 4 output.

Changes:
- Add BlockKind enum with all output block kinds (paragraph, heading with level, table, list, list_item, figure, caption, code, block_quote, toc, formula, reference, note, form_field_struct, inline, structural_container, artifact, unknown)
- Add MappingResult struct bundling block_kind, is_emitted flag, and optional diagnostic
- Add structure_type_to_block_kind() function for pure type mapping
- Add map_element_to_block() function as primary mapping API
- Add is_artifact() placeholder for Phase 3.4 marked-content integration
- Add 32 comprehensive unit tests covering all mapping paths

Key features:
- Complete type mapping for all 40+ PDF standard structure types
- Heading level extraction: H -> level 1, H1..H6 -> level 1..6
- Inline elements (Span, Quote) map to Inline (not emitted as blocks)
- Structural containers (Document, Part, Sect, Div, etc.) map to StructuralContainer (descend without emitting)
- Unknown types emit diagnostic and fall back to paragraph

Acceptance criteria:
- Every Standard structure type has a mapping decision
- Critical test: H1/H2 -> heading level 1/2
- Unit tests: list nesting, table grouping, span passthrough
- Unknown-type fallback path emits a diagnostic line

Refs: Plan section 7.1 lines 2552-2553
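The mapping rules listed in the commit message can be illustrated with a minimal sketch. It is not the crate's actual code: the reduced `BlockKind` variant set, the `MappingResult` field types, and the helper names `heading_level` and `map_type_name` are assumptions made for illustration, and only a handful of structure types are covered.

```rust
// Hypothetical sketch of the rules described above; the real definitions
// live in `struct_tree` and may differ in names, fields, and coverage.

#[derive(Debug, Clone, PartialEq, Eq)]
enum BlockKind {
    Paragraph,
    Heading { level: u8 },
    Table,
    List,
    ListItem,
    Inline,
    StructuralContainer,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct MappingResult {
    block_kind: BlockKind,
    /// False for kinds that are walked through rather than emitted as blocks.
    is_emitted: bool,
    /// Set only on the unknown-type fallback path (assumed to be a plain string here).
    diagnostic: Option<String>,
}

/// Heading level extraction: H -> 1, H1..H6 -> 1..6.
fn heading_level(name: &str) -> Option<u8> {
    match name {
        "H" | "H1" => Some(1),
        "H2" => Some(2),
        "H3" => Some(3),
        "H4" => Some(4),
        "H5" => Some(5),
        "H6" => Some(6),
        _ => None,
    }
}

/// Maps a structure type name to a block kind (assumed helper, subset of types only).
fn map_type_name(name: &str) -> MappingResult {
    if let Some(level) = heading_level(name) {
        return MappingResult {
            block_kind: BlockKind::Heading { level },
            is_emitted: true,
            diagnostic: None,
        };
    }

    let (block_kind, is_emitted) = match name {
        "P" => (BlockKind::Paragraph, true),
        "Table" => (BlockKind::Table, true),
        "L" => (BlockKind::List, true),
        "LI" => (BlockKind::ListItem, true),
        // Inline elements pass through and are not emitted as blocks.
        "Span" | "Quote" => (BlockKind::Inline, false),
        // Containers are descended into without emitting a block.
        "Document" | "Part" | "Sect" | "Div" => (BlockKind::StructuralContainer, false),
        // Unknown types fall back to a paragraph and carry a diagnostic.
        unknown => {
            return MappingResult {
                block_kind: BlockKind::Paragraph,
                is_emitted: true,
                diagnostic: Some(format!(
                    "unknown structure type `{unknown}`; falling back to paragraph"
                )),
            };
        }
    };

    MappingResult { block_kind, is_emitted, diagnostic: None }
}
```

In this sketch, `map_type_name("H2")` yields a level-2 heading, `map_type_name("Span")` yields an `Inline` result with `is_emitted` set to false, and an unrecognized name yields a paragraph with a diagnostic attached.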
41 lines
1.6 KiB
Rust
//! PDF parsing primitives.
//!
//! This module provides the lexer and object parser for reading PDF documents.

pub mod diagnostic;
pub mod lexer;
pub mod object;
pub mod objstm;
pub mod xref;
pub mod catalog;
pub mod stream;
pub mod secrets;
pub mod pages;
pub mod outline;
pub mod resources;
pub mod ocg;
pub mod struct_tree;

// Re-export from the unified diagnostics module (Phase 1.6)
pub use crate::diagnostics::{Diagnostic, Severity, DiagCode, ObjRef};
pub use object::PdfObject;
pub use objstm::{ObjectStmParser, ObjStmCacheEntry, ObjStmResult, ObjStmError};
pub use xref::{
    XrefResolver, XrefEntry, ResolveError, ResolveResult, XrefSection,
    parse_traditional_xref, parse_xref_stream, merge_hybrid, is_hybrid_trailer,
    LinearizationInfo, detect_linearization, load_xref_linearized, merge_linearized_xrefs,
    load_xref_with_prev_chain,
};
pub use catalog::{Catalog, MarkInfo, PageLabel, PageLabelsTree, PageLabelStyle, parse_catalog};
pub use ocg::{OcProperties, OcGroup, Ocmd, OcmdPolicy, BaseState, parse_oc_properties};
pub use resources::{ResourceDict, merge_resources, extract_resources};
pub use pages::{PageDict, flatten_page_tree, DEFAULT_MEDIABOX};
pub use struct_tree::{
    StructureType, StructElemNode, StructTreeRoot, RoleMap, Kid,
    BlockKind, MappingResult,
    parse_struct_tree, structure_type_to_block_kind, map_element_to_block, is_artifact,
};
pub use stream::{
    StreamDecoder, FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, CryptDecoder, PassthroughDecoder,
    normalize_filter_name, get_decoder, FilterError, DEFAULT_MAX_DECOMPRESS_BYTES,
};