feat(pdftract-5og4): implement hybrid xref handler with traditional priority

Implements merge_hybrid() and is_hybrid_trailer() for hybrid PDF files.
Hybrid files have both a traditional xref table at startxref and a
supplementary xref stream pointed to by /XRefStm in the trailer.

Per the PDF spec, the traditional table is authoritative for objects it
covers; the stream's type-2 entries fill gaps not covered by the
traditional table.

Key behaviors:
- Traditional entries override stream entries for same object numbers
- Stream-only type-2 entries are added as gap fill
- Free/InUse conflicts emit STRUCT_HYBRID_CONFLICT diagnostic
- Merged trailer has /XRefStm key removed
- Result XrefSection has is_hybrid: true set

Acceptance criteria:
- Critical test: traditional entries override stream entries (PASS)
- Gap fill: stream-only type-2 entries added (PASS)
- Free/InUse conflict: diagnostic emitted (PASS)
- Non-hybrid trailer: is_hybrid_trailer returns false (PASS)
- proptest: no panics with random combinations (PASS)
- INV-8 maintained: no panics in library code (PASS)

Co-Authored-By: Claude Code <noreply@anthropic.com>
jedarden 2026-05-22 15:43:07 -04:00
parent f7e6ff4173
commit 2a2a247e87
3 changed files with 1219 additions and 3 deletions


@@ -19,10 +19,10 @@ pub mod ocg;
 pub use crate::diagnostics::{Diagnostic, Severity, DiagCode, ObjRef};
 pub use object::{PdfObject};
 pub use objstm::{ObjectStmParser, ObjStmCacheEntry, ObjStmResult, ObjStmError};
-pub use xref::{XrefResolver, XrefEntry, ResolveError, ResolveResult, XrefSection, XrefDiagnostic, XrefDiagCode, parse_traditional_xref};
+pub use xref::{XrefResolver, XrefEntry, ResolveError, ResolveResult, XrefSection, XrefDiagnostic, XrefDiagCode, parse_traditional_xref, parse_xref_stream, merge_hybrid, is_hybrid_trailer};
 pub use catalog::{Catalog, MarkInfo, PageLabel, PageLabelsTree, PageLabelStyle, parse_catalog};
 pub use ocg::{OcProperties, OcGroup, Ocmd, OcmdPolicy, BaseState, parse_oc_properties};
 pub use stream::{
-StreamDecoder, FlateDecoder, LZWDecoder, ASCII85Decoder, ASCIIHexDecoder, CryptDecoder, PassthroughDecoder,
+StreamDecoder, FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, CryptDecoder, PassthroughDecoder,
 normalize_filter_name, get_decoder, FilterError, DEFAULT_MAX_DECOMPRESS_BYTES,
 };

File diff suppressed because it is too large

notes/pdftract-5og4.md Normal file

@ -0,0 +1,69 @@
# pdftract-5og4: Hybrid Xref Handler Implementation
## Summary
Implemented the hybrid xref handler that merges traditional xref tables with xref streams for hybrid PDF files. The traditional table is authoritative for objects it covers; the stream's type-2 entries fill gaps not covered by the traditional table.
## Changes Made
### 1. Added `StructHybridConflict` diagnostic code
- File: `crates/pdftract-core/src/parser/xref.rs`
- Added new variant to `XrefDiagCode` enum for hybrid conflict diagnostics
### 2. Fixed `merge_hybrid` function
- Fixed borrow checker error: the loop was iterating by value (taking ownership) and the code then tried to borrow the same data
- Changed to iterate by reference: `for (obj_nr, entry) in &traditional.entries`
- Updated to use new `XrefDiagCode::StructHybridConflict` diagnostic code
- Removed unused `use crate::diagnostics::DiagCode;` import
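
The borrow fix described above can be illustrated with a minimal sketch. The `Section` type and its `BTreeMap<u32, u64>` field are hypothetical stand-ins chosen for illustration; the real `XrefSection` layout in `xref.rs` is not shown in this commit view and may differ.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-in for the real section type in xref.rs.
struct Section {
    // Object number -> byte offset (simplified for the sketch).
    entries: BTreeMap<u32, u64>,
}

/// Copies every entry into a new map, then reads `traditional` again.
fn copy_entries(traditional: Section) -> (BTreeMap<u32, u64>, usize) {
    let mut merged = BTreeMap::new();
    // `&traditional.entries` yields (&u32, &u64) pairs and only borrows the map.
    // Writing `in traditional.entries` instead would move the map out of
    // `traditional`, and the `.len()` call below would no longer compile.
    for (obj_nr, offset) in &traditional.entries {
        merged.insert(*obj_nr, *offset);
    }
    // Still valid: the loop above did not consume the map.
    let original_len = traditional.entries.len();
    (merged, original_len)
}
```

Iterating by reference costs a copy of each key and value here, which is cheap for small `Copy` types; whether the real entry type is `Copy` is an assumption of this sketch.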
### 3. Updated test
- File: `crates/pdftract-core/src/parser/xref.rs`
- Updated `test_merge_hybrid_free_inuse_conflict` to check for `XrefDiagCode::StructHybridConflict`
- Removed unused `use crate::diagnostics::DiagCode;` import
### 4. Exported public API
- File: `crates/pdftract-core/src/parser/mod.rs`
- Added `merge_hybrid` and `is_hybrid_trailer` to public re-exports
## Acceptance Criteria Status
| Criterion | Status | Notes |
|-----------|--------|-------|
| Critical test passes: traditional entries override stream entries | PASS | `test_merge_hybrid_traditional_priority` |
| Hybrid fixture with stream-only type-2 entries: gap fill works | PASS | `test_merge_hybrid_gap_fill` |
| Free/InUse conflict test: STRUCT_HYBRID_CONFLICT diagnostic emitted | PASS | `test_merge_hybrid_free_inuse_conflict` |
| Non-hybrid trailer (no /XRefStm): merge not invoked | PASS | `is_hybrid_trailer` returns false |
| proptest: random combinations never panic | PASS | `test_merge_hybrid_proptest_simple` |
| INV-8 maintained (no panics in library code) | PASS | All tests pass, no regressions |
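
The non-hybrid criterion in the table above comes down to checking for the `/XRefStm` key before any merge is attempted. A minimal sketch of that gate follows. The `Trailer` alias (key name to integer value) and both function names are hypothetical; the signature of the real `is_hybrid_trailer` is not visible in this commit view.

```rust
use std::collections::HashMap;

// Hypothetical simplified trailer: key name (without the leading slash)
// mapped to an integer value. The real trailer is a PDF dictionary.
type Trailer = HashMap<String, i64>;

/// Returns true when the trailer carries an /XRefStm entry.
fn is_hybrid_trailer_sketch(trailer: &Trailer) -> bool {
    trailer.contains_key("XRefStm")
}

/// Returns the byte offset of the supplementary xref stream, if any.
/// A negative value is treated as absent rather than causing a panic.
fn xref_stream_offset(trailer: &Trailer) -> Option<u64> {
    if !is_hybrid_trailer_sketch(trailer) {
        return None; // Non-hybrid file: the merge is never invoked.
    }
    match trailer.get("XRefStm") {
        Some(&v) if v >= 0 => Some(v as u64),
        _ => None,
    }
}
```

A caller would parse the stream and merge only when `xref_stream_offset` returns `Some`, so plain traditional files skip the merge path entirely.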
## Test Results
All 9 hybrid xref tests pass:
- `test_merge_hybrid_traditional_priority` - traditional entries override stream entries
- `test_merge_hybrid_free_inuse_conflict` - Free/InUse conflict emits diagnostic
- `test_merge_hybrid_gap_fill` - stream-only type-2 entries fill gaps
- `test_merge_hybrid_trailer_xrefstm_removed` - /XRefStm key removed from merged trailer
- `test_is_hybrid_trailer_detection` - hybrid trailer detection works
- `test_merge_hybrid_empty_sections` - edge case: empty sections
- `test_merge_hybrid_stream_only` - edge case: traditional empty, stream has entries
- `test_merge_hybrid_traditional_only` - edge case: stream empty, traditional has entries
- `test_merge_hybrid_proptest_simple` - proptest verifies no panics
## Implementation Notes
The `merge_hybrid` function implements the priority semantics required by the PDF spec:
1. Start with all traditional entries
2. For each stream entry: if the same ObjRef is NOT in the traditional map, insert it
3. If an ObjRef IS in the traditional map (even as a free entry), traditional wins
4. Emit `STRUCT_HYBRID_CONFLICT` diagnostic when traditional Free conflicts with stream InUse
5. The merged trailer is the traditional one with `/XRefStm` key removed
6. The result has `is_hybrid: true` set
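
The numbered steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in `xref.rs`: the `Entry` and `Section` types, the `trailer_keys` set, and the returned list of conflicting object numbers (standing in for emitted diagnostics) are all hypothetical. The sketch also assumes that any non-free stream entry counts as in use for conflict detection.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical simplified types; the real XrefEntry/XrefSection may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Entry {
    Free,
    InUse { offset: u64 },
    Compressed { stream_obj: u32, index: u32 },
}

#[derive(Debug, Default)]
struct Section {
    entries: BTreeMap<u32, Entry>,
    trailer_keys: BTreeSet<String>,
    is_hybrid: bool,
}

/// Merges a traditional table with its supplementary stream.
/// Returns the merged section and the object numbers that hit a
/// Free/InUse conflict (a stand-in for STRUCT_HYBRID_CONFLICT diagnostics).
fn merge_hybrid_sketch(traditional: &Section, stream: &Section) -> (Section, Vec<u32>) {
    // Step 1: start with all traditional entries.
    let mut entries = traditional.entries.clone();
    let mut conflicts = Vec::new();

    for (obj_nr, stream_entry) in &stream.entries {
        match traditional.entries.get(obj_nr) {
            // Step 2: not covered by the traditional table, so gap fill.
            None => {
                entries.insert(*obj_nr, *stream_entry);
            }
            // Steps 3-4: traditional Free wins, but record the conflict
            // (assumption: any non-free stream entry counts as in use).
            Some(Entry::Free) if !matches!(stream_entry, Entry::Free) => {
                conflicts.push(*obj_nr);
            }
            // Step 3: traditional wins silently in every other case.
            Some(_) => {}
        }
    }

    // Step 5: keep the traditional trailer, minus /XRefStm.
    let mut trailer_keys = traditional.trailer_keys.clone();
    trailer_keys.remove("XRefStm");

    // Step 6: mark the result as hybrid.
    (Section { entries, trailer_keys, is_hybrid: true }, conflicts)
}
```

Returning the conflicts as a plain `Vec<u32>` keeps the sketch self-contained; the actual function reports them through `XrefDiagCode::StructHybridConflict`.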
## Files Modified
- `crates/pdftract-core/src/parser/xref.rs` - Added diagnostic code, fixed merge function, updated tests
- `crates/pdftract-core/src/parser/mod.rs` - Exported public API functions
## Git Commits
- `fix(pdftract-5og4): add StructHybridConflict diagnostic code and fix merge_hybrid borrow error`