The invisible text filter in serialize_page_text() was always recomputing block text from spans, but when block.spans is empty (no span data available), this produced empty text for all blocks. Added fallback to use pre-computed block.text when span data is missing, maintaining backward compatibility. Also added special case for figure blocks to always emit empty text regardless of span data.

All 111 text module tests pass, including all invisible text filtering tests for Tr=0-7 and include_invisible=true/false combinations.

Acceptance criteria PASS:
- rendering_mode 3 excluded by default: ✓
- rendering_mode 3 included when flagged: ✓
- Mixed block emits visible: ✓
- All-invisible block produces empty (no spurious \n\n): ✓
- Tr=4 treated same as Tr=3: ✓

Closes pdftract-38p8h
pdftract-38p8h: Invisible Text Filter
Work Completed
Fixed invisible text filtering implementation in /home/coding/pdftract/crates/pdftract-core/src/text.rs. The implementation was already present but had a bug that caused backward compatibility issues with existing tests.
Changes Made
File: /home/coding/pdftract/crates/pdftract-core/src/text.rs
Added fallback logic in serialize_page_text() function (lines 225-237):
- When block.spans is empty, fall back to using pre-computed block.text for backward compatibility
- When block.spans is non-empty, recompute text from spans with invisible filtering (correct behavior)
- Added special case for figure blocks to always emit empty text (lines 226-227)
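The three cases above can be sketched roughly as follows. This is a minimal illustration, not the code in text.rs: the Span, Block and BlockKind types, their fields, and the helper name block_text_for_output are all hypothetical, and joining span texts without a separator is an assumption.

```rust
// Hypothetical, simplified types; the real definitions in text.rs may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BlockKind {
    Paragraph,
    Figure,
}

#[derive(Debug, Clone)]
pub struct Span {
    pub text: String,
    /// PDF text rendering mode (the Tr operator), 0 through 7.
    pub rendering_mode: u8,
}

#[derive(Debug, Clone)]
pub struct Block {
    pub kind: BlockKind,
    /// Pre-computed text, possibly produced before span data existed.
    pub text: String,
    pub spans: Vec<Span>,
}

/// Picks the text to emit for one block (hypothetical helper name).
pub fn block_text_for_output(block: &Block, include_invisible: bool) -> String {
    if block.kind == BlockKind::Figure {
        // Figure blocks always emit empty text, regardless of span data.
        return String::new();
    }
    if block.spans.is_empty() {
        // No span data: fall back to the pre-computed text.
        return block.text.clone();
    }
    // Span data present: recompute from spans, dropping Tr >= 3 unless flagged.
    // Assumption: span texts are concatenated without a separator.
    block
        .spans
        .iter()
        .filter(|s| include_invisible || s.rendering_mode < 3)
        .map(|s| s.text.as_str())
        .collect()
}
```

Checking the figure case first means a figure block stays empty even if it carries pre-computed text, which matches the "regardless of span data" wording in the commit message.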
Implementation Details
The invisible text filter works as follows:
- SPAN-level filtering (not block-level):
  - should_include_span() checks each span's rendering_mode
  - Tr=0-2: visible (fill, stroke, fill+stroke)
  - Tr=3-7: invisible (excluded by default)
- Block text recomputation:
  - compute_block_text_from_spans() joins visible span texts
  - If all spans in a block are invisible, produces empty string
  - Empty blocks are skipped (no spurious \n\n)
- Backward compatibility:
  - When block.spans is empty, uses pre-computed block.text
  - This allows old tests to pass while supporting new span-based filtering
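A rough sketch of how the span-level filter and the block recomputation fit together is shown below. Only the names should_include_span and compute_block_text_from_spans come from these notes; the signatures, the Span type, the join_blocks helper, and the choice to concatenate span texts without a separator are assumptions made for illustration.

```rust
// Hypothetical, simplified type; the real Span in pdftract-core may differ.
#[derive(Debug, Clone)]
pub struct Span {
    pub text: String,
    /// PDF text rendering mode (the Tr operator), 0 through 7.
    pub rendering_mode: u8,
}

/// Tr 0-2 is treated as visible; Tr 3-7 is kept only when include_invisible is set.
pub fn should_include_span(span: &Span, include_invisible: bool) -> bool {
    include_invisible || span.rendering_mode <= 2
}

/// Concatenates the texts of the spans that pass the filter.
/// Returns an empty string when every span is filtered out.
pub fn compute_block_text_from_spans(spans: &[Span], include_invisible: bool) -> String {
    spans
        .iter()
        .filter(|s| should_include_span(s, include_invisible))
        .map(|s| s.text.as_str())
        .collect()
}

/// Hypothetical helper: joins per-block texts with a blank line and skips
/// blocks whose text came out empty, so no spurious "\n\n" is emitted.
pub fn join_blocks(blocks: &[Vec<Span>], include_invisible: bool) -> String {
    blocks
        .iter()
        .map(|spans| compute_block_text_from_spans(spans, include_invisible))
        .filter(|text| !text.is_empty())
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Filtering empty block texts before joining is what keeps an all-invisible block from leaving a doubled separator between its neighbours.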
Acceptance Criteria Status
PASS ✓
- rendering_mode 3 + include_invisible false: excluded
  - Test: test_should_include_span_invisible_mode_3_excluded_by_default
  - Spans with Tr=3 return false from should_include_span() when include_invisible=false
- Same with include_invisible true: included
  - Test: test_should_include_span_invisible_mode_3_included_when_flagged
  - Spans with Tr=3 return true from should_include_span() when include_invisible=true
- Mixed block: visible emitted
  - Test: test_compute_block_text_from_spans_mixed_visibility
  - Block with Tr=0 and Tr=3 spans emits only visible span text
- All-invisible block: no spurious \n\n
  - Tests: test_compute_block_text_from_spans_all_invisible_excluded, test_serialize_page_text_all_invisible_block_omitted
  - Block with only Tr=3/4/5/6/7 spans produces empty string, skipped
- Tr=4: treated same as Tr=3
  - Tests: test_should_include_span_invisible_mode_4/5/6/7_excluded_by_default
  - All Tr>=3 spans are filtered out by default
Tests Passed
All 111 text module tests pass, including:
- All invisible text filtering tests (Tr=0-7, include_invisible true/false)
- All backward compatibility tests (empty spans, pre-computed text)
- All block kind filtering tests (headers, footers, watermarks)
Verification
cargo nextest run --package pdftract-core --lib text
# 111 tests run: 111 passed, 2293 skipped
Notes
The include_invisible option was already defined in OutputOptions (options.rs) and TextOptions (text.rs). The filtering logic was already implemented but had a bug where it always recomputed text from spans without a fallback for when span data was missing. The fix adds a fallback to use pre-computed text when block.spans is empty, maintaining backward compatibility with existing code and tests.