pdftract/notes/pdftract-2j4zl.md
jedarden 98964e06fe fix(pdftract-2j4zl): fix header/footer duplicate counting bug
The detect_headers_and_footers function was incrementing classified_count
every time a block was classified, even if it was already classified from
a previous sliding window iteration. With 10 pages and identical headers,
blocks on pages 1-9 would be reclassified multiple times (31 classifications
instead of 10).

Fixed by checking if block is already "header" or "footer" before incrementing
the counter.

All 25 header_footer tests now pass.

Refs: pdftract-2j4zl

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 00:04:13 -04:00


Verification Note: pdftract-2j4zl (Header/footer cross-page dedup)

Summary

Fixed a bug in the header/footer detection algorithm: a block was counted more than once when overlapping sliding windows classified it from different start pages.

What Was Done

Bug Fix

File: crates/pdftract-core/src/layout/header_footer.rs

Issue: detect_headers_and_footers incremented classified_count every time a block was classified, even when an earlier iteration had already classified it as a header or footer. With a sliding window of 4 pages across 10 pages with identical headers, blocks on pages 1-9 were reclassified multiple times:

  • Start page 0: classify pages 0-9 (10 classifications)
  • Start page 1: reclassify pages 1-9 (9 duplicate classifications)
  • Start page 2: reclassify pages 2-9 (8 duplicate classifications)
  • ...resulting in 31 total classifications instead of 10.

Fix: Added a check so that a block is classified and counted only if it is not already a "header" or "footer":

// Only count and classify if not already a header/footer
let current_kind = pages[page_idx][matching_idx].kind.as_str();
if current_kind != "header" && current_kind != "footer" {
    pages[page_idx][matching_idx].kind = kind.to_string();
    classified_count += 1;
}
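
The effect of the guard can be reproduced in isolation. The following is a minimal sketch, not the crate's code: `Block`, `classify_once`, and `simulate` are hypothetical names, and the loop is a simplified model in which every start page revisits all pages from itself to the end of the run. The exact unguarded total depends on how the real loop bounds its start pages, so the sketch only shows that the unguarded counter exceeds the page count while the guarded one equals it.

```rust
/// Hypothetical stand-in for the real block type; only `kind` matters here.
#[derive(Clone, Debug)]
struct Block {
    kind: String,
}

/// Classify a block and bump the counter only on its first transition to a
/// header/footer kind. Returns true if the block was newly classified.
fn classify_once(block: &mut Block, kind: &str, classified_count: &mut usize) -> bool {
    let current = block.kind.as_str();
    if current == "header" || current == "footer" {
        return false; // already classified by an earlier pass: do not recount
    }
    block.kind = kind.to_string();
    *classified_count += 1;
    true
}

/// Simplified model of the overlapping passes (an assumption, not the real
/// loop): each start page visits every page from itself to the end.
/// Returns (unguarded_count, guarded_count).
fn simulate(page_count: usize) -> (usize, usize) {
    let mut pages = vec![Block { kind: "text".to_string() }; page_count];
    let mut unguarded = 0;
    let mut guarded = 0;
    for start in 0..page_count {
        for page_idx in start..page_count {
            unguarded += 1; // old behaviour: count every visit
            classify_once(&mut pages[page_idx], "header", &mut guarded);
        }
    }
    (unguarded, guarded)
}

fn main() {
    let (unguarded, guarded) = simulate(10);
    println!("unguarded = {unguarded}, guarded = {guarded}");
}
```

With the guard in place the counter matches the number of distinct blocks classified, regardless of how many windows overlap a page.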

Acceptance Criteria

PASS

  • 10 pages with identical "ACME Corp" in top 7%: all 10 classified as Headers
  • 3 pages with identical "Confidential" in bottom 7%: all 3 classified as Footers
  • 2 pages identical, 8 without: NOT classified (3+ consecutive required)
  • Different columns: NOT matched (position check fails)
  • Char-level Levenshtein used (Vec<char> with generic_levenshtein)
  • 5% threshold enforced
  • 7% page-height window correctly implemented (0.93 for top, 0.07 for bottom)
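
The 7% window criterion can be illustrated with a small zone check. This is a sketch under stated assumptions: the `Zone` enum and `classify_zone` function are hypothetical names, coordinates are assumed to be PDF-style (origin at the bottom-left, `y0` the lower edge and `y1` the upper edge of a block), and testing the top zone first is an arbitrary choice for blocks that span both zones.

```rust
/// Hypothetical zone label; the real crate's types may differ.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Zone {
    Top,
    Bottom,
    Body,
}

/// Classify a block by vertical position, assuming a bottom-left origin.
/// Top zone: upper edge at or above 93% of the page height.
/// Bottom zone: lower edge at or below 7% of the page height.
fn classify_zone(y0: f64, y1: f64, page_height: f64) -> Zone {
    if y1 >= 0.93 * page_height {
        Zone::Top
    } else if y0 <= 0.07 * page_height {
        Zone::Bottom
    } else {
        Zone::Body
    }
}

fn main() {
    let page_height = 792.0; // US Letter height in points
    println!("{:?}", classify_zone(758.0, 770.0, page_height));
    println!("{:?}", classify_zone(30.0, 42.0, page_height));
    println!("{:?}", classify_zone(400.0, 412.0, page_height));
}
```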

Test Results

All 25 tests in layout::header_footer pass:

  • test_detect_headers_and_footers_sliding_window - 10 pages, all classified correctly (was failing before fix)
  • test_detect_headers_and_footers_three_pages_identical_header - 3 pages, all classified
  • test_detect_headers_and_footers_three_pages_similar_footer - 3 pages footer test
  • test_detect_headers_and_footers_two_pages_not_classified - 2 pages threshold test
  • test_detect_headers_and_footers_different_columns_not_matched - column position test
  • All zone classification tests (top 7%, bottom 7%, body)
  • All text similarity tests (char-level Levenshtein, 5% threshold)
  • All position matching tests (same column, full-width, different y-range)

Implementation Details

The existing implementation already included all required functionality:

  1. Sequential post-processing pass after rayon page assembly
  2. Sliding window of 4 pages (pages [i, i+1, i+2, i+3])
  3. Position windows: top 7% (y1 >= 0.93 * page_height) and bottom 7% (y0 <= 0.07 * page_height)
  4. strsim generic_levenshtein with Vec<char> for Unicode char-level comparison
  5. 5% Levenshtein threshold (distance <= max_len * 0.05)
  6. 3+ consecutive pages required for classification
  7. Same-position matching (y-range AND column or full-width)
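
Items 4 and 5 can be sketched without external crates. The hand-rolled `char_levenshtein` below is a std-only stand-in for strsim's generic_levenshtein, used here only so the example is self-contained. `is_similar` is a hypothetical helper name, and taking `max_len` as the longer of the two char counts is an assumption about how the threshold is applied.

```rust
/// Levenshtein distance over Unicode scalar values (chars), not bytes.
/// Std-only stand-in for a library implementation.
fn char_levenshtein(a: &[char], b: &[char]) -> usize {
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1)   // deletion from `a`
                .min(curr[j] + 1);      // insertion into `a`
            curr.push(best);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Two strings count as similar when distance <= max_len * 0.05.
fn is_similar(a: &str, b: &str) -> bool {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let max_len = a.len().max(b.len());
    let distance = char_levenshtein(&a, &b);
    (distance as f64) <= (max_len as f64) * 0.05
}

fn main() {
    println!("{}", is_similar("ACME Corp", "ACME Corp"));
    println!("{}", is_similar("Page 1 of 10", "Page 2 of 10"));
}
```

Under this reading of the threshold, a single-character difference is tolerated only in strings of at least 20 characters, so short page-number lines that change on every page do not match each other.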

The only change needed was fixing the duplicate counting bug.

References

  • Plan section: Phase 4.4 Sequencing note (line 1703)