pdftract/tests/error_recovery/fixtures/gen_missing_endobj.py
jedarden 4d6fd8a4ab test(pdftract-4w0v4): implement adversarial test corpus + integration harness
Add 7 adversarial PDF fixtures exercising Phase 1 error-recovery paths:
- xref_30pct_bad_offsets.pdf: 100 objects, 30 bad xref offsets
- missing_mediabox_all_pages.pdf: 10 pages, no /MediaBox at any level
- missing_endobj.pdf: object 5 missing endobj marker
- truncated_mid_stream.pdf: FlateDecode stream truncated mid-decompression
- int_overflow_bbox.pdf: /BBox value 99999999999999999 (i32 overflow)
- nested_failure.pdf: every page has at least one diagnostic
- combined_failures.pdf: combines multiple failure modes (keystone INV-8 test)

Each fixture has a sibling .expected_diagnostics.json file with threshold
counts (>= not == per EC-07/EC-09 to tolerate drift).

Integration test harness (error_recovery_integration.rs):
- assert_diagnostic_count_at_least() helper for threshold checking
- assert_no_panic() helper using std::panic::catch_unwind for INV-8
- Individual test functions for each fixture
- Cumulative test_inv_8_no_panics_across_all_fixtures()

All 8 tests pass. INV-8 verified: zero panics across all fixtures.

Closes: pdftract-4w0v4
2026-05-25 14:30:24 -04:00

53 lines
1.2 KiB
Python

#!/usr/bin/env python3
"""Generate missing_endobj.pdf - a PDF where object 5 is missing its endobj marker."""
PDF_CONTENT = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R 4 0 R 5 0 R] /Count 3 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 6 0 R /Resources << /Font << /F1 7 0 R >> >> >>
endobj
4 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 6 0 R /Resources << /Font << /F1 7 0 R >> >> >>
endobj
5 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Contents 6 0 R /Resources << /Font << /F1 7 0 R >> >> >>
6 0 obj
<< /Length 44 >>
stream
BT
/F1 12 Tf
100 700 Td
(Page 6) Tj
ET
endstream
endobj
7 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj
xref
0 8
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000131 00000 n
0000000274 00000 n
0000000417 00000 n
0000000560 00000 n
0000000643 00000 n
trailer
<< /Size 8 /Root 1 0 R >>
startxref
718
%%EOF
"""
with open('missing_endobj.pdf', 'wb') as f:
f.write(PDF_CONTENT)
print("Generated missing_endobj.pdf")
print("Object 5 is missing its 'endobj' marker - parser should recover and parse objects 6+")