pdftract/tests/fingerprint/fixtures/debug_content_streams.py
jedarden 928a64ebc9 [pdftract-ef6xz]: Complete fingerprint reproducibility test corpus
All 8 fixture pairs verified present:
- byte_identical/ (MATCH)
- acrobat_resave/ (MATCH)
- qpdf_resave/ (MATCH)
- pdftk_resave/ (MATCH)
- linearization_toggle/ (MATCH - KU-7)
- metadata_only/ (MATCH - ADR-008)
- content_edit_one_glyph/ (DIFFER)
- content_edit_one_paragraph/ (DIFFER)

Test file implements:
- INV-3: 100-invocation reproducibility test
- All 8 fixture pair tests
- INV-13: Format validation
- Cross-platform placeholder (CI integration pending)

All critical tests from Phase 1.7 (plan lines 1232-1237) implemented.

Closes pdftract-ef6xz
Verification: notes/pdftract-ef6xz.md

Refs:
- INV-3, INV-13, KU-7, ADR-008
- Plan Phase 1.7 lines 1214-1219, 1232-1237

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 13:32:26 -04:00

#!/usr/bin/env python3
"""Debug content stream extraction without decompression."""
import hashlib

import pikepdf

FIXTURE_DIR = "tests/fingerprint/fixtures/content_edit_one_glyph"


def dump_content_stream(label, page):
    """Print details of a page's /Contents stream and return its bytes, or None."""
    print(f"=== {label} ===")
    contents = page.get("/Contents")
    if not isinstance(contents, pikepdf.Stream):
        # /Contents can also be an array of streams, or absent.
        print("/Contents is not a single stream; nothing to dump")
        return None
    # Note: read_bytes() returns the data after the stream's filters are applied;
    # read_raw_bytes() would return the bytes exactly as stored in the file.
    data = contents.read_bytes()
    print(f"Stream length: {len(data)}")
    print(f"Raw stream (bytes): {data}")
    print(f"Raw stream (text): {data.decode('latin-1')}")
    print(f"MD5: {hashlib.md5(data).hexdigest()}")
    return data


# Check the content of the two PDFs
with pikepdf.open(f"{FIXTURE_DIR}/v1.pdf") as pdf1, \
        pikepdf.open(f"{FIXTURE_DIR}/v2.pdf") as pdf2:
    # Get the content stream of the first page of each file
    data1 = dump_content_stream("v1.pdf", pdf1.pages[0])
    print()
    data2 = dump_content_stream("v2.pdf", pdf2.pages[0])

    print("\n=== Difference ===")
    if data1 is None or data2 is None:
        print("Cannot compare: at least one page has no single content stream")
    else:
        print(f"Streams are identical: {data1 == data2}")
        print(f"v1 has 'World': {b'World' in data1}")
        print(f"v2 has 'World': {b'World' in data2}")
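
The final check in the script only reports whether the two streams are equal. For a one-glyph edit it can help to see where they diverge. The following is a minimal sketch, assuming both streams are already available as `bytes`. The helper name `first_difference` and the sample streams are hypothetical and are not part of the fixture script.

```python
def first_difference(a: bytes, b: bytes, context: int = 8):
    """Return (offset, a_slice, b_slice) around the first differing byte.

    Returns None when the two byte strings are identical.
    """
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            break
    else:
        # No mismatch in the common prefix: either equal, or one is longer.
        if len(a) == len(b):
            return None
        i = limit
    start = max(0, i - context)
    return i, a[start:i + context], b[start:i + context]


# Hypothetical content streams that differ by a single glyph.
v1_stream = b"BT (Hello World) Tj ET"
v2_stream = b"BT (Hello Wor1d) Tj ET"

result = first_difference(v1_stream, v2_stream, context=4)
if result is None:
    print("Streams are identical")
else:
    offset, v1_slice, v2_slice = result
    print(f"First difference at byte {offset}: {v1_slice!r} vs {v2_slice!r}")
```

In the script, `data1` and `data2` could be passed in place of the sample streams once both are known to be non-empty results.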