pdftract/tests/fixtures/PROVENANCE.md
jedarden b115b5a677 fix(bf-512z1): fix encoding fixture ground truth and add provenance
- no-mapping.txt: fix garbled unicode to correct 'ABC' output
- shape-match.txt: fix from 'Shape' to 'S' (actual PDF content)
- Add PROVENANCE.md entries for all 4 encoding fixtures
- PDFs remain unchanged (already valid)

Fixes ground truth for Level 2-4 Unicode recovery fixtures:
- no-mapping.pdf: PDF with no ToUnicode, no standard encoding
- agl-only.pdf: PDF with AGL glyph names only
- fingerprint-match.pdf: PDF with embedded font for fingerprint matching
- shape-match.pdf: PDF with subset font for shape recognition

Closes bf-512z1
2026-06-09 01:13:51 -04:00

EC-04-rc4-encrypted.pdf

Generated by tests/fixtures/generate_encrypted_fixtures.py
PDF 1.7, RC4 encryption (V=1, R=2), 40-bit key, user password: "user40"
Generated: 2026-05-28

EC-05-aes128-encrypted.pdf

Generated by tests/fixtures/generate_encrypted_fixtures.py
PDF 1.7, AES-128 encryption (V=2, R=3), 128-bit key, user password: "user128"
Generated: 2026-05-28

EC-06-aes256-encrypted.pdf

Generated by tests/fixtures/generate_encrypted_fixtures.py
PDF 2.0, AES-256 encryption (V=5, R=5), 256-bit key, user password: "user256"
Generated: 2026-05-28

EC-empty-password.pdf

Generated by tests/fixtures/generate_encrypted_fixtures.py
PDF 1.7, no encryption (control fixture)
Generated: 2026-05-28

sample.pdf

Copied from valid-minimal.pdf for SDK examples default path
Minimal valid PDF v1.4 fixture for contract method examples
Generated: 2026-05-31

json_schema/simple_invoice.pdf

Simple invoice PDF for JSON schema validation tests
Generated: 2026-06-01

json_schema/EC-04-rc4-encrypted.pdf

Copied from fixtures/EC-04-rc4-encrypted.pdf for JSON schema validation
PDF 1.7, RC4 encryption (V=1, R=2), 40-bit key, user password: "user40"
Generated: 2026-06-01

json_schema/EC-05-aes128-encrypted.pdf

Copied from fixtures/EC-05-aes128-encrypted.pdf for JSON schema validation
PDF 1.7, AES-128 encryption (V=2, R=3), 128-bit key, user password: "user128"
Generated: 2026-06-01

json_schema/valid-minimal.pdf

Minimal valid PDF v1.4 fixture for JSON schema validation tests
Generated: 2026-05-28

json_schema/sample.pdf

Copied from valid-minimal.pdf for SDK examples default path
Minimal valid PDF v1.4 fixture for contract method examples
Generated: 2026-05-31

vector/academic-paper/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Academic paper on machine learning - Abstract, Introduction, Methods, Results, Conclusion
Generated: 2026-06-01

vector/technical-documentation/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
API documentation with Getting Started, Authentication, Endpoints, Rate Limits
Generated: 2026-06-01

vector/legal-contract/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Service Agreement with Services, Term, Compensation, Confidentiality, Termination, Governing Law
Generated: 2026-06-01

vector/scientific-report/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Climate Research Report with Executive Summary, Data Collection, Analysis, Findings, Recommendations
Generated: 2026-06-01

vector/user-manual/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Product User Manual with Quick Start Guide, Unboxing, Setup, Features, Troubleshooting, Support
Generated: 2026-06-01

vector/financial-report/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Q1 Financial Report with Revenue, Expenses, Net Income, Outlook, Risk Factors
Generated: 2026-06-01

vector/conference-proceedings/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Conference Proceedings with Keynote Address, Paper Session, Panel Discussion, Workshop
Generated: 2026-06-01

vector/medical-research/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Clinical Trial Results with Background, Methodology, Results, Discussion, Conclusion
Generated: 2026-06-01

vector/multi-page-academic/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Multi-page academic paper (3 pages) - Abstract, Introduction, Conclusion
Generated: 2026-06-01

vector/code-documentation/source.pdf

Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
Code library documentation with Installation, Quick Example, API Reference, Supported Formats, Limitations, License
Generated: 2026-06-01
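
The vector fixtures above exist for CER (character error rate) testing. This file does not say how the test suite computes CER, so the following is only a minimal sketch that assumes the common definition: Levenshtein edit distance between extracted text and ground truth, divided by the ground-truth length. The function names are hypothetical and are not taken from pdftract.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from hypothesis
                current[j - 1] + 1,      # extra character in hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition a clean vector PDF should score at or near 0.0 against its ground-truth text, and one wrong character in a four-character reference scores 0.25.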

scanned/receipt/receipt-300dpi.pdf

Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
Source PDF for scan simulation at 300 DPI
Supermarket receipt with items, prices, totals (Helvetica 10pt, Letter, 14pt line spacing)
Generated: 2026-06-01

scanned/receipt/receipt-300dpi-scanned.pdf

Generated by pdftoppm + img2pdf from receipt-300dpi.pdf at 300 DPI
Scan simulation for OCR testing (rasterized image-only PDF)
Generated: 2026-06-01

scanned/documents/invoice-300dpi.pdf

Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
Source PDF for scan simulation at 300 DPI
Service invoice with line items, totals, payment terms (Helvetica 11pt, Letter, 16pt line spacing)
Generated: 2026-06-01

scanned/documents/invoice-300dpi-scanned.pdf

Generated by pdftoppm + img2pdf from invoice-300dpi.pdf at 300 DPI
Scan simulation for OCR testing (rasterized image-only PDF)
Generated: 2026-06-01

scanned/documents/form-300dpi.pdf

Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
Source PDF for scan simulation at 300 DPI
Employment application form with fields and checkboxes (Helvetica 11pt, Letter, 18pt line spacing)
Generated: 2026-06-01

scanned/documents/form-300dpi-scanned.pdf

Generated by pdftoppm + img2pdf from form-300dpi.pdf at 300 DPI
Scan simulation for OCR testing (rasterized image-only PDF)
Generated: 2026-06-01

scanned/multi-page/doc-10page-300dpi.pdf

Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
Source PDF for scan simulation at 300 DPI (10 pages with diverse content)
Times-Roman 12pt, Letter, 18pt line spacing, "Page N:" markers
Generated: 2026-06-01

scanned/multi-page/doc-10page-300dpi-scanned.pdf

Generated by pdftoppm + img2pdf from doc-10page-300dpi.pdf at 300 DPI
Scan simulation for OCR testing (rasterized image-only PDF, 10 pages)
Generated: 2026-06-01
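
The "-scanned" fixtures above are described as pdftoppm + img2pdf output at 300 DPI. The exact invocation is not recorded here, so this is a hedged sketch of how such a pipeline might be scripted. The helper names and the PNG intermediate format are assumptions. Only the `-r` and `-png` options of pdftoppm and the `-o` option of img2pdf are relied on.

```python
import subprocess
from pathlib import Path


def rasterize_command(source_pdf, image_prefix, dpi=300):
    """Build the pdftoppm call that renders every page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(source_pdf), str(image_prefix)]


def assemble_command(page_images, output_pdf):
    """Build the img2pdf call that wraps page images into one image-only PDF."""
    return ["img2pdf", *[str(p) for p in page_images], "-o", str(output_pdf)]


def simulate_scan(source_pdf, work_dir, dpi=300):
    """Rasterize source_pdf and reassemble it as <stem>-scanned.pdf (hypothetical helper)."""
    source_pdf, work_dir = Path(source_pdf), Path(work_dir)
    prefix = work_dir / source_pdf.stem
    subprocess.run(rasterize_command(source_pdf, prefix, dpi), check=True)
    # pdftoppm appends "-<page number>" to the prefix; sorting by name is
    # assumed to preserve page order because the numbers are zero-padded.
    pages = sorted(work_dir.glob(f"{source_pdf.stem}-*.png"))
    output_pdf = work_dir / f"{source_pdf.stem}-scanned.pdf"
    subprocess.run(assemble_command(pages, output_pdf), check=True)
    return output_pdf
```

Calling `simulate_scan("receipt-300dpi.pdf", "out")` would need both tools on PATH and an existing `out` directory. The two command builders can be inspected without running anything.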

scanned/receipt/receipt-300dpi.pdf

Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
Source PDF for scan simulation at 300 DPI
Simple sales receipt with itemized list and totals (Helvetica 11pt, 6.5" x 4", 14pt line spacing)
Generated: 2026-06-01

scanned/documents/invoice-300dpi.pdf

Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
Source PDF for scan simulation at 300 DPI
Business invoice with line items, subtotal, tax, and total (Helvetica 11pt, Letter, 16pt line spacing)
Generated: 2026-06-01

json_schema/simple-text.pdf

Minimal text-only PDF for JSON schema validation tests
Generated: 2026-06-01

encoding/no-mapping.pdf

Generated by tests/fixtures/generate_encoding_fixtures.rs
PDF 1.4, Type1 font with custom glyph names, no ToUnicode CMap, no standard encoding
Level 4 Unicode recovery test fixture (worst case: no encoding fallback)
Content: "ABC" (extracted via glyph shape recognition)
Generated: 2026-06-09

encoding/agl-only.pdf

Generated by tests/fixtures/generate_encoding_fixtures.rs
PDF 1.4, Type1 font with AGL glyph names only, no ToUnicode CMap
Level 2 Unicode recovery test fixture (Adobe Glyph List fallback)
Content: "Hello\nWorld" (extracted via AGL glyph name mapping)
Generated: 2026-06-09
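
To illustrate what AGL glyph name mapping involves, here is a rough sketch based on the published Adobe Glyph List conventions: drop any suffix after the first period, split on underscores, then resolve each component through the glyph list or the `uniXXXX` / `uXXXX` forms. It is not pdftract's implementation. `AGL_EXCERPT` is a hand-picked subset of the real list, and the function name is hypothetical.

```python
import re

# Tiny excerpt for illustration only; the real glyph list has thousands of entries.
AGL_EXCERPT = {
    "space": " ", "H": "H", "e": "e", "l": "l", "o": "o",
    "W": "W", "r": "r", "d": "d", "Aacute": "\u00c1",
}

_UNI_FORM = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uni0041, uni00660069
_U_FORM = re.compile(r"u([0-9A-F]{4,6})")         # u0041, u1F600


def glyph_name_to_text(name, table=AGL_EXCERPT):
    """Map one glyph name to text; unknown components contribute nothing."""
    base = name.split(".", 1)[0]  # "e.alt" -> "e"
    pieces = []
    for component in base.split("_"):  # ligature-style names such as "H_e"
        if component in table:
            pieces.append(table[component])
        elif (match := _UNI_FORM.fullmatch(component)):
            digits = match.group(1)
            pieces.extend(chr(int(digits[i:i + 4], 16))
                          for i in range(0, len(digits), 4))
        elif (match := _U_FORM.fullmatch(component)):
            pieces.append(chr(int(match.group(1), 16)))
        # The surrogate-range exclusions from the specification are omitted here.
    return "".join(pieces)
```

With this excerpt, the glyph names `H`, `e`, `l`, `l`, `o` resolve to the first line of the fixture's expected content.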

encoding/fingerprint-match.pdf

Generated by tests/fixtures/generate_encoding_fixtures.rs
PDF 1.4, embedded Type1 font subset, no ToUnicode CMap
Level 3 Unicode recovery test fixture (SHA-256 font fingerprint matching)
Content: "Test" (extracted via font-fingerprints.json fingerprint lookup)
Generated: 2026-06-09
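
SHA-256 fingerprint matching can be pictured as hashing the embedded font program and using the digest as a lookup key. The layout of font-fingerprints.json is not documented in this file, so the sketch below assumes a hypothetical shape: hex digest mapped to a table from character code to Unicode string. All names and sample bytes are illustrative.

```python
import hashlib


def font_fingerprint(font_bytes):
    """Hex SHA-256 digest of the embedded font program bytes."""
    return hashlib.sha256(font_bytes).hexdigest()


def decode_with_fingerprint(codes, font_bytes, fingerprints):
    """Decode character codes if the font is known; return None otherwise.

    `fingerprints` is an assumed structure: {digest: {code_as_str: text}}.
    """
    mapping = fingerprints.get(font_fingerprint(font_bytes))
    if mapping is None:
        return None  # caller would fall through to the next recovery level
    return "".join(mapping.get(str(code), "\ufffd") for code in codes)


# Hypothetical stand-ins for an embedded font subset and a database entry.
sample_font = b"%!PS-AdobeFont-1.0: HypotheticalSubset"
sample_database = {
    font_fingerprint(sample_font): {"1": "T", "2": "e", "3": "s", "4": "t"},
}
```

Here `decode_with_fingerprint([1, 2, 3, 4], sample_font, sample_database)` yields "Test", while any font whose bytes differ produces a different digest and therefore no match.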

encoding/shape-match.pdf

Generated by tests/fixtures/generate_encoding_fixtures.rs
PDF 1.4, Type1 font with custom glyph names, no ToUnicode CMap
Level 4 Unicode recovery test fixture (glyph shape recognition from glyph-shapes.json)
Content: "S" (extracted via glyph shape database lookup)
Generated: 2026-06-09