- no-mapping.txt: fix garbled Unicode to correct 'ABC' output
- shape-match.txt: fix from 'Shape' to 'S' (actual PDF content)
- Add PROVENANCE.md entries for all 4 encoding fixtures
- PDFs remain unchanged (already valid)

Fixes ground truth for Level 2-4 Unicode recovery fixtures:
- no-mapping.pdf: PDF with no ToUnicode, no standard encoding
- agl-only.pdf: PDF with AGL glyph names only
- fingerprint-match.pdf: PDF with embedded font for fingerprint matching
- shape-match.pdf: PDF with subset font for shape recognition

Closes bf-512z1

EC-04-rc4-encrypted.pdf
- Generated by tests/fixtures/generate_encrypted_fixtures.py
- PDF 1.7, RC4 encryption (V=1, R=2), 40-bit key, user password: "user40"
- Generated: 2026-05-28

EC-05-aes128-encrypted.pdf
- Generated by tests/fixtures/generate_encrypted_fixtures.py
- PDF 1.7, AES-128 encryption (V=2, R=3), 128-bit key, user password: "user128"
- Generated: 2026-05-28

EC-06-aes256-encrypted.pdf
- Generated by tests/fixtures/generate_encrypted_fixtures.py
- PDF 2.0, AES-256 encryption (V=5, R=5), 256-bit key, user password: "user256"
- Generated: 2026-05-28

EC-empty-password.pdf
- Generated by tests/fixtures/generate_encrypted_fixtures.py
- PDF 1.7, no encryption (control fixture)
- Generated: 2026-05-28

sample.pdf
- Copied from valid-minimal.pdf for SDK examples default path
- Minimal valid PDF v1.4 fixture for contract method examples
- Generated: 2026-05-31

json_schema/simple_invoice.pdf
- Simple invoice PDF for JSON schema validation tests
- Generated: 2026-06-01

json_schema/EC-04-rc4-encrypted.pdf
- Copied from fixtures/EC-04-rc4-encrypted.pdf for JSON schema validation
- PDF 1.7, RC4 encryption (V=1, R=2), 40-bit key, user password: "user40"
- Generated: 2026-06-01

json_schema/EC-05-aes128-encrypted.pdf
- Copied from fixtures/EC-05-aes128-encrypted.pdf for JSON schema validation
- PDF 1.7, AES-128 encryption (V=2, R=3), 128-bit key, user password: "user128"
- Generated: 2026-06-01

json_schema/valid-minimal.pdf
- Minimal valid PDF v1.4 fixture for JSON schema validation tests
- Generated: 2026-05-28

json_schema/sample.pdf
- Copied from valid-minimal.pdf for SDK examples default path
- Minimal valid PDF v1.4 fixture for contract method examples
- Generated: 2026-05-31

vector/academic-paper/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Academic paper on machine learning: Abstract, Introduction, Methods, Results, Conclusion
- Generated: 2026-06-01

vector/technical-documentation/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- API documentation with Getting Started, Authentication, Endpoints, Rate Limits
- Generated: 2026-06-01

vector/legal-contract/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Service Agreement with Services, Term, Compensation, Confidentiality, Termination, Governing Law
- Generated: 2026-06-01

vector/scientific-report/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Climate Research Report with Executive Summary, Data Collection, Analysis, Findings, Recommendations
- Generated: 2026-06-01

vector/user-manual/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Product User Manual with Quick Start Guide, Unboxing, Setup, Features, Troubleshooting, Support
- Generated: 2026-06-01

vector/financial-report/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Q1 Financial Report with Revenue, Expenses, Net Income, Outlook, Risk Factors
- Generated: 2026-06-01

vector/conference-proceedings/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Conference Proceedings with Keynote Address, Paper Session, Panel Discussion, Workshop
- Generated: 2026-06-01

vector/medical-research/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Clinical Trial Results with Background, Methodology, Results, Discussion, Conclusion
- Generated: 2026-06-01

vector/multi-page-academic/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Multi-page academic paper (3 pages): Abstract, Introduction, Conclusion
- Generated: 2026-06-01

vector/code-documentation/source.pdf
- Generated by tests/fixtures/vector/generate_vector_cer_corpus.py
- Clean vector PDF with embedded text for CER testing (PDF 1.4, Type1 Helvetica, WinAnsiEncoding)
- Code library documentation with Installation, Quick Example, API Reference, Supported Formats, Limitations, License
- Generated: 2026-06-01

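
The vector fixtures above exist for CER (character error rate) testing. The scoring code itself is not recorded here, so the following is only a minimal sketch, assuming the usual definition of CER as Levenshtein edit distance divided by the reference length. The function names `levenshtein` and `cer` are illustrative, not taken from the test suite.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against reference (assumed definition)."""
    if not reference:
        # Convention chosen for this sketch: empty reference scores 0 or 1.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

With this definition, a perfect extraction scores 0.0, and the value can exceed 1.0 when the extracted text is much longer than the reference.
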
scanned/receipt/receipt-300dpi.pdf
- Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
- Source PDF for scan simulation at 300 DPI
- Supermarket receipt with items, prices, totals (Helvetica 10pt, Letter, 14pt line spacing)
- Generated: 2026-06-01

scanned/receipt/receipt-300dpi-scanned.pdf
- Generated by pdftoppm + img2pdf from receipt-300dpi.pdf at 300 DPI
- Scan simulation for OCR testing (rasterized image-only PDF)
- Generated: 2026-06-01

scanned/documents/invoice-300dpi.pdf
- Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
- Source PDF for scan simulation at 300 DPI
- Service invoice with line items, totals, payment terms (Helvetica 11pt, Letter, 16pt line spacing)
- Generated: 2026-06-01

scanned/documents/invoice-300dpi-scanned.pdf
- Generated by pdftoppm + img2pdf from invoice-300dpi.pdf at 300 DPI
- Scan simulation for OCR testing (rasterized image-only PDF)
- Generated: 2026-06-01

scanned/documents/form-300dpi.pdf
- Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
- Source PDF for scan simulation at 300 DPI
- Employment application form with fields and checkboxes (Helvetica 11pt, Letter, 18pt line spacing)
- Generated: 2026-06-01

scanned/documents/form-300dpi-scanned.pdf
- Generated by pdftoppm + img2pdf from form-300dpi.pdf at 300 DPI
- Scan simulation for OCR testing (rasterized image-only PDF)
- Generated: 2026-06-01

scanned/multi-page/doc-10page-300dpi.pdf
- Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
- Source PDF for scan simulation at 300 DPI (10 pages with diverse content)
- Times-Roman 12pt, Letter, 18pt line spacing, "Page N:" markers
- Generated: 2026-06-01

scanned/multi-page/doc-10page-300dpi-scanned.pdf
- Generated by pdftoppm + img2pdf from doc-10page-300dpi.pdf at 300 DPI
- Scan simulation for OCR testing (rasterized image-only PDF, 10 pages)
- Generated: 2026-06-01

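
The "-scanned" entries above name only the tools (pdftoppm, img2pdf) and the resolution; the exact command lines are not recorded. One plausible way to reproduce the step, sketched here under the assumption that pages are rasterized to PNG and recombined in page order, is shown below. The helper names are hypothetical, and both CLIs must be installed for `simulate_scan` to run.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(source, prefix, dpi=300):
    """argv that rasterizes every page of source to <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(source), str(prefix)]


def img2pdf_command(pages, output):
    """argv that wraps the page images, in the given order, into one PDF."""
    return ["img2pdf", *[str(p) for p in pages], "-o", str(output)]


def simulate_scan(source, output, dpi=300):
    """Rasterize source and rebuild it as an image-only PDF (assumed workflow)."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_command(source, prefix, dpi), check=True)
        # pdftoppm zero-pads page numbers, so a lexical sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        subprocess.run(img2pdf_command(pages, output), check=True)


if __name__ == "__main__":
    simulate_scan("receipt-300dpi.pdf", "receipt-300dpi-scanned.pdf")
```

Keeping the command construction in separate functions makes the recipe easy to record next to a fixture and to check without running the tools.
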
scanned/receipt/receipt-300dpi.pdf
- Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
- Source PDF for scan simulation at 300 DPI
- Simple sales receipt with itemized list and totals (Helvetica 11pt, 6.5" x 4", 14pt line spacing)
- Generated: 2026-06-01

scanned/documents/invoice-300dpi.pdf
- Generated by tests/fixtures/scanned/generate_scanned_fixtures.py
- Source PDF for scan simulation at 300 DPI
- Business invoice with line items, subtotal, tax, and total (Helvetica 11pt, Letter, 16pt line spacing)
- Generated: 2026-06-01

json_schema/simple-text.pdf
- Minimal text-only PDF for JSON schema validation tests
- Generated: 2026-06-01

encoding/no-mapping.pdf
- Generated by tests/fixtures/generate_encoding_fixtures.rs
- PDF 1.4, Type1 font with custom glyph names, no ToUnicode CMap, no standard encoding
- Level 4 Unicode recovery test fixture (worst case: no encoding fallback)
- Content: "ABC" (extracted via glyph shape recognition)
- Generated: 2026-06-09

encoding/agl-only.pdf
- Generated by tests/fixtures/generate_encoding_fixtures.rs
- PDF 1.4, Type1 font with AGL glyph names only, no ToUnicode CMap
- Level 2 Unicode recovery test fixture (Adobe Glyph List fallback)
- Content: "Hello\nWorld" (extracted via AGL glyph name mapping)
- Generated: 2026-06-09

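
The agl-only.pdf entry relies on mapping glyph names to Unicode through the Adobe Glyph List. The extractor's own lookup is not shown in this file, so the sketch below is only an illustration of the idea: `AGL_EXCERPT` is a hand-picked handful of real AGL entries (the full list has thousands), and `glyph_name_to_unicode` is a hypothetical helper that also handles the single-group `uniXXXX` and `uXXXX` to `uXXXXXX` name forms. Ligature names joined with underscores and multi-group `uni` names are left out.

```python
import re

# Small excerpt of the Adobe Glyph List, for illustration only.
AGL_EXCERPT = {
    "space": " ", "period": ".", "comma": ",", "hyphen": "-",
    "zero": "0", "one": "1", "two": "2",
}


def _from_hex(digits: str):
    """Return the character for a hex code point, or None if not a valid scalar."""
    code = int(digits, 16)
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return None
    return chr(code)


def glyph_name_to_unicode(name: str):
    """Map one glyph name to a character, or None when no rule applies."""
    base = name.split(".", 1)[0]  # drop a variant suffix such as ".alt"
    if base in AGL_EXCERPT:
        return AGL_EXCERPT[base]
    if len(base) == 1 and base.isascii() and base.isalpha():
        return base  # AGL maps the names "A".."Z" and "a".."z" to themselves
    match = re.fullmatch(r"uni([0-9A-F]{4})", base)
    if match:
        return _from_hex(match.group(1))
    match = re.fullmatch(r"u([0-9A-F]{4,6})", base)
    if match:
        return _from_hex(match.group(1))
    return None


names = ["H", "e", "l", "l", "o"]
print("".join(glyph_name_to_unicode(n) for n in names))  # prints Hello
```

A name that falls through every rule returns None, which is the point at which a later recovery level would have to take over.
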
encoding/fingerprint-match.pdf
- Generated by tests/fixtures/generate_encoding_fixtures.rs
- PDF 1.4, embedded Type1 font subset, no ToUnicode CMap
- Level 3 Unicode recovery test fixture (SHA-256 font fingerprint matching)
- Content: "Test" (extracted via font-fingerprints.json fingerprint lookup)
- Generated: 2026-06-09

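
The fingerprint-match.pdf entry describes matching an embedded font by its SHA-256 digest against font-fingerprints.json. The real schema of that file is not given here, so the sketch below assumes a hypothetical layout: a JSON object keyed by hex digest, each value mapping character codes (as strings, since JSON keys are strings) to Unicode text. Which bytes are hashed is also an assumption; this sketch hashes the whole embedded font program.

```python
import hashlib
import json


def font_fingerprint(font_program: bytes) -> str:
    """SHA-256 hex digest of the embedded font program bytes."""
    return hashlib.sha256(font_program).hexdigest()


def decode_with_fingerprint(font_program: bytes, codes, fingerprint_db):
    """Decode character codes via the map stored for this font, if any.

    Returns None when the font is unknown; codes missing from a known
    font's map become U+FFFD.
    """
    code_map = fingerprint_db.get(font_fingerprint(font_program))
    if code_map is None:
        return None
    return "".join(code_map.get(str(code), "\ufffd") for code in codes)


# Stand-in bytes and database contents, invented for this example.
font = b"hypothetical embedded font program bytes"
db_text = json.dumps({font_fingerprint(font): {"1": "T", "2": "e", "3": "s", "4": "t"}})
db = json.loads(db_text)  # same shape as loading a JSON file from disk

print(decode_with_fingerprint(font, [1, 2, 3, 4], db))  # prints Test
```

Because the lookup is an exact digest match, any change to the embedded bytes, such as a different subset, misses the database entirely.
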
encoding/shape-match.pdf
- Generated by tests/fixtures/generate_encoding_fixtures.rs
- PDF 1.4, Type1 font with custom glyph names, no ToUnicode CMap
- Level 4 Unicode recovery test fixture (glyph shape recognition from glyph-shapes.json)
- Content: "S" (extracted via glyph shape database lookup)
- Generated: 2026-06-09