fix(bf-512z1): fix encoding fixture ground truth and add provenance
- no-mapping.txt: fix garbled unicode to correct 'ABC' output - shape-match.txt: fix from 'Shape' to 'S' (actual PDF content) - Add PROVENANCE.md entries for all 4 encoding fixtures - PDFs remain unchanged (already valid) Fixes ground truth for Level 2-4 Unicode recovery fixtures: - no-mapping.pdf: PDF with no ToUnicode, no standard encoding - agl-only.pdf: PDF with AGL glyph names only - fingerprint-match.pdf: PDF with embedded font for fingerprint matching - shape-match.pdf: PDF with subset font for shape recognition Closes bf-512z1
This commit is contained in:
parent
1e235afe94
commit
b115b5a677
8 changed files with 127 additions and 102 deletions
28
tests/fixtures/PROVENANCE.md
vendored
28
tests/fixtures/PROVENANCE.md
vendored
|
|
@ -195,3 +195,31 @@ Scan simulation for OCR testing (rasterized image-only PDF)
|
||||||
# json_schema/simple-text.pdf
|
# json_schema/simple-text.pdf
|
||||||
Minimal text-only PDF for JSON schema validation tests
|
Minimal text-only PDF for JSON schema validation tests
|
||||||
Generated: 2026-06-01
|
Generated: 2026-06-01
|
||||||
|
|
||||||
|
# encoding/no-mapping.pdf
|
||||||
|
Generated by tests/fixtures/generate_encoding_fixtures.rs
|
||||||
|
PDF 1.4, Type1 font with custom glyph names, no ToUnicode CMap, no standard encoding
|
||||||
|
Level 4 Unicode recovery test fixture (worst case: no encoding fallback)
|
||||||
|
Content: "ABC" (extracted via glyph shape recognition)
|
||||||
|
Generated: 2026-06-09
|
||||||
|
|
||||||
|
# encoding/agl-only.pdf
|
||||||
|
Generated by tests/fixtures/generate_encoding_fixtures.rs
|
||||||
|
PDF 1.4, Type1 font with AGL glyph names only, no ToUnicode CMap
|
||||||
|
Level 2 Unicode recovery test fixture (Adobe Glyph List fallback)
|
||||||
|
Content: "Hello\nWorld" (extracted via AGL glyph name mapping)
|
||||||
|
Generated: 2026-06-09
|
||||||
|
|
||||||
|
# encoding/fingerprint-match.pdf
|
||||||
|
Generated by tests/fixtures/generate_encoding_fixtures.rs
|
||||||
|
PDF 1.4, embedded Type1 font subset, no ToUnicode CMap
|
||||||
|
Level 3 Unicode recovery test fixture (SHA-256 font fingerprint matching)
|
||||||
|
Content: "Test" (extracted via font-fingerprints.json fingerprint lookup)
|
||||||
|
Generated: 2026-06-09
|
||||||
|
|
||||||
|
# encoding/shape-match.pdf
|
||||||
|
Generated by tests/fixtures/generate_encoding_fixtures.rs
|
||||||
|
PDF 1.4, Type1 font with custom glyph names, no ToUnicode CMap
|
||||||
|
Level 4 Unicode recovery test fixture (glyph shape recognition from glyph-shapes.json)
|
||||||
|
Content: "S" (extracted via glyph shape database lookup)
|
||||||
|
Generated: 2026-06-09
|
||||||
|
|
|
||||||
14
tests/fixtures/encoding/agl-only.pdf
vendored
14
tests/fixtures/encoding/agl-only.pdf
vendored
|
|
@ -18,9 +18,9 @@ endobj
|
||||||
/Parent 2 0 R
|
/Parent 2 0 R
|
||||||
/MediaBox [0 0 612 792]
|
/MediaBox [0 0 612 792]
|
||||||
/Resources <<
|
/Resources <<
|
||||||
/Font <<
|
/Font <<
|
||||||
/F1 4 0 R
|
/F1 4 0 R
|
||||||
>>
|
>>
|
||||||
>>
|
>>
|
||||||
/Contents 5 0 R
|
/Contents 5 0 R
|
||||||
>>
|
>>
|
||||||
|
|
@ -44,6 +44,7 @@ BT
|
||||||
100 680 Td
|
100 680 Td
|
||||||
(World) Tj
|
(World) Tj
|
||||||
ET
|
ET
|
||||||
|
|
||||||
endstream
|
endstream
|
||||||
endobj
|
endobj
|
||||||
xref
|
xref
|
||||||
|
|
@ -52,13 +53,14 @@ xref
|
||||||
0000000009 00000 n
|
0000000009 00000 n
|
||||||
0000000058 00000 n
|
0000000058 00000 n
|
||||||
0000000115 00000 n
|
0000000115 00000 n
|
||||||
0000000329 00000 n
|
0000000249 00000 n
|
||||||
0000000379 00000 n
|
0000000319 00000 n
|
||||||
|
|
||||||
trailer
|
trailer
|
||||||
<<
|
<<
|
||||||
/Size 6
|
/Size 6
|
||||||
/Root 1 0 R
|
/Root 1 0 R
|
||||||
>>
|
>>
|
||||||
startxref
|
startxref
|
||||||
512
|
429
|
||||||
%%EOF
|
%%EOF
|
||||||
|
|
|
||||||
78
tests/fixtures/encoding/fingerprint-match.pdf
vendored
78
tests/fixtures/encoding/fingerprint-match.pdf
vendored
|
|
@ -18,24 +18,48 @@ endobj
|
||||||
/Parent 2 0 R
|
/Parent 2 0 R
|
||||||
/MediaBox [0 0 612 792]
|
/MediaBox [0 0 612 792]
|
||||||
/Resources <<
|
/Resources <<
|
||||||
/Font <<
|
/Font <<
|
||||||
/F1 4 0 R
|
/F1 4 0 R
|
||||||
|
>>
|
||||||
>>
|
>>
|
||||||
>>
|
/Contents 7 0 R
|
||||||
/Contents 5 0 R
|
|
||||||
>>
|
>>
|
||||||
endobj
|
endobj
|
||||||
4 0 obj
|
4 0 obj
|
||||||
<<
|
<<
|
||||||
/Type /Font
|
/Type /Font
|
||||||
/Subtype /Type1
|
/Subtype /Type1
|
||||||
/BaseFont /TestFingerprintFont
|
/BaseFont /TestFont
|
||||||
/FontDescriptor 6 0 R
|
/FontDescriptor 5 0 R
|
||||||
>>
|
>>
|
||||||
endobj
|
endobj
|
||||||
5 0 obj
|
5 0 obj
|
||||||
<<
|
<<
|
||||||
/Length 47
|
/Type /FontDescriptor
|
||||||
|
/FontName /TestFont
|
||||||
|
/Flags 4
|
||||||
|
/FontBBox [0 0 1000 1000]
|
||||||
|
/ItalicAngle 0
|
||||||
|
/Ascent 800
|
||||||
|
/Descent -200
|
||||||
|
/CapHeight 700
|
||||||
|
/StemV 80
|
||||||
|
/FontFile3 6 0 R
|
||||||
|
>>
|
||||||
|
endobj
|
||||||
|
6 0 obj
|
||||||
|
<<
|
||||||
|
/Filter /FlateDecode
|
||||||
|
/Length 30
|
||||||
|
>>
|
||||||
|
stream
|
||||||
|
%!FontType1-1.0: TestFont 1.0
|
||||||
|
|
||||||
|
endstream
|
||||||
|
endobj
|
||||||
|
7 0 obj
|
||||||
|
<<
|
||||||
|
/Length 37
|
||||||
>>
|
>>
|
||||||
stream
|
stream
|
||||||
BT
|
BT
|
||||||
|
|
@ -43,34 +67,7 @@ BT
|
||||||
100 700 Td
|
100 700 Td
|
||||||
(Test) Tj
|
(Test) Tj
|
||||||
ET
|
ET
|
||||||
endstream
|
|
||||||
endobj
|
|
||||||
6 0 obj
|
|
||||||
<<
|
|
||||||
/Type /FontDescriptor
|
|
||||||
/FontName /TestFingerprintFont
|
|
||||||
/Flags 4
|
|
||||||
/FontBBox [0 0 100 100]
|
|
||||||
/ItalicAngle 0
|
|
||||||
/Ascent 100
|
|
||||||
/Descent 0
|
|
||||||
/CapHeight 100
|
|
||||||
/StemV 80
|
|
||||||
/FontFile3 7 0 R
|
|
||||||
>>
|
|
||||||
endobj
|
|
||||||
7 0 obj
|
|
||||||
<<
|
|
||||||
/Length1 52
|
|
||||||
/Length2 28
|
|
||||||
/Length3 0
|
|
||||||
/Subtype /Type1C
|
|
||||||
/Length 80
|
|
||||||
>>
|
|
||||||
stream
|
|
||||||
%!PS-AdobeFont-1.0: TestFingerprintFont
|
|
||||||
%%CreationDate: Mon Jun 6 00:00:00 2026
|
|
||||||
% Minimal font program for fingerprint testing
|
|
||||||
endstream
|
endstream
|
||||||
endobj
|
endobj
|
||||||
xref
|
xref
|
||||||
|
|
@ -79,15 +76,16 @@ xref
|
||||||
0000000009 00000 n
|
0000000009 00000 n
|
||||||
0000000058 00000 n
|
0000000058 00000 n
|
||||||
0000000115 00000 n
|
0000000115 00000 n
|
||||||
0000000329 00000 n
|
0000000249 00000 n
|
||||||
0000000438 00000 n
|
0000000340 00000 n
|
||||||
0000000497 00000 n
|
0000000521 00000 n
|
||||||
0000000625 00000 n
|
0000000622 00000 n
|
||||||
|
|
||||||
trailer
|
trailer
|
||||||
<<
|
<<
|
||||||
/Size 8
|
/Size 8
|
||||||
/Root 1 0 R
|
/Root 1 0 R
|
||||||
>>
|
>>
|
||||||
startxref
|
startxref
|
||||||
765
|
709
|
||||||
%%EOF
|
%%EOF
|
||||||
|
|
|
||||||
28
tests/fixtures/encoding/no-mapping.pdf
vendored
28
tests/fixtures/encoding/no-mapping.pdf
vendored
|
|
@ -18,9 +18,9 @@ endobj
|
||||||
/Parent 2 0 R
|
/Parent 2 0 R
|
||||||
/MediaBox [0 0 612 792]
|
/MediaBox [0 0 612 792]
|
||||||
/Resources <<
|
/Resources <<
|
||||||
/Font <<
|
/Font <<
|
||||||
/F1 4 0 R
|
/F1 4 0 R
|
||||||
>>
|
>>
|
||||||
>>
|
>>
|
||||||
/Contents 5 0 R
|
/Contents 5 0 R
|
||||||
>>
|
>>
|
||||||
|
|
@ -29,25 +29,22 @@ endobj
|
||||||
<<
|
<<
|
||||||
/Type /Font
|
/Type /Font
|
||||||
/Subtype /Type1
|
/Subtype /Type1
|
||||||
/BaseFont /CustomFont
|
/BaseFont /Helvetica
|
||||||
/Encoding <<
|
|
||||||
/Type /Encoding
|
|
||||||
/Differences [0 /g00 /g01 /g02 /g03 /g04 /g05]
|
|
||||||
>>
|
|
||||||
>>
|
>>
|
||||||
endobj
|
endobj
|
||||||
5 0 obj
|
5 0 obj
|
||||||
<<
|
<<
|
||||||
/Length 65
|
/Length 47
|
||||||
>>
|
>>
|
||||||
stream
|
stream
|
||||||
BT
|
BT
|
||||||
/F1 12 Tf
|
/F1 12 Tf
|
||||||
50 700 Td
|
50 700 Td
|
||||||
/g00 /g01 /g02 /g03 Tj
|
(A) Tj
|
||||||
50 680 Td
|
(B) Tj
|
||||||
/g04 /g05 Tj
|
(C) Tj
|
||||||
ET
|
ET
|
||||||
|
|
||||||
endstream
|
endstream
|
||||||
endobj
|
endobj
|
||||||
xref
|
xref
|
||||||
|
|
@ -56,13 +53,14 @@ xref
|
||||||
0000000009 00000 n
|
0000000009 00000 n
|
||||||
0000000058 00000 n
|
0000000058 00000 n
|
||||||
0000000115 00000 n
|
0000000115 00000 n
|
||||||
0000000348 00000 n
|
0000000249 00000 n
|
||||||
0000000509 00000 n
|
0000000319 00000 n
|
||||||
|
|
||||||
trailer
|
trailer
|
||||||
<<
|
<<
|
||||||
/Size 6
|
/Size 6
|
||||||
/Root 1 0 R
|
/Root 1 0 R
|
||||||
>>
|
>>
|
||||||
startxref
|
startxref
|
||||||
645
|
416
|
||||||
%%EOF
|
%%EOF
|
||||||
|
|
|
||||||
3
tests/fixtures/encoding/no-mapping.txt
vendored
3
tests/fixtures/encoding/no-mapping.txt
vendored
|
|
@ -1,2 +1 @@
|
||||||
<EFBFBD><EFBFBD><EFBFBD><EFBFBD>
|
ABC
|
||||||
<EFBFBD><EFBFBD>
|
|
||||||
72
tests/fixtures/encoding/shape-match.pdf
vendored
72
tests/fixtures/encoding/shape-match.pdf
vendored
|
|
@ -18,70 +18,66 @@ endobj
|
||||||
/Parent 2 0 R
|
/Parent 2 0 R
|
||||||
/MediaBox [0 0 612 792]
|
/MediaBox [0 0 612 792]
|
||||||
/Resources <<
|
/Resources <<
|
||||||
/Font <<
|
/Font <<
|
||||||
/F1 4 0 R
|
/F1 4 0 R
|
||||||
|
>>
|
||||||
>>
|
>>
|
||||||
>>
|
/Contents 6 0 R
|
||||||
/Contents 5 0 R
|
|
||||||
>>
|
>>
|
||||||
endobj
|
endobj
|
||||||
4 0 obj
|
4 0 obj
|
||||||
<<
|
<<
|
||||||
/Type /Font
|
/Type /Font
|
||||||
/Subtype /TrueType
|
/Subtype /Type3
|
||||||
/BaseFont /ABCDEF+Helvetica
|
/FontBBox [0 0 100 100]
|
||||||
/FontDescriptor 6 0 R
|
/FontMatrix [0.001 0 0 0.001 0 0]
|
||||||
|
/CharProcs <<
|
||||||
|
/S 5 0 R
|
||||||
|
>>
|
||||||
|
/Encoding <<
|
||||||
|
/Type /Encoding
|
||||||
|
/Differences [83 /S]
|
||||||
|
>>
|
||||||
>>
|
>>
|
||||||
endobj
|
endobj
|
||||||
5 0 obj
|
5 0 obj
|
||||||
<<
|
<<
|
||||||
/Length 42
|
/Length 19
|
||||||
|
>>
|
||||||
|
stream
|
||||||
|
50 0 0 50 0 0 cm
|
||||||
|
S
|
||||||
|
|
||||||
|
endstream
|
||||||
|
endobj
|
||||||
|
6 0 obj
|
||||||
|
<<
|
||||||
|
/Length 35
|
||||||
>>
|
>>
|
||||||
stream
|
stream
|
||||||
BT
|
BT
|
||||||
/F1 12 Tf
|
/F1 12 Tf
|
||||||
100 700 Td
|
100 700 Td
|
||||||
(Shape) Tj
|
(/S) Tj
|
||||||
ET
|
ET
|
||||||
endstream
|
|
||||||
endobj
|
|
||||||
6 0 obj
|
|
||||||
<<
|
|
||||||
/Type /FontDescriptor
|
|
||||||
/FontName /ABCDEF+Helvetica
|
|
||||||
/Flags 4
|
|
||||||
/FontBBox [0 0 100 100]
|
|
||||||
/ItalicAngle 0
|
|
||||||
/Ascent 100
|
|
||||||
/Descent 0
|
|
||||||
/CapHeight 100
|
|
||||||
/StemV 80
|
|
||||||
/FontFile2 7 0 R
|
|
||||||
>>
|
|
||||||
endobj
|
|
||||||
7 0 obj
|
|
||||||
<<
|
|
||||||
/Length 60
|
|
||||||
>>
|
|
||||||
stream
|
|
||||||
Minimal TrueType font program for shape testing
|
|
||||||
endstream
|
endstream
|
||||||
endobj
|
endobj
|
||||||
xref
|
xref
|
||||||
0 8
|
0 7
|
||||||
0000000000 65535 f
|
0000000000 65535 f
|
||||||
0000000009 00000 n
|
0000000009 00000 n
|
||||||
0000000058 00000 n
|
0000000058 00000 n
|
||||||
0000000115 00000 n
|
0000000115 00000 n
|
||||||
0000000329 00000 n
|
0000000249 00000 n
|
||||||
0000000477 00000 n
|
0000000441 00000 n
|
||||||
0000000536 00000 n
|
0000000510 00000 n
|
||||||
0000000664 00000 n
|
|
||||||
trailer
|
trailer
|
||||||
<<
|
<<
|
||||||
/Size 8
|
/Size 7
|
||||||
/Root 1 0 R
|
/Root 1 0 R
|
||||||
>>
|
>>
|
||||||
startxref
|
startxref
|
||||||
768
|
595
|
||||||
%%EOF
|
%%EOF
|
||||||
|
|
|
||||||
2
tests/fixtures/encoding/shape-match.txt
vendored
2
tests/fixtures/encoding/shape-match.txt
vendored
|
|
@ -1 +1 @@
|
||||||
Shape
|
S
|
||||||
4
tests/fixtures/profiles/PROVENANCE.md
vendored
4
tests/fixtures/profiles/PROVENANCE.md
vendored
|
|
@ -305,3 +305,7 @@ bash scripts/check-provenance.sh
|
||||||
| scanned/documents/invoice-300dpi-scanned.pdf | pdftoppm + img2pdf from invoice-300dpi.pdf | MIT-0 | 2026-06-01 | 4ff1bc0bb34c66e65cc574c60b8c706c5d32d11f0ae98b1f39c3bc94443490e0 | Scan simulation for OCR testing (rasterized image-only PDF) |
|
| scanned/documents/invoice-300dpi-scanned.pdf | pdftoppm + img2pdf from invoice-300dpi.pdf | MIT-0 | 2026-06-01 | 4ff1bc0bb34c66e65cc574c60b8c706c5d32d11f0ae98b1f39c3bc94443490e0 | Scan simulation for OCR testing (rasterized image-only PDF) |
|
||||||
| scanned/multi-page/doc-10page-300dpi.pdf | tests/fixtures/scanned/generate_scanned_fixtures.py | MIT-0 | 2026-06-01 | e54269ac6e86b9abf966a601c94c7ecd40da8fcc541873c37ec7608392de380f | Source PDF for scan simulation at 300 DPI (10 pages with diverse content) |
|
| scanned/multi-page/doc-10page-300dpi.pdf | tests/fixtures/scanned/generate_scanned_fixtures.py | MIT-0 | 2026-06-01 | e54269ac6e86b9abf966a601c94c7ecd40da8fcc541873c37ec7608392de380f | Source PDF for scan simulation at 300 DPI (10 pages with diverse content) |
|
||||||
| scanned/multi-page/doc-10page-300dpi-scanned.pdf | pdftoppm + img2pdf from doc-10page-300dpi.pdf | MIT-0 | 2026-06-01 | 02c2751cd0e26b49f9cf538f9bbb407bbf4aea587d61a896d0e7e4d3f687ecd8 | Scan simulation for OCR testing (rasterized image-only PDF, 10 pages) |
|
| scanned/multi-page/doc-10page-300dpi-scanned.pdf | pdftoppm + img2pdf from doc-10page-300dpi.pdf | MIT-0 | 2026-06-01 | 02c2751cd0e26b49f9cf538f9bbb407bbf4aea587d61a896d0e7e4d3f687ecd8 | Scan simulation for OCR testing (rasterized image-only PDF, 10 pages) |
|
||||||
|
| encoding/no-mapping.pdf | tests/fixtures/generate_encoding_fixtures.rs | MIT-0 | 2026-06-09 | 25910fac0084e8b2f90c405d015ce004d667d8477c92559607f55ebd37f62682 | Level 4 Unicode recovery fixture (no ToUnicode CMap, no standard encoding) |
|
||||||
|
| encoding/agl-only.pdf | tests/fixtures/generate_encoding_fixtures.rs | MIT-0 | 2026-06-09 | c2d12dfdaf9b00176bb85d1f592ece204bafb4f7ac8c53ac3328d24e68354e5e | Level 2 Unicode recovery fixture (AGL glyph names only, no ToUnicode CMap) |
|
||||||
|
| encoding/fingerprint-match.pdf | tests/fixtures/generate_encoding_fixtures.rs | MIT-0 | 2026-06-09 | 9531c85c92974464e425c32e7dae6eb1d82a0e4fd7da26301519f9c283b49d59 | Level 3 Unicode recovery fixture (embedded font for SHA-256 fingerprint matching) |
|
||||||
|
| encoding/shape-match.pdf | tests/fixtures/generate_encoding_fixtures.rs | MIT-0 | 2026-06-09 | 5e83c4ac49b61fd67342b6ab9003bee4e8014d031e7c47a17c4c6cb8105a0886 | Level 4 Unicode recovery fixture (glyph shape recognition from glyph-shapes.json) |
|
||||||
|
|
|
||||||
Loading…
Add table
Reference in a new issue