feat(bf-2he4t): assemble scanned fixtures corpus with ground-truth transcripts

Created tests/fixtures/scanned/ directory structure for WER gate testing:
- README.md: Corpus overview and WER targets (<3% on clean 300-DPI scans)
- GEN_MANIFEST.md: Fixture specifications and generation checklist
- receipt/receipt-300dpi.txt: Ground truth for AS-02 test scenario (37 lines)
- documents/invoice-300dpi.txt: Business invoice ground truth (55 lines)
- documents/form-300dpi.txt: Employment application form (78 lines)
- multi-page/doc-10page-300dpi.txt: Performance fixture (255 lines, 10 pages)

Generation tools:
- generate_scanned_fixtures.py: Python script for PDF generation
- generate_scanned_fixtures.rs: Rust alternative for fixture metadata
- calculate_wer.py: WER/CER calculation utility for OCR validation

Test stub:
- wer_gate_stub.rs: Placeholder for WER gate tests (marked #[ignore])

Total ground-truth content: 425 lines across 4 fixtures

Next steps:
1. Generate PDFs from ground truth using generation script
2. Verify WER < 3% on generated fixtures
3. Enable WER gate tests

Closes bf-2he4t

parent 63a2da9f97
commit 3d795a2d11
10 changed files with 1223 additions and 0 deletions
115
tests/fixtures/scanned/GEN_MANIFEST.md
vendored
Normal file
@@ -0,0 +1,115 @@
# Scanned Fixtures Generation Manifest

This document tracks the generation status and specifications for all scanned fixtures.

## Fixture Specifications

### receipt-300dpi
- **Purpose**: AS-02 test scenario, basic receipt OCR
- **Ground Truth**: `receipt/receipt-300dpi.txt`
- **Target PDF**: `receipt/receipt-300dpi.pdf`
- **Specifications**:
  - Font: Helvetica 10pt
  - Page size: Letter (8.5" x 11")
  - Margins: 0.5" all sides
  - Line spacing: 14pt
  - Content: Supermarket receipt with items, prices, totals
- **WER Target**: < 3%
- **Status**: Ground truth created, PDF generation pending

### invoice-300dpi
- **Purpose**: Business document OCR testing
- **Ground Truth**: `documents/invoice-300dpi.txt`
- **Target PDF**: `documents/invoice-300dpi.pdf`
- **Specifications**:
  - Font: Helvetica 11pt
  - Page size: Letter (8.5" x 11")
  - Margins: 0.75" all sides
  - Line spacing: 16pt
  - Content: Service invoice with line items, totals, payment terms
- **WER Target**: < 3%
- **Status**: Ground truth created, PDF generation pending

### form-300dpi
- **Purpose**: Form structure OCR testing
- **Ground Truth**: `documents/form-300dpi.txt`
- **Target PDF**: `documents/form-300dpi.pdf`
- **Specifications**:
  - Font: Helvetica 11pt
  - Page size: Letter (8.5" x 11")
  - Margins: 0.75" all sides
  - Line spacing: 18pt
  - Content: Employment application form with fields and checkboxes
- **WER Target**: < 3%
- **Status**: Ground truth created, PDF generation pending

### doc-10page-300dpi
- **Purpose**: Multi-page performance testing
- **Ground Truth**: `multi-page/doc-10page-300dpi.txt`
- **Target PDF**: `multi-page/doc-10page-300dpi.pdf`
- **Specifications**:
  - Font: Times-Roman 12pt
  - Page size: Letter (8.5" x 11")
  - Margins: 1" left/right, 0.75" top/bottom
  - Line spacing: 18pt
  - Content: 10 pages with diverse content types
  - Page markers: "Page N:" format for explicit page breaks
- **WER Target**: < 3% average, no page > 5%
- **Performance Target**: < 30 seconds on 4-core CI
- **Status**: Ground truth created, PDF generation pending
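
The multi-page target (average below 3%, no page above 5%) needs a per-page breakdown. The sketch below is one possible way to compute it, not part of the committed tooling. It assumes the "Page N:" markers survive OCR at the start of a line, and it treats "average" as the unweighted mean of the per-page rates; the manifest does not say which average is meant. The helper names `word_errors`, `split_pages`, `per_page_wer`, and `passes_multipage_gate` are hypothetical.

```python
import re


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, reference word count)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1], len(ref)


def split_pages(text):
    """Group lines by the most recent 'Page N:' marker (assumed format)."""
    pages = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"Page (\d+):", line)
        if match:
            current = int(match.group(1))
            pages[current] = []
        if current is not None:
            pages[current].append(line)
    return {n: "\n".join(lines) for n, lines in pages.items()}


def per_page_wer(ground_truth, ocr_output):
    """Map page number to WER; a page missing from the OCR output scores 1.0."""
    ref_pages, hyp_pages = split_pages(ground_truth), split_pages(ocr_output)
    result = {}
    for n, ref_text in ref_pages.items():
        errors, total = word_errors(ref_text, hyp_pages.get(n, ""))
        result[n] = errors / total if total else 0.0
    return result


def passes_multipage_gate(page_wers, avg_limit=0.03, page_limit=0.05):
    """Unweighted average must be < avg_limit and no page may exceed page_limit."""
    if not page_wers:
        return False
    average = sum(page_wers.values()) / len(page_wers)
    return average < avg_limit and max(page_wers.values()) <= page_limit
```

If OCR misreads a marker (for example "Page l:" for "Page 1:"), that page's text is folded into the previous page, so a failing result should be checked against the raw OCR output before it is treated as a recognition problem.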

## Generation Checklist

For each fixture, complete these steps:

1. [ ] Verify ground truth `.txt` file exists and is complete
2. [ ] Run generation script: `python3 generate_scanned_fixtures.py <fixture-name>`
3. [ ] Verify generated PDF is readable and displays correctly
4. [ ] Test OCR extraction: `pdftract extract <pdf> --ocr --text`
5. [ ] Compute WER against ground truth: `python3 calculate_wer.py <ground_truth.txt> <ocr_output.txt>`
6. [ ] Update this manifest with WER result
7. [ ] If WER < 3%, mark as PASS; otherwise, investigate

## WER Results

To be populated after PDF generation and testing:

| Fixture | WER | Pass/Fail | Notes |
|---------|-----|-----------|-------|
| receipt-300dpi | TBD | TBD | - |
| invoice-300dpi | TBD | TBD | - |
| form-300dpi | TBD | TBD | - |
| doc-10page-300dpi | TBD | TBD | Per-page breakdown needed |
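
Filling in the results table by hand is error-prone once there are more fixtures. As a rough sketch (not part of the committed tooling), a row could be updated programmatically. The helper name `update_wer_row` is hypothetical, and it assumes the four-column layout of the table above with the fixture name in the first cell.

```python
def update_wer_row(manifest_text, fixture, wer, threshold=0.03):
    """Return manifest_text with the WER and Pass/Fail cells of one table row filled in."""
    updated = []
    for line in manifest_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        is_row = line.lstrip().startswith("|") and len(cells) == 4
        if is_row and cells[0] == fixture:
            verdict = "PASS" if wer < threshold else "FAIL"
            # Keep the existing Notes cell untouched
            line = f"| {fixture} | {wer:.2%} | {verdict} | {cells[3]} |"
        updated.append(line)
    trailing = "\n" if manifest_text.endswith("\n") else ""
    return "\n".join(updated) + trailing
```

Headings such as `### receipt-300dpi` are left alone because only lines that start with `|` are treated as table rows.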

## Dependencies

### Required for PDF Generation
- Python 3.8+
- reportlab: `pip3 install reportlab`
- (Optional) Pillow: `pip3 install Pillow`
- (Optional) img2pdf: `pip3 install img2pdf`

### Required for Scan Simulation
- poppler-utils: `apt-get install poppler-utils` (provides pdftoppm)

### Required for WER Calculation
- jiwer: `pip3 install jiwer`
- Or: the bundled `calculate_wer.py`, a basic pure-Python WER implementation that needs no extra packages
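
Before running the generator it can help to confirm which of these dependencies are actually present. The following is a minimal sketch using only the standard library; the function name `check_dependencies` is hypothetical, and the import names (`PIL` for Pillow, and so on) are the usual ones for these packages.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which generation and WER dependencies are available on this machine."""
    modules = {
        "reportlab": "reportlab",
        "Pillow": "PIL",
        "img2pdf": "img2pdf",
        "jiwer": "jiwer",
    }
    status = {
        name: importlib.util.find_spec(module) is not None
        for name, module in modules.items()
    }
    # pdftoppm ships with poppler-utils and must be on PATH for scan simulation
    status["pdftoppm"] = shutil.which("pdftoppm") is not None
    return status


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name:10} {'ok' if available else 'MISSING'}")
```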

## Manual Generation Alternative

If the generation script fails, generate the fixture manually:

1. Create a new document in LibreOffice/Word
2. Copy ground truth text from `.txt` file
3. Set the font specified for the fixture (Helvetica/Arial, or Times for `doc-10page-300dpi`) at the specified size
4. Set page size to Letter
5. Set margins as specified
6. Export to PDF
7. (Optional) Use a scanner or PDF printer to simulate scan at 300 DPI

## Related Beads

- bf-2he4t: Initial corpus assembly (this bead)
- (Future) WER gate implementation
- (Future) AS-02 test scenario implementation
96
tests/fixtures/scanned/README.md
vendored
Normal file
@@ -0,0 +1,96 @@
# Scanned PDF Fixtures for OCR Testing

This directory contains scanned PDF fixtures with ground-truth transcripts for Word Error Rate (WER) testing.

## Purpose

These fixtures support:
- **AS-02 test scenario**: Extract a scanned receipt via OCR
- **Tier 1 OCR gate**: WER < 3% on clean 300-DPI scans
- **Performance testing**: 10-page scanned PDF extraction in < 30 seconds

## Directory Structure

```
scanned/
├── README.md                    # This file
├── receipt/                     # Single-page receipt fixtures
│   ├── receipt-300dpi.pdf       # Clean receipt at 300 DPI
│   └── receipt-300dpi.txt       # Ground truth transcript
├── documents/                   # Various document type fixtures
│   ├── invoice-300dpi.pdf
│   ├── invoice-300dpi.txt
│   ├── form-300dpi.pdf
│   └── form-300dpi.txt
└── multi-page/                  # Multi-page fixtures for performance testing
    ├── doc-10page-300dpi.pdf
    └── doc-10page-300dpi.txt
```

## Generation Instructions

Use the provided generation script to create scanned PDFs:

```bash
# Install dependencies
# Python 3 with reportlab, PIL/Pillow, img2pdf
pip3 install reportlab Pillow img2pdf

# Generate all fixtures
cd tests/fixtures/scanned
python3 generate_scanned_fixtures.py
```

For manual generation:
1. Create a PDF from the `.txt` ground truth file using a Tesseract-friendly font (Arial, Helvetica, Times New Roman)
2. Use the font size specified for the fixture in `GEN_MANIFEST.md` (10pt to 12pt)
3. Use 300 DPI for the scan
4. Apply minimal preprocessing (no aggressive compression)

## WER Targets

- **Clean 300-DPI scans**: WER < 3%
- **Receipts**: WER < 3% (critical for totals, line items)
- **Multi-page documents**: Average WER < 3%, no page > 5%
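
The targets are strict inequalities, so a transcript with exactly 3% word errors fails. With WER defined as (substitutions + deletions + insertions) / reference words, a 200-word receipt can contain at most 5 word errors. The sketch below illustrates that boundary; `passes_wer_gate` is a hypothetical helper, not an existing function in this repository.

```python
def passes_wer_gate(word_errors, reference_words, limit_percent=3):
    """True when word_errors / reference_words is strictly below limit_percent."""
    if reference_words <= 0:
        raise ValueError("reference must contain at least one word")
    # Integer comparison avoids floating-point rounding right at the boundary
    return word_errors * 100 < limit_percent * reference_words


# 5 errors in 200 words is 2.5% and passes; 6 errors is exactly 3% and fails
print(passes_wer_gate(5, 200), passes_wer_gate(6, 200))
```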

## Verification

To verify WER on a fixture:

```bash
# Extract text with pdftract
pdftract extract tests/fixtures/scanned/receipt/receipt-300dpi.pdf --ocr --text > output.txt

# Compute WER (requires jiwer or similar)
python3 -c "
from jiwer import wer
# Collapse line breaks and repeated spaces so layout differences
# are not counted as word errors
with open('tests/fixtures/scanned/receipt/receipt-300dpi.txt', encoding='utf-8') as f:
    ground_truth = ' '.join(f.read().split())
with open('output.txt', encoding='utf-8') as f:
    hypothesis = ' '.join(f.read().split())
print(f'WER: {wer(ground_truth, hypothesis):.2%}')
"
```

## Fixtures Status

| Fixture | PDF | Ground Truth | WER Target | Status |
|---------|-----|--------------|------------|--------|
| receipt-300dpi | ❌ | ✅ | < 3% | PDF needed |
| invoice-300dpi | ❌ | ✅ | < 3% | PDF needed |
| form-300dpi | ❌ | ✅ | < 3% | PDF needed |
| doc-10page-300dpi | ❌ | ✅ | < 3% avg | PDF needed |
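
The PDF and Ground Truth columns can drift from what is actually on disk. As a sketch only, the check below derives them from the directory contents. The fixture names and subdirectories are taken from this README; the names `FIXTURES`, `fixture_status`, and `status_rows` are hypothetical.

```python
from pathlib import Path

# Fixture name -> subdirectory, as laid out in this README
FIXTURES = {
    "receipt-300dpi": "receipt",
    "invoice-300dpi": "documents",
    "form-300dpi": "documents",
    "doc-10page-300dpi": "multi-page",
}


def fixture_status(root):
    """Return {fixture: (has_pdf, has_txt)} for the fixtures under root."""
    root = Path(root)
    return {
        name: (
            (root / subdir / f"{name}.pdf").is_file(),
            (root / subdir / f"{name}.txt").is_file(),
        )
        for name, subdir in FIXTURES.items()
    }


def status_rows(root):
    """Render the first three table columns for each fixture."""
    mark = {True: "✅", False: "❌"}
    return [
        f"| {name} | {mark[has_pdf]} | {mark[has_txt]} |"
        for name, (has_pdf, has_txt) in fixture_status(root).items()
    ]
```

Calling `status_rows("tests/fixtures/scanned")` from the repository root would list one row per fixture in the order shown in the table.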

## Adding New Fixtures

1. Create the ground truth `.txt` file with the exact content
2. Generate the corresponding `.pdf` using the generation script or manually
3. Add the fixture to this README's table
4. Update generation script if applicable

## Notes

- All fixtures are in English and use the Tesseract `eng` traineddata
- Fonts should be standard: Arial, Helvetica, Times New Roman, or Courier
- Avoid decorative fonts, handwriting, or unusual layouts for baseline fixtures
- For challenging fixtures, consider creating a separate `challenging/` subdirectory
130
tests/fixtures/scanned/calculate_wer.py
vendored
Executable file
@@ -0,0 +1,130 @@
#!/usr/bin/env python3
"""
Calculate Word Error Rate (WER) between ground truth and OCR output.

Usage:
    python3 calculate_wer.py <ground_truth.txt> <ocr_output.txt> [--cer] [-v]

Requirements:
    None beyond the Python standard library (jiwer is not used here).

Exit status:
    0 if WER < 3%; 1 if WER >= 3% or an input file is missing.
"""

import sys
import argparse
from pathlib import Path


def calculate_wer_basic(ground_truth, hypothesis):
    """
    Calculate WER using basic Levenshtein distance.
    WER = (S + D + I) / N
    where S = substitutions, D = deletions, I = insertions, N = total words in reference
    """
    # Words are whitespace-separated tokens; case and punctuation are compared as-is
    gt_words = ground_truth.strip().split()
    hyp_words = hypothesis.strip().split()

    if len(gt_words) == 0:
        return 1.0 if len(hyp_words) > 0 else 0.0

    # Dynamic programming for edit distance
    m, n = len(gt_words), len(hyp_words)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if gt_words[i - 1] == hyp_words[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,      # deletion
                    dp[i][j - 1] + 1,      # insertion
                    dp[i - 1][j - 1] + 1   # substitution
                )

    return dp[m][n] / len(gt_words)


def calculate_cer_basic(ground_truth, hypothesis):
    """
    Calculate Character Error Rate (CER) using basic Levenshtein distance.
    CER = (S + D + I) / N
    where N = total characters in reference
    """
    # Characters include internal whitespace and newlines; only the ends are stripped
    gt_chars = list(ground_truth.strip())
    hyp_chars = list(hypothesis.strip())

    if len(gt_chars) == 0:
        return 1.0 if len(hyp_chars) > 0 else 0.0

    m, n = len(gt_chars), len(hyp_chars)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if gt_chars[i - 1] == hyp_chars[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(
                    dp[i - 1][j] + 1,      # deletion
                    dp[i][j - 1] + 1,      # insertion
                    dp[i - 1][j - 1] + 1   # substitution
                )

    return dp[m][n] / len(gt_chars)


def main():
    parser = argparse.ArgumentParser(description='Calculate WER/CER for OCR evaluation')
    parser.add_argument('ground_truth', help='Path to ground truth text file')
    parser.add_argument('hypothesis', help='Path to OCR output text file')
    parser.add_argument('--cer', action='store_true', help='Also calculate CER')
    parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    args = parser.parse_args()

    gt_path = Path(args.ground_truth)
    hyp_path = Path(args.hypothesis)

    if not gt_path.exists():
        print(f"Error: Ground truth file not found: {gt_path}", file=sys.stderr)
        sys.exit(1)

    if not hyp_path.exists():
        print(f"Error: Hypothesis file not found: {hyp_path}", file=sys.stderr)
        sys.exit(1)

    with open(gt_path, 'r', encoding='utf-8') as f:
        ground_truth = f.read()

    with open(hyp_path, 'r', encoding='utf-8') as f:
        hypothesis = f.read()

    wer = calculate_wer_basic(ground_truth, hypothesis)
    print(f"WER: {wer:.4f} ({wer * 100:.2f}%)")

    if args.cer:
        cer = calculate_cer_basic(ground_truth, hypothesis)
        print(f"CER: {cer:.4f} ({cer * 100:.2f}%)")

    if args.verbose:
        gt_words = ground_truth.strip().split()
        hyp_words = hypothesis.strip().split()
        print(f"\nReference words: {len(gt_words)}")
        print(f"Hypothesis words: {len(hyp_words)}")
        print(f"Reference chars: {len(ground_truth.strip())}")
        print(f"Hypothesis chars: {len(hypothesis.strip())}")

    # Return exit code based on WER threshold (3%)
    sys.exit(0 if wer < 0.03 else 1)


if __name__ == "__main__":
    main()
78
tests/fixtures/scanned/documents/form-300dpi.txt
vendored
Normal file
@@ -0,0 +1,78 @@
APPLICATION FOR EMPLOYMENT

Position Applied: _________________________________
Date of Application: ______________________________

PERSONAL INFORMATION

First Name: ______________________ Middle: _______ Last Name: ______________________

Street Address: ___________________________________________________________________

City: _________________________ State: ____ ZIP Code: __________ Country: ___________

Email: ______________________________________________________________________________

Phone: (_______) _______-________ Cell: (_______) _______-________

Are you authorized to work in the United States? [ ] Yes [ ] No

Will you now or in the future require sponsorship? [ ] Yes [ ] No

AVAILABILITY

Date available to start: _________________________ Desired salary: __________________

Are you available for: Full-time [ ] Part-time [ ] Contract [ ]

Are you willing to relocate? [ ] Yes [ ] No Are you willing to travel? [ ] Yes [ ] No

EDUCATION

High School: ________________________________ Graduated: _____ Diploma: [ ] Yes [ ] GED

College/University: __________________________ Graduated: _____ Degree: _______________

Major: ________________________________________________________ GPA: ________

Graduate School: _____________________________ Graduated: _____ Degree: _______________

Major: ________________________________________________________ GPA: ________

EMPLOYMENT HISTORY

Employer 1:
Company: ___________________________________________________________
Position: ____________________________ From: ________ To: ________
Starting Salary: ______________ Ending Salary: ______________
Reason for leaving: _______________________________________________
Duties: __________________________________________________________

Employer 2:
Company: ___________________________________________________________
Position: ____________________________ From: ________ To: ________
Starting Salary: ______________ Ending Salary: ______________
Reason for leaving: _______________________________________________
Duties: __________________________________________________________

REFERENCES

Reference 1: Reference 2:
Name: _______________________ Name: _______________________
Relationship: _______________ Relationship: _______________
Phone: ______________________ Phone: ______________________

Reference 3: Reference 4:
Name: _______________________ Name: _______________________
Relationship: _______________ Relationship: _______________
Phone: ______________________ Phone: ______________________

CERTIFICATION

I certify that all information provided in this application is true and complete. I understand that any false information or omission may result in disqualification or termination.

Applicant Signature: __________________________ Date: _________________

For Office Use Only:
Interviewed by: _______________ Date: _______ Rating: _________
Hired: [ ] Yes [ ] No Start Date: _____________
55
tests/fixtures/scanned/documents/invoice-300dpi.txt
vendored
Normal file
@@ -0,0 +1,55 @@
INVOICE

Invoice Number: INV-2026-0542
Date Issued: May 28, 2026
Due Date: June 27, 2026

FROM:
Tech Solutions Inc.
456 Innovation Drive
Silicon Valley, CA 94025
Email: billing@techsolutions.example.com
Phone: (555) 987-6543

TO:
Global Enterprises Ltd.
789 Business Park Avenue
Metropolis, NY 10001
Attention: Accounts Payable Department

Bill To:
Global Enterprises Ltd.
789 Business Park Avenue
Metropolis, NY 10001

Service Period: May 1, 2026 - May 31, 2026
Purchase Order: PO-2026-7854

Description                        Hours          Rate            Amount
------------------------------------------------------------------------
Cloud Infrastructure Services        160        $85.00        $13,600.00
Software Development                 120       $125.00        $15,000.00
System Maintenance & Support          40        $95.00         $3,800.00
Database Optimization                 25       $110.00         $2,750.00
Security Audit & Compliance           15       $150.00         $2,250.00
Technical Consulting                  20       $135.00         $2,700.00
Project Management                    30       $120.00         $3,600.00
------------------------------------------------------------------------
Subtotal                                                      $43,700.00
Discount (Early Payment 2%)                                    ($874.00)
Tax (Sales Tax 8.25%)                                          $3,534.45
------------------------------------------------------------------------
TOTAL                                                         $46,360.45

Payment Terms: Net 30, 2% discount if paid within 10 days
Payment Methods: Bank Transfer, Credit Card, Check

Bank Transfer Details:
Bank: First National Bank
Account: Tech Solutions Inc.
Account Number: **** 4567
Routing Number: 123456789

Please include invoice number on payment.

Thank you for your business!
269
tests/fixtures/scanned/generate_scanned_fixtures.py
vendored
Executable file
@@ -0,0 +1,269 @@
#!/usr/bin/env python3
"""
Generate scanned PDF fixtures from ground truth text files.

This script creates proper 300 DPI PDFs from ground truth text files for OCR testing.
Usage:
    python3 generate_scanned_fixtures.py                 # all fixtures
    python3 generate_scanned_fixtures.py <fixture-name>  # a single fixture

Requirements:
    pip3 install reportlab Pillow img2pdf
    pdftoppm from poppler-utils (for the rasterization step)
"""

import os
import shutil
import subprocess
import sys
import tempfile
import traceback
from pathlib import Path

# Check for required dependencies
try:
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import letter
    from reportlab.lib.units import inch
    from reportlab.lib.utils import simpleSplit
    from reportlab.pdfbase import pdfmetrics
except ImportError:
    print("Error: reportlab is not installed.")
    print("Install with: pip3 install reportlab")
    sys.exit(1)

try:
    from PIL import Image
except ImportError:
    Image = None
    print("Warning: Pillow not installed, rasterization step will be skipped.")
    print("Install with: pip3 install Pillow")

# Fixture configuration
FIXTURES = [
    {
        "name": "receipt-300dpi",
        "dir": "receipt",
        "font": "Helvetica",
        "font_size": 10,
        "page_size": letter,
        "margins": {"left": 0.5 * inch, "top": 0.5 * inch, "right": 0.5 * inch, "bottom": 0.5 * inch},
        "line_spacing": 14,
    },
    {
        "name": "invoice-300dpi",
        "dir": "documents",
        "font": "Helvetica",
        "font_size": 11,
        "page_size": letter,
        "margins": {"left": 0.75 * inch, "top": 0.75 * inch, "right": 0.75 * inch, "bottom": 0.75 * inch},
        "line_spacing": 16,
    },
    {
        "name": "form-300dpi",
        "dir": "documents",
        "font": "Helvetica",
        "font_size": 11,
        "page_size": letter,
        "margins": {"left": 0.75 * inch, "top": 0.75 * inch, "right": 0.75 * inch, "bottom": 0.75 * inch},
        "line_spacing": 18,
    },
    {
        "name": "doc-10page-300dpi",
        "dir": "multi-page",
        "font": "Times-Roman",
        "font_size": 12,
        "page_size": letter,
        "margins": {"left": 1.0 * inch, "top": 0.75 * inch, "right": 1.0 * inch, "bottom": 0.75 * inch},
        "line_spacing": 18,
        "multi_page": True,
        # Template for the explicit page-break markers in the ground truth
        "page_marker": "Page {n}:",
    },
]


def wrap_line(line, config, max_width):
    """Wrap a line that is wider than the drawing area; keep shorter lines verbatim."""
    if pdfmetrics.stringWidth(line, config["font"], config["font_size"]) <= max_width:
        return [line]
    # simpleSplit breaks on whitespace, so runs of spaces in wrapped lines collapse
    return simpleSplit(line, config["font"], config["font_size"], max_width)


def create_pdf_from_text(source_text_path, output_pdf_path, config):
    """Create a PDF from text using reportlab."""
    # Read the ground truth text
    with open(source_text_path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Create PDF canvas (path passed as str for compatibility across reportlab versions)
    page_width, page_height = config["page_size"]
    c = canvas.Canvas(str(output_pdf_path), pagesize=config["page_size"])

    # Set font
    c.setFont(config["font"], config["font_size"])

    # Calculate drawing area
    left_margin = config["margins"]["left"]
    top_margin = config["margins"]["top"]
    right_margin = config["margins"]["right"]
    bottom_margin = config["margins"]["bottom"]

    max_width = page_width - left_margin - right_margin
    y_position = page_height - top_margin

    def new_page():
        nonlocal y_position
        c.showPage()
        # showPage resets the canvas state, so the font must be set again
        c.setFont(config["font"], config["font_size"])
        y_position = page_height - top_margin

    use_markers = config.get("multi_page") and config.get("page_marker")
    next_page = 1

    # Process text line by line
    for line in text.split('\n'):
        # An explicit page marker starts a new page (except for the first page)
        if use_markers and line.startswith(config["page_marker"].format(n=next_page)):
            if next_page > 1:
                new_page()
            next_page += 1

        # Long paragraphs are wrapped so that no text runs off the page
        for segment in wrap_line(line, config, max_width):
            # Check if we need a new page
            if y_position < bottom_margin + config["line_spacing"]:
                new_page()
            c.drawString(left_margin, y_position, segment)
            y_position -= config["line_spacing"]

    c.save()
    print(f"  Created: {output_pdf_path}")


def rasterize_pdf_to_scanned(pdf_path, scanned_pdf_path, dpi=300):
    """Rasterize a PDF back to PDF at specified DPI (simulating a scan).

    Falls back to copying the original PDF when rasterization is not possible.
    """
    try:
        if Image is None:
            raise RuntimeError("Pillow is not installed")

        # Use pdftoppm to convert PDF to images at specified DPI
        with tempfile.TemporaryDirectory() as tmpdir:
            result = subprocess.run(
                ["pdftoppm", "-r", str(dpi), str(pdf_path), os.path.join(tmpdir, "page")],
                capture_output=True,
                text=True,
            )
            if result.returncode != 0:
                raise RuntimeError(f"pdftoppm failed: {result.stderr.strip()}")

            images = sorted(Path(tmpdir).glob("page-*.ppm"))
            if not images:
                raise RuntimeError("pdftoppm produced no images")

            # Convert images back to PDF using img2pdf or PIL
            try:
                import img2pdf
                with open(scanned_pdf_path, "wb") as f:
                    f.write(img2pdf.convert([str(img) for img in images]))
            except ImportError:
                # Fallback to PIL
                pdf_images = [Image.open(str(p)).convert('RGB') for p in images]
                pdf_images[0].save(
                    scanned_pdf_path,
                    save_all=True,
                    append_images=pdf_images[1:],
                    resolution=dpi,
                )
        print(f"  Created scanned: {scanned_pdf_path}")

    except Exception as e:
        print(f"  Warning: Rasterization failed ({e}), using original PDF")
        shutil.copy(pdf_path, scanned_pdf_path)


def generate_fixture(fixture, script_dir):
    """Generate the PDF and the simulated scan for one fixture. Returns True on success."""
    name = fixture["name"]
    fixture_dir = script_dir / fixture["dir"]
    txt_path = fixture_dir / f"{name}.txt"
    pdf_path = fixture_dir / f"{name}.pdf"

    print(f"Generating {name}...")

    if not txt_path.exists():
        print(f"  Error: {txt_path} not found")
        return False

    try:
        # Create the PDF from text
        create_pdf_from_text(txt_path, pdf_path, fixture)

        # Optionally rasterize to simulate a scan
        # This step requires pdftoppm (poppler-utils)
        scanned_path = fixture_dir / f"{name}-scanned.pdf"
        rasterize_pdf_to_scanned(pdf_path, scanned_path, dpi=300)

        print(f"  Success: {name}")
        return True
    except Exception as e:
        print(f"  Error generating {name}: {e}")
        traceback.print_exc()
        return False


def generate_all_fixtures():
    """Generate all fixture PDFs."""
    script_dir = Path(__file__).parent
    for fixture in FIXTURES:
        generate_fixture(fixture, script_dir)


def main():
    """Main entry point."""
    print("Generating scanned fixture PDFs...")
    print("=" * 60)

    if len(sys.argv) > 1:
        # Generate specific fixture
        fixture_name = sys.argv[1]
        matches = [f for f in FIXTURES if f["name"] == fixture_name]
        if not matches:
            print(f"Unknown fixture: {fixture_name}")
            print(f"Available fixtures: {', '.join(f['name'] for f in FIXTURES)}")
            sys.exit(1)
        generate_fixture(matches[0], Path(__file__).parent)
    else:
        # Generate all fixtures
        generate_all_fixtures()

    print("=" * 60)
    print("Done!")


if __name__ == "__main__":
    main()
118
tests/fixtures/scanned/generate_scanned_fixtures.rs
vendored
Normal file
|
|
@ -0,0 +1,118 @@
|
|||
//! Generate scanned fixture PDFs from ground truth text files.
|
||||
//!
|
||||
//! This is a Rust-native alternative to the Python generator.
|
||||
//! Run with: cargo run --bin generate_scanned_fixtures
|
||||
|
||||
use std::fs::{self, File};
|
||||
use std::io::{BufWriter, Write};
|
||||
use std::path::Path;
|
||||
|
||||
fn main() -> Result<(), Box<dyn std::error::Error>> {
|
||||
println!("Generating scanned fixture metadata...");
|
||||
|
||||
// Ensure directories exist
|
||||
create_directories()?;
|
||||
|
||||
// Generate fixture metadata
|
||||
generate_fixture_metadata()?;
|
||||
|
||||
println!("\nScanned fixtures corpus structure created.");
|
||||
println!("\nNOTE: Actual PDF generation requires external tools.");
|
||||
println!("Options:");
|
||||
println!(" 1. Use Python script: generate_scanned_fixtures.py");
|
||||
println!(" 2. Manual generation (see GEN_MANIFEST.md)");
|
||||
println!(" 3. Use printpdf or similar crate for native Rust generation");
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn create_directories() -> Result<(), Box<dyn std::error::Error>> {
|
||||
let dirs = [
|
||||
"tests/fixtures/scanned/receipt",
|
||||
"tests/fixtures/scanned/documents",
|
||||
"tests/fixtures/scanned/multi-page",
|
||||
];
|
||||
|
||||
for dir in &dirs {
|
||||
fs::create_dir_all(dir)?;
|
||||
println!("Created directory: {}", dir);
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn generate_fixture_metadata() -> Result<(), Box<dyn std::error::Error>> {
|
||||
// Create a simple fixture list for reference
|
||||
let fixtures = vec![
|
||||
FixtureSpec {
|
||||
name: "receipt-300dpi",
|
||||
dir: "receipt",
|
||||
font: "Helvetica",
|
||||
font_size: 10,
|
||||
pages: 1,
|
||||
wer_target: 3.0,
|
||||
},
|
||||
FixtureSpec {
|
||||
name: "invoice-300dpi",
|
||||
dir: "documents",
|
||||
font: "Helvetica",
|
||||
font_size: 11,
|
||||
pages: 1,
|
||||
wer_target: 3.0,
|
||||
},
|
||||
FixtureSpec {
|
||||
name: "form-300dpi",
|
||||
dir: "documents",
|
||||
font: "Helvetica",
|
||||
font_size: 11,
|
||||
pages: 1,
|
||||
wer_target: 3.0,
|
||||
},
|
||||
FixtureSpec {
|
||||
name: "doc-10page-300dpi",
|
||||
dir: "multi-page",
|
||||
font: "Times-Roman",
|
||||
font_size: 12,
|
||||
pages: 10,
|
||||
wer_target: 3.0,
|
||||
},
|
||||
];
|
||||
|
||||
    let manifest_path = "tests/fixtures/scanned/.fixtures.json";
    let file = File::create(manifest_path)?;
    let mut writer = BufWriter::new(file);

    writeln!(writer, "{{")?;
    writeln!(writer, "  \"fixtures\": [")?;

    for (i, fixture) in fixtures.iter().enumerate() {
        // Terminate the previous object with a comma so entries are cleanly separated.
        if i > 0 {
            writeln!(writer, ",")?;
        }
        writeln!(writer, "    {{")?;
        writeln!(writer, r#"      "name": "{}","#, fixture.name)?;
        writeln!(writer, r#"      "dir": "{}","#, fixture.dir)?;
        writeln!(writer, r#"      "font": "{}","#, fixture.font)?;
        writeln!(writer, r#"      "font_size": {},"#, fixture.font_size)?;
        writeln!(writer, r#"      "pages": {},"#, fixture.pages)?;
        writeln!(writer, r#"      "wer_target": {}"#, fixture.wer_target)?;
        write!(writer, "    }}")?;
    }

    writeln!(writer, "\n  ]")?;
    writeln!(writer, "}}")?;

    // Flush explicitly: BufWriter discards write errors when it is dropped.
    writer.flush()?;

    println!("Created fixture manifest: {}", manifest_path);

    Ok(())
}

struct FixtureSpec<'a> {
    name: &'a str,
    dir: &'a str,
    font: &'a str,
    font_size: u32,
    pages: u32,
    wer_target: f64,
}
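The manifest writer interpolates fixture names straight into JSON string literals, which is safe for the static values listed here but would produce invalid JSON if a name ever contained a quote or backslash. A minimal sketch of an escaping helper follows; `json_escape` and `write_json_str_field` are hypothetical names introduced for illustration and are not part of this commit.

```rust
use std::io::{self, Write};

/// Hypothetical helper (not in the commit): escapes the characters that would
/// otherwise break a hand-written JSON string literal.
fn json_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical helper: writes one `"key": "value"` line, escaping the value.
/// `last` controls whether a trailing comma is emitted.
fn write_json_str_field<W: Write>(
    w: &mut W,
    key: &str,
    value: &str,
    last: bool,
) -> io::Result<()> {
    let comma = if last { "" } else { "," };
    writeln!(w, "      \"{}\": \"{}\"{}", key, json_escape(value), comma)
}
```

If the project already depends on a JSON crate, serializing the specs through it would likely be simpler than hand-written output; the sketch only shows what the manual approach needs to stay well-formed.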
255
tests/fixtures/scanned/multi-page/doc-10page-300dpi.txt
vendored
Normal file
@ -0,0 +1,255 @@
Page 1: INTRODUCTION

This document serves as a comprehensive test fixture for OCR performance evaluation across multiple pages. The fixture contains ten pages of diverse content types to stress-test the OCR pipeline while providing reproducible benchmarks for performance regression testing.

The primary objective is to measure OCR processing time and accuracy on a multi-page document. The performance target is to complete OCR on all ten pages in less than thirty seconds on a standard four-core CI runner. The accuracy target is a Word Error Rate (WER) of less than three percent.

Page 2: TEXT HEAVY CONTENT

Chapter One: Overview

Optical Character Recognition (OCR) technology has evolved significantly over the past decade. Modern OCR systems can achieve high accuracy rates on clean documents with standard fonts and good resolution. The key factors affecting OCR accuracy include scan quality, document complexity, font type, and language model quality.

Tesseract OCR, an open-source engine maintained by Google, supports over one hundred languages and provides competitive accuracy for many document types. The integration of Tesseract into document processing pipelines requires careful configuration of preprocessing steps, page segmentation modes, and language models.

This paragraph tests the system's ability to handle standard English prose with common vocabulary and sentence structures. The text should be recognized with minimal errors when scanned at three hundred dots per inch using a clear, readable font.

Page 3: FORM-LIKE STRUCTURE

SERVICE REQUEST FORM

Request ID: _______________ Date: _______________ Priority: [ ] High [ ] Medium [ ] Low

Customer Information:
Name: _____________________________________________ Account Number: _________________
Organization: ______________________________________ Email: _____________________________
Address: __________________________________________ Phone: ____________________________
City: _______________ State: ___ ZIP: _______________

Service Details:
Service Type: [ ] Installation [ ] Maintenance [ ] Repair [ ] Consultation
Equipment Model: ________________________________ Serial Number: ____________________
Problem Description: ___________________________________________________________________
_______________________________________________________________________________________

Preferred Appointment: ___ / ___ / _______ Time: ________ AM / PM

Technician Notes:
_______________________________________________________________________________________
_______________________________________________________________________________________

Customer Signature: __________________________ Date: _________ Technician: _____________

Page 4: TABLE DATA

QUARTERLY SALES REPORT - Q2 2026

+------------------+--------+--------+--------+--------+---------+
| Region | April | May | June | Total | Growth |
+------------------+--------+--------+--------+--------+---------+
| Northeast | 45,200 | 47,800 | 51,300 | 144,300| +13.5% |
| Southeast | 38,500 | 40,100 | 42,900 | 121,500| +11.4% |
| Midwest | 52,300 | 49,700 | 54,600 | 156,600| +4.4% |
| Southwest | 41,800 | 44,200 | 46,700 | 132,700| +11.7% |
| Northwest | 35,900 | 37,500 | 39,200 | 112,600| +9.2% |
| West | 48,700 | 51,300 | 53,800 | 153,800| +10.5% |
+------------------+--------+--------+--------+--------+---------+
| TOTAL | 262,400| 270,600| 288,500| 821,500| +9.9% |
+------------------+--------+--------+--------+--------+---------+

Key Metrics:
- Best Performing Region: Midwest ($156,600)
- Highest Growth Rate: Northeast (+13.5%)
- Quarterly Goal: $800,000 - ACHIEVED
- Year-to-Date: $1,645,000

Page 5: TECHNICAL SPECIFICATIONS

API Documentation: DocumentProcessor

Class: DocumentProcessor
Package: com.example.ocr.processing

Constructor:
DocumentProcessor(OCREngine engine, ProcessingOptions options)

Methods:
+ ExtractionResult processDocument(InputStream pdfStream)
+ List<TextRegion> extractTextRegions(Page page)
+ BufferedImage preprocessImage(BufferedImage image, PreprocessMode mode)
+ void setLanguage(List<String> languageCodes)
+ ProcessingStatistics getStatistics()

Configuration Options:
- dpi: Integer (default: 300) - Rendering resolution for OCR
- pageSegmentationMode: PSM (default: AUTO) - Page layout analysis
- ocrEngineMode: OEM (default: LSTM_ONLY) - Neural network engine
- whitelist: String (default: null) - Character whitelist
- blacklist: String (default: null) - Character blacklist

Example Usage:
OCREngine tesseract = new TesseractOCREngine();
ProcessingOptions options = new ProcessingOptions.Builder()
.setDpi(300)
.setPageSegmentationMode(PSM.AUTO)
.addLanguage("eng")
.build();
DocumentProcessor processor = new DocumentProcessor(tesseract, options);
ExtractionResult result = processor.processDocument(pdfInputStream);

Page 6: LEGAL TEXT

SOFTWARE LICENSE AGREEMENT

1. GRANT OF LICENSE
Subject to the terms of this agreement, the Licensor grants you a non-exclusive, non-transferable license to use the Software for internal business operations. The Software may be installed on up to five computers within your organization.

2. RESTRICTIONS
You may not: (a) modify, adapt, or create derivative works; (b) reverse engineer, decompile, or disassemble the Software; (c) distribute, transfer, or sublicense the Software to any third party; (d) use the Software for competitive analysis or benchmarking.

3. INTELLECTUAL PROPERTY
All intellectual property rights in the Software, including patents, copyrights, trade secrets, and trademarks, remain the exclusive property of the Licensor. You acknowledge that the Software contains proprietary and confidential information.

4. WARRANTY DISCLAIMER
THE SOFTWARE IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE LICENSOR BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY.

5. TERMINATION
This license is effective until terminated. Your rights under this license will terminate automatically without notice if you fail to comply with any term. Upon termination, you must cease all use of the Software and destroy all copies.

Page 7: FINANCIAL STATEMENT

BALANCE SHEET - As of December 31, 2026

ASSETS
Current Assets:
Cash and Cash Equivalents $ 245,800
Accounts Receivable $ 178,500
Inventory $ 125,300
Prepaid Expenses $ 18,200
Total Current Assets $ 567,800

Non-Current Assets:
Property, Plant & Equipment $ 785,000
Less: Accumulated Depreciation ($ 245,000)
Net PPE $ 540,000
Intangible Assets $ 95,000
Long-term Investments $ 125,000
Total Non-Current Assets $ 760,000

TOTAL ASSETS $1,327,800

LIABILITIES AND EQUITY
Current Liabilities:
Accounts Payable $ 125,500
Accrued Expenses $ 45,200
Short-term Debt $ 75,000
Total Current Liabilities $ 245,700

Long-term Liabilities:
Long-term Debt $ 350,000
Deferred Tax Liability $ 28,500
Total Long-term Liabilities $ 378,500

Shareholders' Equity:
Common Stock $ 250,000
Retained Earnings $ 453,600
Total Equity $ 703,600

TOTAL LIABILITIES AND EQUITY $1,327,800

Page 8: CORRESPONDENCE

Dear Valued Customer,

We are writing to inform you of important updates to your service account that will take effect on July 1st, 2026. These changes are part of our ongoing commitment to provide you with the highest quality service and support.

Account Details:
- Account Number: ACCT-2026-78542
- Service Plan: Premium Business
- Current Monthly Rate: $89.99
- New Monthly Rate: $94.99

What is changing:
- Enhanced security monitoring at no additional cost
- 24/7 priority customer support
- Monthly usage analytics reporting
- Extended data retention from 30 to 90 days

Action Required:
Please confirm your acceptance of these updates by signing the enclosed authorization form and returning it by June 15th, 2026. If you have any questions or concerns, please contact our customer service team.

Customer Service Contact:
- Phone: 1-800-555-0199
- Email: support@service.example.com
- Hours: Monday through Friday, 8:00 AM to 8:00 PM EST

Thank you for your continued business. We value your relationship and look forward to serving you in the years to come.

Sincerely,
Customer Relations Department
Service Solutions Inc.

Page 9: SCIENTIFIC CONTENT

Abstract: Evaluation of OCR Accuracy Metrics

This study presents a comprehensive evaluation of Word Error Rate (WER) as a primary metric for assessing Optical Character Recognition system performance. We conducted experiments across five document categories, four font families, and three scanning resolutions.

Methodology:
Test Corpus: Five hundred documents sourced from public domain literature
- One hundred business documents (invoices, receipts, forms)
- One hundred technical documents (specifications, manuals)
- One hundred literary works (novels, essays)
- One hundred academic papers (journal articles)
- One hundred legal documents (contracts, agreements)

Font Evaluation: Arial, Times New Roman, Helvetica, Courier
Resolution Testing: 200 DPI, 300 DPI, 400 DPI

Results:
WER by Font Family (300 DPI):
- Arial: 1.8%
- Times New Roman: 2.1%
- Helvetica: 1.9%
- Courier: 2.4%

WER by Resolution (Arial font):
- 200 DPI: 4.2%
- 300 DPI: 1.8%
- 400 DPI: 1.5%

Conclusion:
Three hundred DPI provides the optimal balance between accuracy and processing efficiency for most document types. Serif fonts exhibit slightly higher WER than sans-serif fonts. Monospace fonts show the highest error rates due to character spacing ambiguity.

Page 10: SUMMARY AND CONCLUSION

This ten-page fixture document has demonstrated the following content types for OCR testing:

Content Distribution:
1. Introduction and Overview
2. Text-heavy Technical Documentation
3. Form with Fields and Checkboxes
4. Tabular Data with Formatting
5. API Technical Specifications
6. Legal Terms and Conditions
7. Financial Balance Sheet
8. Business Correspondence
9. Scientific Abstract and Methodology
10. Summary and Conclusions

Performance Benchmarks:
- Target Processing Time: < 30 seconds (10 pages at ~3 seconds per page)
- Target Throughput: > 20 pages per minute on 4-core CI runner
- Target Memory Usage: < 500 MB per worker thread
- Target WER: < 3% average across all pages

Quality Metrics:
- Clean Text WER: < 2% (pages with standard prose)
- Table Cell Accuracy: > 95% (tabular data pages)
- Form Field Accuracy: > 90% (forms and structured documents)
- Overall Document WER: < 3% (comprehensive measure)

Next Steps:
For comprehensive OCR validation, process this fixture using the standard pipeline and report per-page WER statistics. Identify any pages exceeding 5% WER for manual review and potential preprocessing optimization.

End of Test Fixture Document.
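Reporting per-page WER for this fixture requires cutting both the ground truth and the OCR output into pages first. The sketch below splits a transcript on the `Page N:` heading lines used in this file. It assumes those headings survive OCR intact, which may not hold in practice, and the function names `split_pages` and `is_page_heading` are hypothetical, not part of the commit.

```rust
/// Splits a transcript into pages at each `Page N:` heading line.
/// Text before the first heading (if any) becomes its own leading chunk.
fn split_pages(transcript: &str) -> Vec<String> {
    let mut pages: Vec<String> = Vec::new();
    for line in transcript.lines() {
        // Start a new page at every heading, and also for the very first line.
        if is_page_heading(line) || pages.is_empty() {
            pages.push(String::new());
        }
        let current = pages.last_mut().expect("at least one page exists");
        current.push_str(line);
        current.push('\n');
    }
    pages
}

/// True for lines shaped like `Page 7: FINANCIAL STATEMENT`.
fn is_page_heading(line: &str) -> bool {
    match line.strip_prefix("Page ") {
        Some(rest) => match rest.split_once(':') {
            Some((num, _)) => !num.is_empty() && num.chars().all(|c| c.is_ascii_digit()),
            None => false,
        },
        None => false,
    }
}
```

A more robust pipeline would probably take page boundaries from the PDF itself rather than from recognized text, since a misread heading would merge two pages in this approach.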
37
tests/fixtures/scanned/receipt/receipt-300dpi.txt
vendored
Normal file
@ -0,0 +1,37 @@
SUPERMARKET RECEIPT

Store: Fresh Groceries Market
Address: 123 Main Street, Anytown, USA
Phone: (555) 123-4567

Date: 05/28/2026 Time: 14:35:42
Register: 3 Transaction: 104257 Cashier: 45

ITEM QTY PRICE TOTAL
----------------------------------------------------------------
MILK 2% GALLON 1 $4.29 $4.29
BREAD WHOLE WHEAT 1 $2.99 $2.99
EGGS LARGE DOZEN 1 $3.49 $3.49
BANANAS LB 2 $0.59 $1.18
APPLES GALA LB 1 $1.79 $1.79
CHICKEN BREAST 2 LB 1 $8.98 $8.98
PASTA SPAGHETTI 1 LB 2 $1.29 $2.58
TOMATO SAUCE 24 OZ 2 $2.19 $4.38
CHEESE CHEDDAR 8 OZ 1 $3.79 $3.79
YOGURT GREEK 4 PK 1 $5.49 $5.49
COFFEE GROUND 12 OZ 1 $7.99 $7.99
PAPER TOWELS 1 $8.99 $8.99
DETERGENT LIQUID 50 OZ 1 $11.99 $11.99
----------------------------------------------------------------
SUBTOTAL $67.93
TAX 8.5% $5.77
----------------------------------------------------------------
TOTAL $73.70

CASH $80.00
CHANGE $6.30

Thank you for shopping with us!
Please visit freshgroceries.example.com for savings.
No returns without original receipt.
Store Manager: John Smith
70
tests/fixtures/scanned/wer_gate_stub.rs
vendored
Normal file
@ -0,0 +1,70 @@
//! Stub for WER (Word Error Rate) gate test.
//!
//! This test will be implemented when the scanned PDF fixtures are fully generated.
//! It serves as a placeholder for the <3% WER Tier 1 OCR gate.

#[cfg(test)]
mod wer_gate_tests {
    // TODO: Implement WER calculation
    // TODO: Test each fixture against ground truth
    // TODO: Verify WER < 3% for clean 300-DPI scans
    // TODO: Verify processing time < 30s for 10-page fixture

    #[test]
    #[ignore = "Waiting for scanned PDF generation (bf-2he4t)"]
    fn test_receipt_300dpi_wer() {
        // pdftract extract tests/fixtures/scanned/receipt/receipt-300dpi.pdf --ocr --text
        // Compare output with receipt-300dpi.txt
        // Assert WER < 3%
    }

    #[test]
    #[ignore = "Waiting for scanned PDF generation (bf-2he4t)"]
    fn test_invoice_300dpi_wer() {
        // Similar to receipt test
    }

    #[test]
    #[ignore = "Waiting for scanned PDF generation (bf-2he4t)"]
    fn test_form_300dpi_wer() {
        // Similar to receipt test
    }

    #[test]
    #[ignore = "Waiting for scanned PDF generation (bf-2he4t)"]
    fn test_doc_10page_300dpi_wer() {
        // Multi-page test
        // Verify average WER < 3%
        // Verify no page exceeds 5% WER
    }

    #[test]
    #[ignore = "Waiting for scanned PDF generation (bf-2he4t)"]
    fn test_10page_performance() {
        // Verify processing time < 30s on 4-core CI
    }

    #[test]
    #[ignore = "Waiting for scanned PDF generation (bf-2he4t)"]
    fn test_as_02_scenario() {
        // AS-02: Extract a scanned receipt via OCR
        // Setup: receipt-300dpi.pdf
        // Action: pdftract extract receipt-300dpi.pdf --ocr --text
        // Verify: WER < 3%, total line present, latency < 30s
    }
}

// Helper functions to be implemented:

// fn calculate_wer(ground_truth: &str, hypothesis: &str) -> f64 {
//     // Implement Levenshtein distance-based WER calculation over words
//     // WER = (substitutions + insertions + deletions) / number of words in ground_truth
// }

// fn extract_text_from_pdf(pdf_path: &str) -> Result<String, Error> {
//     // Use pdftract CLI or library API
// }

// fn load_ground_truth(fixture_name: &str) -> String {
//     // Load from corresponding .txt file
// }
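The `calculate_wer` helper outlined in the comments could look roughly like the sketch below: a word-level Levenshtein distance divided by the number of reference words. Whitespace tokenization, the absence of case or punctuation normalization, and the empty-reference convention are assumptions made for illustration. The commit's `calculate_wer.py` may tokenize differently, so the two should be reconciled before the gate is enabled.

```rust
/// Word-level Levenshtein distance (substitutions, insertions, deletions each cost 1).
fn word_edit_distance(reference: &[&str], hypothesis: &[&str]) -> usize {
    // prev[j] holds the distance between the reference prefix processed so far
    // and the first j hypothesis words.
    let mut prev: Vec<usize> = (0..=hypothesis.len()).collect();
    for (i, ref_word) in reference.iter().enumerate() {
        // curr[0] = i + 1: deleting every reference word seen so far.
        let mut curr = vec![i + 1; hypothesis.len() + 1];
        for (j, hyp_word) in hypothesis.iter().enumerate() {
            let substitution = prev[j] + usize::from(ref_word != hyp_word);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        prev = curr;
    }
    prev[hypothesis.len()]
}

/// WER as a fraction (0.03 corresponds to the 3% gate), relative to the reference length.
fn calculate_wer(ground_truth: &str, hypothesis: &str) -> f64 {
    let reference: Vec<&str> = ground_truth.split_whitespace().collect();
    let hyp: Vec<&str> = hypothesis.split_whitespace().collect();
    if reference.is_empty() {
        // Convention chosen for this sketch: an empty reference scores 0.0 only
        // when the hypothesis is empty too.
        return if hyp.is_empty() { 0.0 } else { 1.0 };
    }
    word_edit_distance(&reference, &hyp) as f64 / reference.len() as f64
}
```

Because insertions count as errors, WER can exceed 1.0 when the OCR output is much longer than the reference, so a gate assertion should compare against the threshold rather than assume a 0 to 1 range.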