pdftract/notes/pdftract-3uq.md
jedarden 46c515e255 feat(pdftract-3uq): add font type classifier and subset prefix stripper
Implement FontKind enum and classify_font() function for Phase 2.1
font type detection. Includes strip_subset_prefix() for handling
font subset names (e.g., ABCDEF+Times-Roman).

FontKind variants:
- Type1, Type1Std14 (Standard 14)
- TrueType, OpenTypeCFF
- Type0, CIDFontType0, CIDFontType2
- Type3

Classifier reads /Subtype, /BaseFont, and for Type0 fonts, descendant
CIDFont subtype. OpenTypeCFF detected via /FontDescriptor /FontFile3
with /Subtype /OpenType.

All 27 font tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 13:42:57 -04:00


# pdftract-3uq: Font subtype classifier and BaseFont prefix stripper
## Summary
Implemented the font type classification module (`crates/pdftract-core/src/font/mod.rs`) with:
1. **`FontKind` enum** - Represents all PDF font types:
   - `Type1` - Non-Standard-14 Type 1 fonts
   - `Type1Std14` - Standard 14 fonts (e.g., Times-Roman, Helvetica, Courier, Symbol, ZapfDingbats)
   - `TrueType` - TrueType fonts
   - `Type0` - Composite fonts with descendant CIDFonts
   - `CIDFontType0` - CFF-based CID fonts
   - `CIDFontType2` - TrueType-based CID fonts
   - `Type3` - Fonts with glyphs defined by content streams (including bitmap glyphs)
   - `OpenTypeCFF` - OpenType fonts with CFF data
2. **`strip_subset_prefix(name: &str) -> &str`** - Removes the six-uppercase-letter subset prefix
   - Validates exactly 6 ASCII uppercase letters followed by `+`
   - Returns the input unchanged for invalid patterns (tag too short, lowercase letters, no prefix)
3. **`classify_font(font_dict: &PdfDict) -> FontKind`** - Classifies fonts by:
   - Reading `/Subtype` to get the base font type
   - Checking `/BaseFont` against the Standard 14 font names (with or without subset prefix)
   - For Type0 fonts, reading the descendant CIDFont's `/Subtype`
   - Checking `/FontDescriptor` for `/FontFile3` with `/Subtype /OpenType` to distinguish OpenTypeCFF
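
The classification steps listed above can be sketched as a single decision function. This is a minimal sketch, not the module's code: `FontFacts` and `classify` are hypothetical names standing in for the values that `classify_font` reads from a `PdfDict`. The order of checks is also an assumption that these notes do not spell out: the descendant subtype refines a Type0 result, the OpenType check runs before the Standard 14 check, and unrecognized subtypes fall back to `Type1`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FontKind {
    Type1,
    Type1Std14,
    TrueType,
    OpenTypeCFF,
    Type0,
    CIDFontType0,
    CIDFontType2,
    Type3,
}

/// Hypothetical, already-extracted view of a font dictionary.
/// The real classifier reads these values from a `PdfDict`.
#[derive(Debug, Default)]
pub struct FontFacts<'a> {
    /// Value of /Subtype without the leading slash, if present.
    pub subtype: Option<&'a str>,
    /// True when /BaseFont matches a Standard 14 name after prefix stripping.
    pub is_standard_14: bool,
    /// /Subtype of the first descendant CIDFont, if one could be read.
    pub descendant_subtype: Option<&'a str>,
    /// True when /FontDescriptor has /FontFile3 with /Subtype /OpenType.
    pub has_opentype_font_file3: bool,
}

/// Assumed decision order; the notes do not state the precedence.
pub fn classify(facts: &FontFacts<'_>) -> FontKind {
    match facts.subtype {
        // Composite font: refine by descendant subtype when it is known.
        Some("Type0") => match facts.descendant_subtype {
            Some("CIDFontType0") => FontKind::CIDFontType0,
            Some("CIDFontType2") => FontKind::CIDFontType2,
            _ => FontKind::Type0,
        },
        Some("Type3") => FontKind::Type3,
        // Simple font with an embedded OpenType program.
        _ if facts.has_opentype_font_file3 => FontKind::OpenTypeCFF,
        Some("TrueType") => FontKind::TrueType,
        _ if facts.is_standard_14 => FontKind::Type1Std14,
        // Missing /Subtype defaults to Type1, as noted under edge cases.
        _ => FontKind::Type1,
    }
}
```

Separating the extracted facts from the decision keeps the precedence rules testable without constructing PDF objects.
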
## Acceptance Criteria Status
| Criterion | Status | Notes |
|-----------|--------|-------|
| Unit tests for all 8 FontKind branches | PASS | 21 font-specific tests cover all branches |
| `strip_subset_prefix("ABCDEF+Times-Roman") == "Times-Roman"` | PASS | Tested in `test_strip_subset_prefix_valid` |
| `strip_subset_prefix("ABCD+Foo") == "ABCD+Foo"` | PASS | Tested in `test_strip_subset_prefix_too_short` |
| `strip_subset_prefix("abcdef+Foo") == "abcdef+Foo"` | PASS | Tested in `test_strip_subset_prefix_lowercase` |
| Std-14 detection ignores subset prefix | PASS | Tested in `test_is_standard_14_font` and `test_classify_font_type1_standard_with_subset` |
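
One way to satisfy the three `strip_subset_prefix` criteria in the table is a byte-level check on the first seven characters. The signature matches the one given in the summary, but the body below is an illustrative sketch and may differ from the code in `font/mod.rs`.

```rust
/// Strips a font subset tag: six ASCII uppercase letters followed by `+`.
/// Returns the input unchanged when the tag is absent or malformed.
pub fn strip_subset_prefix(name: &str) -> &str {
    let bytes = name.as_bytes();
    // A valid tag occupies bytes 0..6, with the `+` at index 6.
    if bytes.len() > 6
        && bytes[6] == b'+'
        && bytes[..6].iter().all(|b| b.is_ascii_uppercase())
    {
        // Slicing at 7 is safe: the first seven bytes are all ASCII.
        &name[7..]
    } else {
        name
    }
}
```

Because the function returns a sub-slice of its input, it allocates nothing, and callers can compare the result directly against known font names.
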
## Implementation Details
### FontKind enum methods
- `is_standard_14()` - Returns true for Type1Std14
- `is_cid_font()` - Returns true for Type0, CIDFontType0, CIDFontType2
- `is_type3()` - Returns true for Type3 fonts
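
The three predicates map naturally onto the `matches!` macro. The sketch below uses the variant and method names from these notes. The derives and the by-value `self` receivers are assumptions, not details confirmed by the source.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FontKind {
    Type1,
    Type1Std14,
    TrueType,
    OpenTypeCFF,
    Type0,
    CIDFontType0,
    CIDFontType2,
    Type3,
}

impl FontKind {
    /// True only for the Standard 14 variant.
    pub fn is_standard_14(self) -> bool {
        matches!(self, FontKind::Type1Std14)
    }

    /// True for composite fonts and both CIDFont flavours.
    pub fn is_cid_font(self) -> bool {
        matches!(
            self,
            FontKind::Type0 | FontKind::CIDFontType0 | FontKind::CIDFontType2
        )
    }

    /// True only for Type 3 fonts.
    pub fn is_type3(self) -> bool {
        matches!(self, FontKind::Type3)
    }
}
```
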
### Standard 14 fonts
The hardcoded list includes all 14 canonical names:
- Times family: Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
- Helvetica family: Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
- Courier family: Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
- Symbol, ZapfDingbats
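
A lookup against this list might look like the sketch below. The `is_standard_14_font` name is inferred from the test name `test_is_standard_14_font`, and its signature is an assumption. The `STANDARD_14` constant is hypothetical, and the compact `split_once`-based prefix stripper is just one possible formulation.

```rust
/// The 14 canonical names, grouped as in the list above.
const STANDARD_14: [&str; 14] = [
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
];

/// Drops a subset tag (six ASCII uppercase letters, then `+`) if present.
fn strip_subset_prefix(name: &str) -> &str {
    match name.split_once('+') {
        Some((tag, rest))
            if tag.len() == 6 && tag.bytes().all(|b| b.is_ascii_uppercase()) =>
        {
            rest
        }
        _ => name,
    }
}

/// Assumed signature: takes the raw /BaseFont name, with or without a subset tag.
pub fn is_standard_14_font(base_font: &str) -> bool {
    let name = strip_subset_prefix(base_font);
    STANDARD_14.iter().any(|&known| known == name)
}
```

A linear scan over 14 entries is cheap enough that a hash set is unlikely to be worth the setup cost.
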
### Edge cases handled
- `/Subtype` values written with or without a leading slash
- Missing `/Subtype` (defaults to `Type1`)
- Empty or missing `/DescendantFonts` array for Type0 fonts
- Indirect references to `/FontDescriptor` or `/DescendantFonts` (not resolved; the classifier returns the default)
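
The first two edge cases can be handled in one small normalization step. In the sketch below, `effective_subtype` is a hypothetical helper name, and the raw `/Subtype` value is assumed to arrive as an optional string slice rather than as a `PdfDict` entry.

```rust
/// Normalizes a raw /Subtype value.
/// - Strips one leading slash if present.
/// - Falls back to "Type1" when the key is missing.
pub fn effective_subtype(raw: Option<&str>) -> &str {
    match raw {
        Some(value) => value.strip_prefix('/').unwrap_or(value),
        None => "Type1",
    }
}
```

Normalizing once up front means the later classification steps only ever compare against slash-free names.
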
## Files Modified
- `crates/pdftract-core/src/lib.rs` - Added `pub mod font;`
- `crates/pdftract-core/src/font/mod.rs` - New module with FontKind enum and classifier functions
## Testing
All 27 font-related tests pass:
- 21 tests in `font::tests`
- 6 tests in other modules that reference font types
Test coverage includes:
- Subset prefix stripping (valid, invalid, edge cases)
- Standard 14 font detection (with and without prefix)
- All 8 FontKind variants
- Type0 with CIDFont descendants
- OpenTypeCFF detection via FontDescriptor
## Commit
A `git commit` will follow, with a conventional commit message citing this bead.