Implement FontKind enum and classify_font() function for Phase 2.1 font type detection.

Includes strip_subset_prefix() for handling font subset names (e.g., ABCDEF+Times-Roman).

FontKind variants:
- Type1, Type1Std14 (Standard 14)
- TrueType, OpenTypeCFF
- Type0, CIDFontType0, CIDFontType2
- Type3

Classifier reads /Subtype, /BaseFont, and for Type0 fonts, descendant CIDFont subtype. OpenTypeCFF detected via /FontDescriptor /FontFile3 with /Subtype /OpenType.

All 27 font tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-3uq: Font subtype classifier and BaseFont prefix stripper
Summary
Implemented the font type classification module (crates/pdftract-core/src/font/mod.rs) with:
- `FontKind` enum - Represents all PDF font types:
  - `Type1` - Non-Standard-14 Type 1 fonts
  - `Type1Std14` - Standard 14 fonts (Times-Roman, Helvetica, Courier, Symbol, ZapfDingbats)
  - `TrueType` - TrueType fonts
  - `Type0` - Composite fonts with descendant CIDFonts
  - `CIDFontType0` - CFF-based CID fonts
  - `CIDFontType2` - TrueType-based CID fonts
  - `Type3` - Bitmap/content-stream defined fonts
  - `OpenTypeCFF` - OpenType fonts with CFF data
- `strip_subset_prefix(name: &str) -> &str` - Removes the 6-uppercase-letter subset prefix
  - Validates exactly 6 ASCII uppercase letters followed by `+`
  - Returns the name unchanged for invalid patterns (too short, lowercase, no prefix)
- `classify_font(font_dict: &PdfDict) -> FontKind` - Classifies fonts by:
  - Reading `/Subtype` to get the base font type
  - Checking Standard 14 font names (with or without subset prefix)
  - For Type0 fonts, reading the descendant CIDFont's `/Subtype`
  - Checking `/FontDescriptor` for `/FontFile3` with `/Subtype /OpenType` to distinguish OpenTypeCFF
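The prefix rule described above can be sketched roughly as follows. This is a minimal illustration written from the description and the signature in the summary, not the code in `font/mod.rs`; the byte-level approach is an assumption.

```rust
/// Strips a PDF subset tag (six ASCII uppercase letters followed by `+`)
/// from a BaseFont name. Any other input is returned unchanged.
/// Sketch only: the real implementation may be structured differently.
pub fn strip_subset_prefix(name: &str) -> &str {
    let bytes = name.as_bytes();
    // A prefix needs at least "ABCDEF+" (7 bytes) at the start of the name.
    let has_prefix = bytes.len() >= 7
        && bytes[6] == b'+'
        && bytes[..6].iter().all(|b| b.is_ascii_uppercase());
    if has_prefix {
        // Slicing at byte 7 is safe: the first seven bytes are all ASCII.
        &name[7..]
    } else {
        name
    }
}
```

With this sketch, `strip_subset_prefix("ABCDEF+Times-Roman")` yields `"Times-Roman"`, while `"ABCD+Foo"` and `"abcdef+Foo"` come back unchanged.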
Acceptance Criteria Status
| Criterion | Status | Notes |
|---|---|---|
| Unit tests for all 8 FontKind branches | PASS | 21 font-specific tests cover all branches |
| `strip_subset_prefix("ABCDEF+Times-Roman") == "Times-Roman"` | PASS | Tested in test_strip_subset_prefix_valid |
| `strip_subset_prefix("ABCD+Foo") == "ABCD+Foo"` | PASS | Tested in test_strip_subset_prefix_too_short |
| `strip_subset_prefix("abcdef+Foo") == "abcdef+Foo"` | PASS | Tested in test_strip_subset_prefix_lowercase |
| Std-14 detection ignores subset prefix | PASS | Tested in test_is_standard_14_font and test_classify_font_type1_standard_with_subset |
Implementation Details
FontKind enum methods
- `is_standard_14()` - Returns true for Type1Std14
- `is_cid_font()` - Returns true for Type0, CIDFontType0, CIDFontType2
- `is_type3()` - Returns true for Type3 fonts
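A possible shape for the enum and its predicate methods, based only on the variant and method names listed in this note. The derives and the by-value `self` receivers are assumptions.

```rust
/// Sketch of the font classification enum; variant names follow this note.
/// The derive list is an assumption, not confirmed by the source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FontKind {
    Type1,
    Type1Std14,
    TrueType,
    OpenTypeCFF,
    Type0,
    CIDFontType0,
    CIDFontType2,
    Type3,
}

impl FontKind {
    /// True only for the Standard 14 variant.
    pub fn is_standard_14(self) -> bool {
        matches!(self, FontKind::Type1Std14)
    }

    /// True for composite fonts and both CIDFont flavours.
    pub fn is_cid_font(self) -> bool {
        matches!(
            self,
            FontKind::Type0 | FontKind::CIDFontType0 | FontKind::CIDFontType2
        )
    }

    /// True only for Type 3 fonts.
    pub fn is_type3(self) -> bool {
        matches!(self, FontKind::Type3)
    }
}
```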
Standard 14 fonts
The hardcoded list includes all 14 canonical names:
- Times family: Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
- Helvetica family: Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
- Courier family: Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
- Symbol, ZapfDingbats
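The name check could look something like the sketch below. The function name `is_standard_14_font` is inferred from the test name `test_is_standard_14_font`, and the constant name `STANDARD_14` is hypothetical. A compact `split_once`-based prefix stripper is inlined only to keep the example self-contained.

```rust
/// The 14 canonical Standard 14 BaseFont names listed above.
/// `STANDARD_14` is a hypothetical constant name.
const STANDARD_14: [&str; 14] = [
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
];

/// Compact stand-in for the module's prefix stripper: drops the tag only
/// when it is exactly six ASCII uppercase letters before the first `+`.
fn strip_subset_prefix(name: &str) -> &str {
    match name.split_once('+') {
        Some((tag, rest))
            if tag.len() == 6 && tag.bytes().all(|b| b.is_ascii_uppercase()) =>
        {
            rest
        }
        _ => name,
    }
}

/// Returns true if the BaseFont name, with any subset prefix removed,
/// is one of the Standard 14 names. Function name is an assumption.
pub fn is_standard_14_font(base_font: &str) -> bool {
    STANDARD_14.contains(&strip_subset_prefix(base_font))
}
```

A linear scan over 14 entries is cheap enough that a hash set is unlikely to be worth the extra setup.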
Edge cases handled
- `/Subtype` with or without leading slash
- Missing `/Subtype` (defaults to Type1)
- Empty or missing `/DescendantFonts` array for Type0 fonts
- Indirect references to FontDescriptor or DescendantFonts (skipped, returns default)
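To illustrate how these edge cases might fall out of the classifier, here is a simplified sketch. The `Obj` and `Dict` types are hypothetical stand-ins for the crate's real `PdfDict` model, and Standard 14 detection is left out for brevity. Several behaviours are assumptions rather than confirmed ones: that "default" means `Type0` for an unresolved descendant, that a Type0 font is reported as its descendant's CIDFont kind, and that the OpenType check applies to simple fonts only.

```rust
use std::collections::HashMap;

/// Hypothetical, minimal PDF object model (not the crate's PdfDict).
#[derive(Debug, Clone)]
pub enum Obj {
    Name(String),
    Dict(HashMap<String, Obj>),
    Array(Vec<Obj>),
    Ref(u32), // unresolved indirect reference
}
pub type Dict = HashMap<String, Obj>;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FontKind {
    Type1, Type1Std14, TrueType, OpenTypeCFF,
    Type0, CIDFontType0, CIDFontType2, Type3,
}

/// Reads a name value, tolerating an optional leading slash.
fn name_of<'a>(dict: &'a Dict, key: &str) -> Option<&'a str> {
    match dict.get(key) {
        Some(Obj::Name(n)) => Some(n.strip_prefix('/').unwrap_or(n.as_str())),
        _ => None,
    }
}

/// Reads a direct dictionary value; indirect references count as absent.
fn dict_of<'a>(dict: &'a Dict, key: &str) -> Option<&'a Dict> {
    match dict.get(key) {
        Some(Obj::Dict(d)) => Some(d),
        _ => None,
    }
}

/// True if /FontDescriptor /FontFile3 carries /Subtype /OpenType.
fn has_opentype_font_file(font: &Dict) -> bool {
    dict_of(font, "FontDescriptor")
        .and_then(|desc| dict_of(desc, "FontFile3"))
        .and_then(|ff3| name_of(ff3, "Subtype"))
        == Some("OpenType")
}

/// Missing, empty, or indirect /DescendantFonts all stay at Type0 (assumed).
fn classify_type0(font: &Dict) -> FontKind {
    let first = match font.get("DescendantFonts") {
        Some(Obj::Array(items)) => items.first(),
        _ => None,
    };
    match first {
        Some(Obj::Dict(cid)) => match name_of(cid, "Subtype") {
            Some("CIDFontType0") => FontKind::CIDFontType0,
            Some("CIDFontType2") => FontKind::CIDFontType2,
            _ => FontKind::Type0,
        },
        _ => FontKind::Type0,
    }
}

/// Edge-case-focused classifier sketch; Standard 14 detection is omitted.
pub fn classify_font(font: &Dict) -> FontKind {
    match name_of(font, "Subtype") {
        Some("Type0") => classify_type0(font),
        Some("Type3") => FontKind::Type3,
        Some("CIDFontType0") => FontKind::CIDFontType0,
        Some("CIDFontType2") => FontKind::CIDFontType2,
        _ if has_opentype_font_file(font) => FontKind::OpenTypeCFF,
        Some("TrueType") => FontKind::TrueType,
        _ => FontKind::Type1, // missing or unrecognized /Subtype
    }
}
```

Treating an indirect reference as "absent" keeps the classifier total without needing an object resolver, at the cost of a less specific result for such fonts.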
Files Modified
- `crates/pdftract-core/src/lib.rs` - Added `pub mod font;`
- `crates/pdftract-core/src/font/mod.rs` - New module with FontKind enum and classifier functions
Testing
All 27 font-related tests pass:
- 21 tests in font::tests
- 6 tests in other modules that reference font types
Test coverage includes:
- Subset prefix stripping (valid, invalid, edge cases)
- Standard 14 font detection (with and without prefix)
- All 8 FontKind variants
- Type0 with CIDFont descendants
- OpenTypeCFF detection via FontDescriptor
Commit
A git commit will follow, with a conventional commit message citing this bead.