pdftract/notes/pdftract-3uq.md
jedarden 46c515e255 feat(pdftract-3uq): add font type classifier and subset prefix stripper
Implement FontKind enum and classify_font() function for Phase 2.1
font type detection. Includes strip_subset_prefix() for handling
font subset names (e.g., ABCDEF+Times-Roman).

FontKind variants:
- Type1, Type1Std14 (Standard 14)
- TrueType, OpenTypeCFF
- Type0, CIDFontType0, CIDFontType2
- Type3

Classifier reads /Subtype, /BaseFont, and for Type0 fonts, descendant
CIDFont subtype. OpenTypeCFF detected via /FontDescriptor /FontFile3
with /Subtype /OpenType.

All 27 font tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 13:42:57 -04:00


pdftract-3uq: Font subtype classifier and BaseFont prefix stripper

Summary

Implemented the font type classification module (crates/pdftract-core/src/font/mod.rs) with:

  1. FontKind enum - Represents all PDF font types:

    • Type1 - Non-Standard-14 Type 1 fonts
    • Type1Std14 - Standard 14 fonts (e.g. Times-Roman, Helvetica, Courier, Symbol, ZapfDingbats)
    • TrueType - TrueType fonts
    • Type0 - Composite fonts with descendant CIDFonts
    • CIDFontType0 - CFF-based CID fonts
    • CIDFontType2 - TrueType-based CID fonts
    • Type3 - Fonts whose glyphs are defined by content streams (including bitmap glyphs)
    • OpenTypeCFF - OpenType fonts with CFF data
  2. strip_subset_prefix(name: &str) -> &str - Removes 6-uppercase-letter subset prefix

    • Requires exactly 6 ASCII uppercase letters followed by +
    • Returns the name unchanged for invalid patterns (too short, lowercase, no prefix)
  3. classify_font(font_dict: &PdfDict) -> FontKind - Classifies fonts by:

    • Reading /Subtype to get base font type
    • Checking Standard 14 font names (with or without subset prefix)
    • For Type0 fonts, reading descendant CIDFont's /Subtype
    • Checking /FontDescriptor for /FontFile3 with /Subtype /OpenType to distinguish OpenTypeCFF
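
The prefix rule for strip_subset_prefix described above can be sketched as follows. This is a minimal illustration that assumes only the stated rule (six ASCII uppercase letters, then +); the body of the function in font/mod.rs may differ.

```rust
/// Strips a font-subset tag such as "ABCDEF+" from a BaseFont name.
/// Names that do not start with exactly six ASCII uppercase letters
/// followed by '+' are returned unchanged.
/// Sketch only: the real implementation may be written differently.
pub fn strip_subset_prefix(name: &str) -> &str {
    let bytes = name.as_bytes();
    // Need at least six letters plus the '+' separator.
    if bytes.len() >= 7
        && bytes[6] == b'+'
        && bytes[..6].iter().all(|b| b.is_ascii_uppercase())
    {
        // Slicing at byte 7 is safe: the first seven bytes are all ASCII.
        &name[7..]
    } else {
        name
    }
}
```

Working on bytes keeps the check cheap and avoids slicing inside a multi-byte character, since the slice is taken only after the first seven bytes are known to be ASCII.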

Acceptance Criteria Status

| Criterion | Status | Notes |
|---|---|---|
| Unit tests for all 8 FontKind branches | PASS | 21 font-specific tests cover all branches |
| strip_subset_prefix("ABCDEF+Times-Roman") == "Times-Roman" | PASS | Tested in test_strip_subset_prefix_valid |
| strip_subset_prefix("ABCD+Foo") == "ABCD+Foo" | PASS | Tested in test_strip_subset_prefix_too_short |
| strip_subset_prefix("abcdef+Foo") == "abcdef+Foo" | PASS | Tested in test_strip_subset_prefix_lowercase |
| Std-14 detection ignores subset prefix | PASS | Tested in test_is_standard_14_font and test_classify_font_type1_standard_with_subset |

Implementation Details

FontKind enum methods

  • is_standard_14() - Returns true for Type1Std14
  • is_cid_font() - Returns true for Type0, CIDFontType0, CIDFontType2
  • is_type3() - Returns true for Type3 fonts
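
As a rough picture of the enum and the three predicates listed above, the sketch below uses the variant and method names from this note. The derives and the by-value self receivers are assumptions, not confirmed details of font/mod.rs.

```rust
/// The eight font kinds named in this note.
/// The derive list is an assumption made so the sketch is easy to test.
#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FontKind {
    Type1,
    Type1Std14,
    TrueType,
    OpenTypeCFF,
    Type0,
    CIDFontType0,
    CIDFontType2,
    Type3,
}

impl FontKind {
    /// True only for the Standard 14 variant.
    pub fn is_standard_14(self) -> bool {
        self == FontKind::Type1Std14
    }

    /// True for the composite font and both CIDFont variants.
    pub fn is_cid_font(self) -> bool {
        matches!(
            self,
            FontKind::Type0 | FontKind::CIDFontType0 | FontKind::CIDFontType2
        )
    }

    /// True only for Type 3 fonts.
    pub fn is_type3(self) -> bool {
        self == FontKind::Type3
    }
}
```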

Standard 14 fonts

The hardcoded list includes all 14 canonical names:

  • Times family: Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
  • Helvetica family: Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
  • Courier family: Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
  • Symbol, ZapfDingbats
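
A lookup against this list could look like the sketch below. The constant name STANDARD_14 and the function name is_standard_14_font are assumptions (the latter is inferred from the test name test_is_standard_14_font), and the subset-tag check is inlined here only to keep the example self-contained.

```rust
/// The 14 canonical Standard 14 font names listed in this note.
/// The constant name is hypothetical.
pub const STANDARD_14: [&str; 14] = [
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
];

/// Returns true if `base_font` names a Standard 14 font, with or without
/// a subset tag such as "ABCDEF+". Sketch only; the function name is assumed.
pub fn is_standard_14_font(base_font: &str) -> bool {
    let bytes = base_font.as_bytes();
    // Drop the subset tag when the name starts with six ASCII uppercase
    // letters followed by '+'; otherwise compare the name as given.
    let name = if bytes.len() >= 7
        && bytes[6] == b'+'
        && bytes[..6].iter().all(|b| b.is_ascii_uppercase())
    {
        &base_font[7..]
    } else {
        base_font
    };
    STANDARD_14.contains(&name)
}
```

A linear scan over 14 entries is likely sufficient here; a set or a match expression would work equally well.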

Edge cases handled

  • /Subtype values written with or without a leading slash
  • Missing /Subtype (defaults to Type1)
  • Empty or missing /DescendantFonts array for Type0 fonts
  • Indirect references to FontDescriptor or DescendantFonts are not resolved (they are skipped and the default is returned)
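
The first two edge cases can be illustrated with a small helper. The name normalized_subtype is hypothetical, and the sketch assumes the subtype has already been read from the dictionary as an optional string; the real classifier presumably maps straight to a FontKind rather than returning a name.

```rust
/// Normalizes a /Subtype value before matching on it.
/// - A leading '/' is dropped, so "/TrueType" and "TrueType" compare equal.
/// - A missing subtype falls back to "Type1", mirroring the default above.
/// Hypothetical helper for illustration; not taken from font/mod.rs.
pub fn normalized_subtype(subtype: Option<&str>) -> &str {
    match subtype {
        Some(s) => s.strip_prefix('/').unwrap_or(s),
        None => "Type1",
    }
}
```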

Files Modified

  • crates/pdftract-core/src/lib.rs - Added pub mod font;
  • crates/pdftract-core/src/font/mod.rs - New module with FontKind enum and classifier functions

Testing

All 27 font-related tests pass:

  • 21 tests in font::tests
  • 6 tests in other modules that reference font types

Test coverage includes:

  • Subset prefix stripping (valid, invalid, edge cases)
  • Standard 14 font detection (with and without prefix)
  • All 8 FontKind variants
  • Type0 with CIDFont descendants
  • OpenTypeCFF detection via FontDescriptor

Commit

A git commit will follow, with a conventional commit message citing this bead.