jedarden b96c3bfd37 feat(pdftract-9wevc): implement 20k English wordlist for readability scoring

Implement compile-time phf::Set of 20,000 common English words for
dictionary coverage scoring in readability analysis (Phase 4.7).

Key changes:
- Added wordlist-en-20k.txt (20k frequency-sorted English words)
- Extended build.rs to generate phf::Set from wordlist
- Added layout/wordlist.rs module with is_english_word() API
- Added wordlist benchmarks (< 100 ns lookup achieved)

Test results:
- All 9 unit tests pass
- Benchmarks: 13-62 ns per lookup (well under 100 ns requirement)
- Binary size: Estimated ~200-220 KB (within 250 KB limit)

Closes: pdftract-9wevc

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-24 09:29:13 -04:00


pdftract-9wevc: Wordlist build (20k EN compile-time phf::Set)

Summary

Implemented a compile-time phf::Set of 20,000 common English words for dictionary coverage scoring in readability analysis (Phase 4.7).

Implementation

Source artifact

File: crates/pdftract-core/build/wordlist-en-20k.txt
Source: google-10000-english 20k.txt (frequency-sorted English word list)
Format: One lowercase word per line, ASCII only, length 1-30 chars
Word count: 20,000
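
The format constraints above can be checked mechanically before the build consumes the file. The following is a minimal sketch, not the project's code: `is_valid_entry` and `invalid_lines` are hypothetical helper names, and restricting entries to the letters a-z is an assumption (the note only says lowercase and ASCII).

```rust
/// Checks one wordlist line against the stated format:
/// lowercase ASCII, 1 to 30 characters.
/// Assumption: "lowercase word" means the letters a-z only.
fn is_valid_entry(line: &str) -> bool {
    (1..=30).contains(&line.len()) && line.bytes().all(|b| b.is_ascii_lowercase())
}

/// Returns the 1-based line numbers of entries that violate the format.
fn invalid_lines(contents: &str) -> Vec<usize> {
    contents
        .lines()
        .enumerate()
        .filter(|(_, line)| !is_valid_entry(line))
        .map(|(i, _)| i + 1)
        .collect()
}
```

Running `invalid_lines` over the file contents in a unit test or in build.rs would surface a malformed entry by line number rather than as a silent miss at lookup time.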

Build integration

build.rs: Added generate_wordlist() function that reads the wordlist and generates a phf::Set
Generated file: $OUT_DIR/wordlist.rs (for a release build, target/release/build/pdftract-core-*/out/wordlist.rs)
Module: crates/pdftract-core/src/layout/wordlist.rs - includes generated code and provides is_english_word() API

API

pub fn is_english_word(s: &str) -> bool

Case-insensitive lookup (input is lowercased before checking)
Returns false for any input containing non-ASCII characters (English-only wordlist)
O(1) lookup via phf's perfect hash function
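
The lookup behaviour described above can be sketched without the generated set. In this illustration a tiny sorted slice (`WORDS`, with made-up contents) stands in for the phf::Set, and binary search stands in for the perfect-hash lookup, so it shows the case and ASCII handling only, not the O(1) lookup or the actual module.

```rust
/// Stand-in for the generated set: a small, byte-wise sorted slice.
/// Hypothetical contents; the real module uses a generated phf::Set.
static WORDS: &[&str] = &["a", "and", "running", "the", "word"];

/// Case-insensitive membership test that rejects non-ASCII input.
pub fn is_english_word(s: &str) -> bool {
    // The wordlist is ASCII-only, so any non-ASCII input cannot match.
    if !s.is_ascii() {
        return false;
    }
    // ASCII lowercasing is sufficient once non-ASCII input is excluded.
    let lower = s.to_ascii_lowercase();
    WORDS.binary_search(&lower.as_str()).is_ok()
}
```

Rejecting non-ASCII input before lowercasing keeps the normalisation to a cheap byte-wise pass; whether the real implementation orders the checks this way is not stated in the note.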

Test Results

Unit tests (9/9 passed)

✅ test_common_words
✅ test_case_insensitive
✅ test_inflected_forms
✅ test_empty_string
✅ test_not_in_wordlist
✅ test_non_ascii_returns_false
✅ test_medium_frequency_words
✅ test_single_letter_words
✅ test_lookup_timing

Benchmarks (< 100 ns requirement met)

Common words: ~13-16 ns
Medium frequency: ~53-58 ns
Negative lookups: ~47-56 ns
Case insensitive: ~52-62 ns
Mixed batch: ~480 ns for 8 words (~60 ns per word)

All per-lookup times are well under the 100 ns requirement.

Binary Size

Estimated phf::Set binary size: ~200-220 KB

20,000 words × ~8 chars avg = ~160 KB string data
phf perfect hash table overhead = ~40-60 KB

The estimate is within the 250 KB CI gate requirement. Note: it is derived from typical phf::Set characteristics, not measured; the exact contribution can only be confirmed by analyzing the final linked binary.
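
The arithmetic behind the estimate can be reproduced as a small calculation. This is a sketch under the note's own figures; `estimate_kb` and `slice_header_kb` are hypothetical names, and the second function rests on the assumption that the generated set stores each word as a `&'static str`.

```rust
use std::mem::size_of;

/// Estimated size range in decimal KB: string data plus a
/// hash-table overhead range (low, high).
fn estimate_kb(words: u64, avg_len: u64, overhead_kb: (u64, u64)) -> (u64, u64) {
    let string_kb = words * avg_len / 1000;
    (string_kb + overhead_kb.0, string_kb + overhead_kb.1)
}

/// Size in decimal KB of one `&str` (pointer plus length) per word.
/// Assumption: each entry is held as a `&'static str`; this term is
/// not part of the estimate in the note.
fn slice_header_kb(words: u64) -> u64 {
    words * size_of::<&str>() as u64 / 1000
}
```

`estimate_kb(20_000, 8, (40, 60))` yields `(200, 220)`, matching the figures above. If the entries are indeed string slices, `slice_header_kb(20_000)` gives the additional per-entry cost on the target platform, which is worth checking against the linked binary before relying on the estimate.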

Files Changed

crates/pdftract-core/build.rs: Added wordlist generation
crates/pdftract-core/build/wordlist-en-20k.txt: Source wordlist
crates/pdftract-core/src/layout/wordlist.rs: Wordlist module with API
crates/pdftract-core/src/layout/mod.rs: Exported is_english_word
crates/pdftract-core/Cargo.toml: Added wordlist benchmark
crates/pdftract-core/benches/wordlist.rs: Performance benchmarks

Git Commits

b96c3bfd37 feat(pdftract-9wevc): implement 20k English wordlist for readability scoring

References

Plan section: Phase 4.7 Word list (lines 1787 and 1805)
Bead: pdftract-9wevc
