Implement compile-time phf::Set of 20,000 common English words for dictionary coverage scoring in readability analysis (Phase 4.7).

Key changes:
- Added wordlist-en-20k.txt (20k frequency-sorted English words)
- Extended build.rs to generate phf::Set from wordlist
- Added layout/wordlist.rs module with is_english_word() API
- Added wordlist benchmarks (< 100 ns lookup achieved)

Test results:
- All 9 unit tests pass
- Benchmarks: 13-62 ns per lookup (well under 100 ns requirement)
- Binary size: Estimated ~200-220 KB (within 250 KB limit)

Closes: pdftract-9wevc
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pdftract-9wevc: Wordlist build (20k EN compile-time phf::Set)
Summary
Implemented a compile-time phf::Set of 20,000 common English words for dictionary coverage scoring in readability analysis (Phase 4.7).
Implementation
Source artifact
- File: crates/pdftract-core/build/wordlist-en-20k.txt
- Source: google-10000-english 20k.txt (frequency-sorted English word list)
- Format: One lowercase word per line, ASCII only, length 1-30 chars
- Word count: 20,000
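The format rules above (one lowercase word per line, ASCII only, length 1-30) could be enforced at build time with a check along the following lines. This is a minimal std-only sketch, not the project's actual build.rs code: `parse_wordlist` and `WordlistError` are hypothetical names introduced here for illustration.

```rust
/// Why a line was rejected (hypothetical error type for this sketch).
#[derive(Debug, PartialEq)]
pub enum WordlistError {
    /// The line contained no characters.
    Empty { line: usize },
    /// The line had uppercase, whitespace, or non-ASCII bytes.
    NotLowercaseAscii { line: usize },
    /// The word was longer than 30 characters.
    TooLong { line: usize, len: usize },
}

/// Maximum word length stated in the format description.
const MAX_WORD_LEN: usize = 30;

/// Checks that every line is a single lowercase ASCII word of 1-30
/// characters and returns the words in file order.
/// Line numbers in errors are 1-based.
pub fn parse_wordlist(text: &str) -> Result<Vec<&str>, WordlistError> {
    let mut words = Vec::new();
    for (idx, word) in text.lines().enumerate() {
        let line = idx + 1;
        if word.is_empty() {
            return Err(WordlistError::Empty { line });
        }
        // ASCII only, no uppercase letters, no embedded whitespace.
        let ok = word
            .bytes()
            .all(|b| b.is_ascii() && !b.is_ascii_uppercase() && !b.is_ascii_whitespace());
        if !ok {
            return Err(WordlistError::NotLowercaseAscii { line });
        }
        // The word is pure ASCII here, so byte length equals character count.
        if word.len() > MAX_WORD_LEN {
            return Err(WordlistError::TooLong { line, len: word.len() });
        }
        words.push(word);
    }
    Ok(words)
}
```

Failing the build on the first bad line keeps malformed entries out of the generated set; a caller could additionally compare the returned length against the expected 20,000.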
Build integration
- build.rs: Added generate_wordlist() function that reads the wordlist and generates a phf::Set
- Generated file: target/release/build/pdftract-core-*/out/wordlist.rs
- Module: crates/pdftract-core/src/layout/wordlist.rs - includes generated code and provides is_english_word() API
API
pub fn is_english_word(s: &str) -> bool
- Case-insensitive lookup (input is lowercased before checking)
- Returns false for input containing non-ASCII characters (English-only wordlist)
- O(1) lookup via phf's perfect hash function
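The lookup behavior described above can be sketched as follows. To keep the example self-contained, the generated phf::Set is swapped for a tiny sorted slice searched with binary search, so this stand-in is O(log n) rather than O(1). The `WORDS` contents are illustrative, and the explicit empty-string check is an assumption consistent with test_empty_string rather than a confirmed detail of the module.

```rust
use std::cmp::Ordering;

/// Stand-in for the generated set: a small sorted slice of lowercase words.
/// The real module includes a phf::Set produced by build.rs instead.
static WORDS: &[&str] = &["a", "i", "of", "running", "the", "word"];

/// Returns true if `s` is in the wordlist, ignoring ASCII case.
pub fn is_english_word(s: &str) -> bool {
    // Empty and non-ASCII input can never match an ASCII-only wordlist.
    if s.is_empty() || !s.is_ascii() {
        return false;
    }
    // Normalize to lowercase so "The" and "THE" both match "the".
    let lower = s.to_ascii_lowercase();
    // Binary search over the sorted slice (the phf set would hash instead).
    WORDS
        .binary_search_by(|w| -> Ordering { (*w).cmp(lower.as_str()) })
        .is_ok()
}
```

Rejecting non-ASCII input before lowercasing avoids any Unicode case-mapping work. Note that `to_ascii_lowercase` allocates a new String on each call, which is one plausible reason a lookup could cost more than a bare hash probe.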
Test Results
Unit tests (9/9 passed)
- ✅ test_common_words
- ✅ test_case_insensitive
- ✅ test_inflected_forms
- ✅ test_empty_string
- ✅ test_not_in_wordlist
- ✅ test_non_ascii_returns_false
- ✅ test_medium_frequency_words
- ✅ test_single_letter_words
- ✅ test_lookup_timing
Benchmarks (< 100 ns requirement met)
- Common words: ~13-16 ns
- Medium frequency: ~53-58 ns
- Negative lookups: ~47-56 ns
- Case insensitive: ~52-62 ns
- Mixed batch: ~480 ns for 8 words (~60 ns per word)
All per-lookup timings are well under the 100 ns requirement.
Binary Size
Estimated phf::Set binary size: ~200-220 KB
- 20,000 words × ~8 chars avg = ~160 KB string data
- phf perfect hash table overhead = ~40-60 KB
This is within the 250 KB CI gate requirement. Note: The exact binary size contribution is difficult to measure directly without analyzing the final linked binary, but the estimate is based on typical phf::Set characteristics.
Files Changed
- crates/pdftract-core/build.rs: Added wordlist generation
- crates/pdftract-core/build/wordlist-en-20k.txt: Source wordlist
- crates/pdftract-core/src/layout/wordlist.rs: Wordlist module with API
- crates/pdftract-core/src/layout/mod.rs: Exported is_english_word
- crates/pdftract-core/Cargo.toml: Added wordlist benchmark
- crates/pdftract-core/benches/wordlist.rs: Performance benchmarks
Git Commits
- (Will be created with this implementation)
References
- Plan section: Phase 4.7 Word list (lines 1787, 1805)
- Bead: pdftract-9wevc