pdftract/notes/pdftract-9wevc.md
jedarden b96c3bfd37 feat(pdftract-9wevc): implement 20k English wordlist for readability scoring
Implement compile-time phf::Set of 20,000 common English words for
dictionary coverage scoring in readability analysis (Phase 4.7).

Key changes:
- Added wordlist-en-20k.txt (20k frequency-sorted English words)
- Extended build.rs to generate phf::Set from wordlist
- Added layout/wordlist.rs module with is_english_word() API
- Added wordlist benchmarks (< 100 ns lookup achieved)

Test results:
- All 9 unit tests pass
- Benchmarks: 13-62 ns per lookup (well under 100 ns requirement)
- Binary size: Estimated ~200-220 KB (within 250 KB limit)

Closes: pdftract-9wevc

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 09:29:13 -04:00

# pdftract-9wevc: Wordlist build (20k EN compile-time phf::Set)
## Summary
Implemented a compile-time `phf::Set` of 20,000 common English words for dictionary coverage scoring in readability analysis (Phase 4.7).
## Implementation
### Source artifact
- **File**: `crates/pdftract-core/build/wordlist-en-20k.txt`
- **Source**: google-10000-english 20k.txt (frequency-sorted English word list)
- **Format**: One lowercase word per line, ASCII only, length 1-30 chars
- **Word count**: 20,000
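
The format constraints above are simple enough to check mechanically before the list is handed to the build script. The following is a minimal sketch, not code from the repository: `validate_wordlist` is a hypothetical helper, and it assumes "one word per line" also rules out embedded whitespace.

```rust
/// Hypothetical helper: checks that every line is one lowercase ASCII word
/// of 1-30 characters, returning the words in file order.
pub fn validate_wordlist(text: &str) -> Result<Vec<&str>, String> {
    let mut words = Vec::new();
    for (idx, line) in text.lines().enumerate() {
        let n = idx + 1; // 1-based line number for error messages
        if !line.is_ascii() {
            return Err(format!("line {n}: non-ASCII character"));
        }
        // The line is ASCII here, so byte length equals character count.
        if line.is_empty() || line.len() > 30 {
            return Err(format!("line {n}: length {} outside 1-30", line.len()));
        }
        // Assumption: uppercase letters and whitespace are both invalid.
        if line
            .bytes()
            .any(|b| b.is_ascii_uppercase() || b.is_ascii_whitespace())
        {
            return Err(format!("line {n}: uppercase letter or whitespace"));
        }
        words.push(line);
    }
    Ok(words)
}
```

A caller could additionally compare the returned vector's length against the expected 20,000 entries.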
### Build integration
- **build.rs**: Added `generate_wordlist()` function that reads the wordlist and generates a `phf::Set`
- **Generated file**: `wordlist.rs` in Cargo's `OUT_DIR` (for a release build, `target/release/build/pdftract-core-*/out/wordlist.rs`)
- **Module**: `crates/pdftract-core/src/layout/wordlist.rs`, which includes the generated code and provides the `is_english_word()` API
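
The build-time flow described above (read the list, render Rust source, write it where the module can include it) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's `generate_wordlist()`: the function names are hypothetical, and the sketch emits a sorted `&[&str]` slice instead of a `phf::Set`, because these notes do not show how the real set is generated.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical renderer: turns a newline-separated wordlist into Rust source.
/// Stand-in output: a sorted, de-duplicated slice rather than a phf::Set.
pub fn render_wordlist_source(wordlist: &str) -> String {
    let mut words: Vec<&str> = wordlist
        .lines()
        .map(str::trim)
        .filter(|w| !w.is_empty())
        .collect();
    words.sort_unstable();
    words.dedup();

    let mut src = String::from("pub static WORDS: &[&str] = &[\n");
    for w in &words {
        // {:?} renders the word as a quoted, escaped string literal.
        src.push_str(&format!("    {:?},\n", w));
    }
    src.push_str("];\n");
    src
}

/// Hypothetical writer: a build script would pass the directory named by the
/// OUT_DIR environment variable that Cargo sets for build scripts.
pub fn write_generated(out_dir: &Path, source: &str) -> io::Result<PathBuf> {
    let dest = out_dir.join("wordlist.rs");
    fs::write(&dest, source)?;
    Ok(dest)
}
```

The module side would then pull the file in with `include!(concat!(env!("OUT_DIR"), "/wordlist.rs"));`, which is the usual pattern for code generated by a build script.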
### API
```rust
pub fn is_english_word(s: &str) -> bool
```
- Case-insensitive lookup (input is lowercased before checking)
- Returns `false` for any input containing non-ASCII characters (English-only wordlist)
- O(1) lookup via phf's perfect hash function
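
The lookup rules above can be illustrated without the generated set. This is a minimal stand-in, not the module's code: `SAMPLE_WORDS` and `is_english_word_sketch` are hypothetical names, and a sorted slice with binary search replaces the `phf::Set` so the block compiles with the standard library alone. It is therefore O(log n), not O(1).

```rust
/// Tiny stand-in for the generated set. Must stay sorted for binary search.
const SAMPLE_WORDS: &[&str] = &["a", "and", "i", "of", "the", "to", "word"];

/// Sketch of the documented behavior: reject non-ASCII input, lowercase the
/// rest, then test membership.
pub fn is_english_word_sketch(s: &str) -> bool {
    if !s.is_ascii() {
        return false; // English-only list: non-ASCII can never match
    }
    let lower = s.to_ascii_lowercase();
    SAMPLE_WORDS.binary_search(&lower.as_str()).is_ok()
}
```

With a generated `phf::Set<&'static str>` (assumed here to be named `WORDS`), the last line would become `WORDS.contains(lower.as_str())`.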
## Test Results
### Unit tests (9/9 passed)
- ✅ test_common_words
- ✅ test_case_insensitive
- ✅ test_inflected_forms
- ✅ test_empty_string
- ✅ test_not_in_wordlist
- ✅ test_non_ascii_returns_false
- ✅ test_medium_frequency_words
- ✅ test_single_letter_words
- ✅ test_lookup_timing
### Benchmarks (< 100 ns requirement met)
- Common words: ~13-16 ns
- Medium frequency: ~53-58 ns
- Negative lookups: ~47-56 ns
- Case insensitive: ~52-62 ns
- Mixed batch: ~480 ns for 8 words (~60 ns per word)
All per-lookup timings are well under the 100 ns requirement.
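
For a quick sanity check outside the benchmark suite, per-lookup cost can be approximated with a plain timing loop. This is a rough sketch and not the harness in `benches/wordlist.rs`: both function names are hypothetical, and a loop like this has no warm-up or outlier handling, so it is noisier than a dedicated benchmark framework.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical helper: mean nanoseconds per call of `lookup` over
/// `rounds` passes through `inputs`.
pub fn mean_ns_per_lookup<F: Fn(&str) -> bool>(lookup: F, inputs: &[&str], rounds: u32) -> f64 {
    let total_calls = inputs.len() as u64 * u64::from(rounds);
    if total_calls == 0 {
        return 0.0;
    }
    let start = Instant::now();
    for _ in 0..rounds {
        for &w in inputs {
            // black_box keeps the optimizer from discarding the call.
            black_box(lookup(black_box(w)));
        }
    }
    start.elapsed().as_nanos() as f64 / total_calls as f64
}

/// Converts a batch timing into a per-item figure, as in the mixed-batch row.
pub fn per_item_ns(batch_ns: f64, items: u32) -> f64 {
    batch_ns / f64::from(items)
}
```

Applied to the mixed-batch row, `per_item_ns(480.0, 8)` gives the 60 ns per word quoted above.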
## Binary Size
Estimated phf::Set binary size: ~200-220 KB
- 20,000 words × ~8 chars avg = ~160 KB string data
- phf perfect hash table overhead = ~40-60 KB
This estimate is within the 250 KB CI gate. It has not been measured directly: the exact contribution can only be read from the final linked binary, so the figure rests on typical `phf::Set` characteristics.
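
The arithmetic behind the estimate can be written out as a small sketch. The names below are hypothetical, 1 KB is taken as 1,000 bytes, and only the two terms listed above are modeled. Anything else the linked binary carries, such as per-entry string slice headers, is not counted.

```rust
/// CI gate from the notes, in KB (assumed to mean 1,000 bytes each).
pub const CI_GATE_KB: usize = 250;

/// Hypothetical helper: string data plus a (low, high) overhead range, in KB.
pub fn estimate_kb(
    word_count: usize,
    avg_len: usize,
    overhead_kb: (usize, usize),
) -> (usize, usize) {
    let string_kb = word_count * avg_len / 1_000; // 20,000 * 8 = 160,000 bytes
    (string_kb + overhead_kb.0, string_kb + overhead_kb.1)
}

/// True when even the upper bound of the estimate fits under the gate.
pub fn within_gate(estimate: (usize, usize)) -> bool {
    estimate.1 <= CI_GATE_KB
}
```

With the figures above, `estimate_kb(20_000, 8, (40, 60))` yields the 200-220 KB range, which `within_gate` accepts.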
## Files Changed
- `crates/pdftract-core/build.rs`: Added wordlist generation
- `crates/pdftract-core/build/wordlist-en-20k.txt`: Source wordlist
- `crates/pdftract-core/src/layout/wordlist.rs`: Wordlist module with API
- `crates/pdftract-core/src/layout/mod.rs`: Exported `is_english_word`
- `crates/pdftract-core/Cargo.toml`: Added wordlist benchmark
- `crates/pdftract-core/benches/wordlist.rs`: Performance benchmarks
## Git Commits
- `b96c3bfd37` feat(pdftract-9wevc): implement 20k English wordlist for readability scoring
## References
- Plan section: Phase 4.7 Word list (lines 1787 and 1805)
- Bead: pdftract-9wevc