pdftract/notes/pdftract-3gf5t.md
jedarden 80ad0b5cb4 feat(pdftract-3gf5t): implement walkdir folder traversal for grep
Add path expansion module (expand.rs) with:
- FileWorkItem and PathOrUrl types for work items
- expand_paths() function for directory traversal via walkdir
- Case-insensitive *.pdf filtering
- Hidden directory skip (. prefix)
- Remote URL support when feature enabled
- bytes_total calculation for progress reporting

Fix event.rs should_skip_confidence() for proper NaN handling.

All 130 grep tests pass. See notes/pdftract-3gf5t.md for details.
2026-05-26 17:42:27 -04:00


pdftract-3gf5t: walkdir folder traversal + *.pdf filter + remote URL expansion

Summary

Implemented path expansion for the pdftract grep subcommand. This includes:

  1. FileWorkItem structure: Created FileWorkItem and PathOrUrl types to represent work items
  2. Path expansion: Implemented expand_paths() function that:
    • Expands local file paths (single files and directories)
    • Walks directories via walkdir with *.pdf filtering (case-insensitive)
    • Supports https:// URLs when the remote feature is enabled
    • Skips hidden directories (starting with .)
    • Silently skips non-PDF files
    • Calculates bytes_total for progress reporting
  3. Public API: Added produce_work_items() function as the public entry point
  4. Integration: Updated run_grep() to use the new path expansion logic
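The traversal rules listed above (case-insensitive *.pdf filter, hidden-directory skip, no symlink following) can be sketched as follows. This is a minimal illustration, not the code in expand.rs: it swaps walkdir for std::fs::read_dir so that it compiles without dependencies, and the helper names is_pdf, is_hidden, and collect_pdfs are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True when the extension is "pdf", compared case-insensitively
/// (so "a.pdf", "B.PDF", and "c.Pdf" all match).
fn is_pdf(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map_or(false, |ext| ext.eq_ignore_ascii_case("pdf"))
}

/// True when the final path component starts with a dot.
fn is_hidden(path: &Path) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .map_or(false, |name| name.starts_with('.'))
}

/// Recursively collect PDF files under `dir` (hypothetical helper).
/// Non-PDF files are skipped without any message.
fn collect_pdfs(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        // DirEntry::file_type does not follow symlinks, so a symlink is
        // neither "dir" nor "file" here and is never descended into.
        // That rules out symlink loops in this sketch.
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            if !is_hidden(&path) {
                collect_pdfs(&path, out)?;
            }
        } else if file_type.is_file() && is_pdf(&path) {
            out.push(path);
        }
    }
    Ok(())
}
```

Calling collect_pdfs on a directory that holds a.pdf, sub/B.PDF, notes.txt, and .hidden/c.pdf would collect only the first two paths. Whether the real module also skips hidden files, or treats symlinked PDFs differently, is not stated in these notes.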

Files Changed

  • crates/pdftract-cli/src/grep/expand.rs (new): Path expansion module with FileWorkItem, PathOrUrl, and expand_paths()
  • crates/pdftract-cli/src/grep/mod.rs: Added expand module import and produce_work_items() function
  • crates/pdftract-cli/src/grep/event.rs: Fixed should_skip_confidence() so that NaN and Infinity values are handled correctly during JSON serialization
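
JSON has no literal for NaN or Infinity, so a confidence value that is not finite cannot be written as a JSON number. The sketch below shows one way a skip predicate could guard against that. The signature (taking &Option<f32>) is an assumption made for illustration; the notes name only the function, not its parameters or the exact behavior of the fix.

```rust
/// Hypothetical shape of the predicate: returns true when the confidence
/// field should be left out of the serialized event.
fn should_skip_confidence(confidence: &Option<f32>) -> bool {
    match confidence {
        // Nothing to serialize.
        None => true,
        // is_finite() is false for NaN, +Infinity, and -Infinity.
        // Comparing with `== f32::NAN` would not work, because NaN is
        // never equal to anything, including itself.
        Some(value) => !value.is_finite(),
    }
}
```

A predicate of this form (a function from &T to bool) is the shape that serde's skip_serializing_if attribute expects, although these notes do not say whether the project wires it up that way.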

Acceptance Criteria Status

  • walkdir filters non-PDF files silently
  • Single-file paths produce one FileWorkItem
  • A PATH list that mixes directories and files works
  • An https:// URL produces a FileWorkItem when the remote feature is on, and a clap error when it is off
  • A symlink loop does not hang (walkdir runs with follow_links(false))
  • bytes_total is an accurate sum
  • Public produce_work_items(args: &GrepArgs) -> impl Iterator<Item = FileWorkItem>
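
The criteria above can be illustrated with a small sketch of the entry point. Only the names FileWorkItem, PathOrUrl, GrepArgs, produce_work_items, and bytes_total come from these notes; every field, the classify helper, and the choice to treat URL sizes as unknown are assumptions. The sketch does not expand directories or check the remote feature flag.

```rust
use std::fs;
use std::path::PathBuf;

/// Assumed layout: a work item is either a local path or a remote URL.
#[derive(Debug, Clone, PartialEq)]
enum PathOrUrl {
    Path(PathBuf),
    Url(String),
}

/// Assumed layout: the source plus its size in bytes when known.
#[derive(Debug, Clone, PartialEq)]
struct FileWorkItem {
    source: PathOrUrl,
    size: Option<u64>,
}

/// Hypothetical stand-in for the real GrepArgs, which has more fields.
struct GrepArgs {
    paths: Vec<String>,
}

/// Hypothetical helper: arguments starting with "https://" are URLs,
/// everything else is treated as a local path.
fn classify(arg: &str) -> PathOrUrl {
    if arg.starts_with("https://") {
        PathOrUrl::Url(arg.to_string())
    } else {
        PathOrUrl::Path(PathBuf::from(arg))
    }
}

/// Lazily yields one work item per argument.
fn produce_work_items(args: &GrepArgs) -> impl Iterator<Item = FileWorkItem> + '_ {
    args.paths.iter().map(|arg| {
        let source = classify(arg);
        let size = match &source {
            // Local file: read the size from filesystem metadata.
            PathOrUrl::Path(path) => fs::metadata(path).ok().map(|meta| meta.len()),
            // Remote file: size is unknown before download (assumption).
            PathOrUrl::Url(_) => None,
        };
        FileWorkItem { source, size }
    })
}

/// Sum of all known sizes; items with unknown size contribute nothing.
fn bytes_total(items: &[FileWorkItem]) -> u64 {
    items.iter().filter_map(|item| item.size).sum()
}
```

Returning impl Iterator keeps the concrete iterator type out of the public signature, so the expansion strategy can change without affecting callers.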

Tests

All 130 grep-related tests pass with --features grep. The per-module counts listed here account for 110 of them:

  • expand.rs tests: 11/11 passed
  • matcher.rs tests: 24/24 passed
  • event.rs tests: 22/22 passed
  • mod.rs tests: 53/53 passed

References

  • Plan section 7.8: line 2708 (path semantics), line 2715 (-r recursive), line 2793 (non-PDF files silently skipped)
  • Bead: pdftract-3gf5t