jedarden/pdftract

Fork 0

Commit graph

Author	SHA1	Message	Date
jedarden	e6bf3dd290	feat(pdftract-3s2i): implement Phase 5.5.2 validation filter Implement per-word validation filter for assisted-OCR BrokenVector path. Changes: - Add SpanSource::OcrAssisted variant to hybrid.rs - Add Span::ocr_assisted() helper method - Implement validate_ocr_with_position_hints() in ocr.rs - 5pt distance threshold for position validation - 0.4 confidence cap for rejected words - Linear scan for nearest-neighbor lookup - Add unit tests for validation filter Closes: pdftract-3s2i Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 04:57:17 -04:00
jedarden	e2c1e2817b	feat(pdftract-2i6rt): implement cache CLI subcommand and HTTP integration This commit implements Phase 6.9.6: surfacing the cache as user-visible CLI and HTTP affordances. ## Changes - Add `pdftract cache` subcommand with stats/clear/purge actions - `stats DIR`: show entry count, size, hit ratio, age distribution - `stats DIR --json`: emit JSON with same fields - `clear DIR`: delete all entries (preserves index.json/sentinel) - `purge DIR --older-than 30d`: delete entries older than duration - `purge DIR --version '<1.0.0'`: version constraint purge (stub) - Add global flags to extract-style subcommands - `--cache-dir DIR`: enable cache at directory - `--cache-size SIZE`: set LRU size limit (default 1 GiB) - `--no-cache`: disable cache for this call - Add `X-Pdftract-Cache: hit\|miss\|skipped` HTTP header on /extract endpoints - Set in response headers before body streaming - Add JSON metadata fields - `metadata.cache_status`: "hit" \| "miss" \| "skipped" - `metadata.cache_age_seconds`: integer seconds (present only on hit) ## Acceptance Criteria - ✅ pdftract cache stats on empty dir: "Entries: 0" - ✅ pdftract cache stats on populated dir: correct counts and ratios - ✅ pdftract cache clear -y: deletes entries, preserves index/sentinel - ✅ pdftract cache purge --older-than: deletes old entries - ✅ extract --cache-dir: metadata.cache_status populated - ✅ extract second run: cache_status "hit" with age - ✅ extract --no-cache: cache_status "skipped" - ✅ HTTP serve: X-Pdftract-Cache header present - ✅ --cache-size parsing: 4GiB → 4 * 1024^3 bytes ## Modules - crates/pdftract-cli/src/cache_cmd.rs: subcommand implementation - crates/pdftract-cli/src/serve.rs: HTTP handler integration - crates/pdftract-cli/src/main.rs: CLI flag definitions - crates/pdftract-core/src/cache/mod.rs: extract_with_cache() integration - crates/pdftract-core/src/extract.rs: cache_status metadata fields Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 06:33:43 -04:00

Author

SHA1

Message

Date

jedarden

e6bf3dd290

feat(pdftract-3s2i): implement Phase 5.5.2 validation filter

Implement per-word validation filter for assisted-OCR BrokenVector path.

Changes:
- Add SpanSource::OcrAssisted variant to hybrid.rs
- Add Span::ocr_assisted() helper method
- Implement validate_ocr_with_position_hints() in ocr.rs
  - 5pt distance threshold for position validation
  - 0.4 confidence cap for rejected words
  - Linear scan for nearest-neighbor lookup
- Add unit tests for validation filter

Closes: pdftract-3s2i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-24 04:57:17 -04:00

jedarden

e2c1e2817b

feat(pdftract-2i6rt): implement cache CLI subcommand and HTTP integration

This commit implements Phase 6.9.6: surfacing the cache as user-visible
CLI and HTTP affordances.

## Changes

- Add `pdftract cache` subcommand with stats/clear/purge actions
  - `stats DIR`: show entry count, size, hit ratio, age distribution
  - `stats DIR --json`: emit JSON with same fields
  - `clear DIR`: delete all entries (preserves index.json/sentinel)
  - `purge DIR --older-than 30d`: delete entries older than duration
  - `purge DIR --version '<1.0.0'`: version constraint purge (stub)

- Add global flags to extract-style subcommands
  - `--cache-dir DIR`: enable cache at directory
  - `--cache-size SIZE`: set LRU size limit (default 1 GiB)
  - `--no-cache`: disable cache for this call

- Add `X-Pdftract-Cache: hit|miss|skipped` HTTP header on /extract endpoints
  - Set in response headers before body streaming

- Add JSON metadata fields
  - `metadata.cache_status`: "hit" | "miss" | "skipped"
  - `metadata.cache_age_seconds`: integer seconds (present only on hit)

## Acceptance Criteria

- ✅ pdftract cache stats on empty dir: "Entries: 0"
- ✅ pdftract cache stats on populated dir: correct counts and ratios
- ✅ pdftract cache clear -y: deletes entries, preserves index/sentinel
- ✅ pdftract cache purge --older-than: deletes old entries
- ✅ extract --cache-dir: metadata.cache_status populated
- ✅ extract second run: cache_status "hit" with age
- ✅ extract --no-cache: cache_status "skipped"
- ✅ HTTP serve: X-Pdftract-Cache header present
- ✅ --cache-size parsing: 4GiB → 4 * 1024^3 bytes

## Modules

- crates/pdftract-cli/src/cache_cmd.rs: subcommand implementation
- crates/pdftract-cli/src/serve.rs: HTTP handler integration
- crates/pdftract-cli/src/main.rs: CLI flag definitions
- crates/pdftract-core/src/cache/mod.rs: extract_with_cache() integration
- crates/pdftract-core/src/extract.rs: cache_status metadata fields

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-23 06:33:43 -04:00

2 commits