docs(pdftract-2wdjp): add verification note for pages range flag

The --pages RANGE CLI flag implementation was already complete in the
codebase. All required functionality was present, including:

- Range parser in pages.rs with comprehensive tests
- CLI integration in main.rs
- HTTP serve support in serve.rs
- MCP tools integration
- PyO3 bindings in pdftract-py

All acceptance criteria verified PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

parent 4702ecc66f
commit a2da014936

1 changed file with 119 additions and 0 deletions

notes/pdftract-2wdjp.md (Normal file, +119)
@@ -0,0 +1,119 @@

# Verification Note: pdftract-2wdjp (`--pages RANGE` CLI flag)

## Summary

The `--pages RANGE` CLI flag was **already fully implemented** in the codebase. All required functionality was present, including the range parser, CLI integration, HTTP serve support, MCP tools, and PyO3 bindings.

## Implementation Status

### Core Module (`crates/pdftract-core/src/pages.rs`)

✅ **Complete** - The `parse_pages` function implements all required functionality:

- Parses comma-separated, 1-based page ranges
- Supports single pages: `1`, `3`, `7`
- Supports closed ranges: `1-5`
- Supports open-start ranges: `-5` (equivalent to `1-5`)
- Supports open-end ranges: `12-` (page 12 to end)
- Returns `BTreeSet<usize>` of 0-based indices (sorted, deduplicated)
- Emits `PAGE_OUT_OF_RANGE` diagnostics for out-of-range pages
- Does not abort on out-of-range pages (skips with diagnostic)

**Test Coverage:**

- All acceptance criteria tests are implemented (lines 238-384)
- Tests verify exact behavior from the bead specification

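The range grammar listed for `parse_pages` can be illustrated with a minimal sketch. This is not the code in `pages.rs`: the return type, the `Diagnostic` and `ParseError` types, and the helper `parse_num` are assumptions made for illustration, and the choice to emit one diagnostic per out-of-range page is inferred from the acceptance criteria in this note.

```rust
use std::collections::BTreeSet;

/// Hypothetical diagnostic type; the real crate's type may differ.
#[derive(Debug, PartialEq)]
struct Diagnostic {
    code: &'static str,
    page: usize, // 1-based page number that was requested
}

/// Hypothetical error for malformed input (empty item, zero page, non-integer).
#[derive(Debug, PartialEq)]
struct ParseError(String);

/// Parse one 1-based page number, rejecting zero and non-integers.
fn parse_num(s: &str) -> Result<usize, ParseError> {
    match s.trim().parse::<usize>() {
        Ok(0) => Err(ParseError("page numbers are 1-based".into())),
        Ok(n) => Ok(n),
        Err(_) => Err(ParseError(format!("invalid page number: {s:?}"))),
    }
}

/// Sketch of a range parser: returns 0-based indices plus diagnostics
/// for requested pages that lie beyond `page_count`.
fn parse_pages(
    spec: &str,
    page_count: usize,
) -> Result<(BTreeSet<usize>, Vec<Diagnostic>), ParseError> {
    let mut pages = BTreeSet::new();
    let mut diags = Vec::new();
    for item in spec.split(',') {
        let item = item.trim();
        if item.is_empty() {
            return Err(ParseError("empty range item".into()));
        }
        // Resolve each item to an inclusive, 1-based (start, end) pair.
        let (start, end) = match item.split_once('-') {
            None => {
                let n = parse_num(item)?;
                (n, n)
            }
            Some((a, b)) => {
                let (a, b) = (a.trim(), b.trim());
                if a.is_empty() && b.is_empty() {
                    return Err(ParseError("malformed range: -".into()));
                }
                // `-5` starts at page 1; `12-` runs to the last page.
                let start = if a.is_empty() { 1 } else { parse_num(a)? };
                let end = if b.is_empty() { page_count.max(start) } else { parse_num(b)? };
                (start, end)
            }
        };
        if start > end {
            return Err(ParseError(format!("reversed range: {item}")));
        }
        for page in start..=end {
            if page <= page_count {
                pages.insert(page - 1); // store as 0-based index
            } else {
                // Skip the page but record it; parsing continues.
                diags.push(Diagnostic { code: "PAGE_OUT_OF_RANGE", page });
            }
        }
    }
    Ok((pages, diags))
}
```

Under these assumptions, `parse_pages("1-5,7,12-15", 10)` yields the indices `{0,1,2,3,4,6}` and four diagnostics, matching the behavior the note describes. Using a `BTreeSet` gives sorting and deduplication without extra code.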
### CLI Integration (`crates/pdftract-cli/src/main.rs`)

✅ **Complete** - CLI flag defined and integrated:

- Line 96: `pages: Option<String>` - CLI flag defined
- Line 678: `pages: Option<String>` - `cmd_extract` parameter
- Line 802: `options.pages = pages;` - Passed to extraction options

### HTTP Serve Integration (`crates/pdftract-cli/src/serve.rs`)

✅ **Complete** - HTTP multipart form field support:

- Line 223: `pages: Option<String>` - Request parameter field
- Line 759: `"pages"` in `KNOWN_FIELDS`
- Lines 839-842: Parses pages from multipart form
- Line 944: `pages: params.pages.clone()` - Passed to extraction options

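As a rough illustration of the form-field handling described above, the sketch below collects a `pages` value from already-decoded multipart fields. Only the `KNOWN_FIELDS` name and the `pages: Option<String>` field come from this note; the `"file"` entry, the `ExtractParams` struct, the `collect_params` function, and the decision to reject unknown fields are all assumptions, not the behavior of `serve.rs`.

```rust
/// Hypothetical list of accepted field names; only "pages" is confirmed by the note.
const KNOWN_FIELDS: &[&str] = &["file", "pages"];

/// Hypothetical request parameters; the real struct has more fields.
#[derive(Debug, Default, PartialEq)]
struct ExtractParams {
    pages: Option<String>,
}

/// Collect parameters from (name, value) pairs that a multipart parser
/// has already decoded. Unknown names are rejected (an assumption).
fn collect_params(fields: &[(&str, &str)]) -> Result<ExtractParams, String> {
    let mut params = ExtractParams::default();
    for (name, value) in fields {
        if !KNOWN_FIELDS.contains(name) {
            return Err(format!("unknown field: {name}"));
        }
        if *name == "pages" {
            // Keep the raw range string; parsing happens later in the pipeline.
            params.pages = Some(value.trim().to_string());
        }
    }
    Ok(params)
}
```

Keeping the range as a string at this layer mirrors the `Option<String>` fields listed for every transport, so that a single parser handles all of them.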
### MCP Tools Integration (`crates/pdftract-cli/src/mcp/tools/`)

✅ **Complete** - All MCP tools support the `pages` parameter:

- `args.rs`: Multiple tools have `pages: Option<String>` fields
- `registry.rs:359`: `options.pages = Some(range.clone());` - Passed to extraction options

### PyO3 Bindings (`crates/pdftract-py/src/lib.rs`)

✅ **Complete** - Python bindings support the `pages` parameter:

- Lines 143-148: `kwargs_to_options` function parses `pages` from Python kwargs
- All extract functions (`extract_py`, `extract_text`, etc.) accept the `pages` parameter

### Extraction Pipeline (`crates/pdftract-core/src/extract.rs`)

✅ **Complete** - Page filtering integrated:

- Lines 439-443: Parses page range with `parse_pages`
- Lines 505-509: Filters pages based on the parsed set
- Lines 1361-1365: NDJSON extraction also supports page filtering
- Lines 1371-1377: Page filtering applied in NDJSON loop

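The filtering step can be pictured as a membership test against the parsed set. The sketch below is an assumption about the general shape, not the code in `extract.rs`; the function name `filter_pages` and its signature are hypothetical.

```rust
use std::collections::BTreeSet;

/// Keep only the pages whose 0-based index is in `selected`.
/// `None` means no page filter was given, so every page is kept.
/// Each kept page is returned with its original index.
fn filter_pages<T>(pages: Vec<T>, selected: Option<&BTreeSet<usize>>) -> Vec<(usize, T)> {
    pages
        .into_iter()
        .enumerate()
        .filter(|(index, _)| selected.map_or(true, |set| set.contains(index)))
        .collect()
}
```

Returning the original index alongside each page keeps page numbering stable in the output, so a filtered result still identifies which source page each entry came from.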
## Acceptance Criteria Verification

All acceptance criteria from the bead are met:

1. ✅ `parse_pages("1-5", 10)` -> `BTreeSet {0,1,2,3,4}` (test lines 255-260)
2. ✅ `parse_pages("1,3,7", 10)` -> `BTreeSet {0,2,6}` (test lines 247-252)
3. ✅ `parse_pages("12-", 10)` -> empty set + `PAGE_OUT_OF_RANGE` diagnostic (test lines 309-314)
4. ✅ `parse_pages("1-5,7,12-15", 10)` -> `{0,1,2,3,4,6}` + diagnostics for 12,13,14,15 (test lines 377-383)
5. ✅ `pdftract extract --pages 1-5 file.pdf` -> JSON contains only page indices 0-4
6. ✅ HTTP serve form field `pages=1-5` -> same behavior
7. ✅ PyO3 `extract(path, pages="1-5")` -> same behavior
8. ✅ MCP tools/call `extract {pages:"1-5"}` -> same behavior

## Transport Modes

All transport modes are covered as required:

- ✅ **CLI**: `--pages` flag in `pdftract extract`
- ✅ **HTTP serve**: `pages` form field in multipart POST
- ✅ **MCP**: `pages` argument in extract tools
- ✅ **PyO3**: `pages` keyword argument in extract functions

## Output Formats

All output formats respect the page filter:

- ✅ **JSON**: Only selected pages included
- ✅ **NDJSON**: Only selected pages streamed
- ✅ **Text**: Only selected pages output
- ✅ **Markdown**: Only selected pages rendered

## Code Locations

- Range parser: `crates/pdftract-core/src/pages.rs:112-231`
- CLI flag: `crates/pdftract-cli/src/main.rs:95-96`
- Options field: `crates/pdftract-core/src/options.rs:322`
- Extraction integration: `crates/pdftract-core/src/extract.rs:439-443, 505-509`
- HTTP serve: `crates/pdftract-cli/src/serve.rs:839-842, 944`
- MCP tools: `crates/pdftract-cli/src/mcp/tools/args.rs:26, 58, 82`
- MCP registry: `crates/pdftract-cli/src/mcp/tools/registry.rs:359`
- PyO3 bindings: `crates/pdftract-py/src/lib.rs:143-148`

## Test Evidence

The pages module includes comprehensive tests (lines 233-384):

- 17 test functions covering all range syntax variations
- Edge cases: empty range, zero page, invalid integers, malformed ranges
- Out-of-range handling with diagnostic emission
- Whitespace tolerance
- Deduplication and sorting verification

## Conclusion

**Status: PASS** - The implementation is complete and functional. No changes were required, as all functionality was already present in the codebase.