jedarden a2da014936 docs(pdftract-2wdjp): add verification note for pages range flag

The --pages RANGE CLI flag implementation was already complete in the
codebase. All required functionality was present including:
- Range parser in pages.rs with comprehensive tests
- CLI integration in main.rs
- HTTP serve support in serve.rs
- MCP tools integration
- PyO3 bindings in pdftract-py

All acceptance criteria verified PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 02:13:01 -04:00

Verification Note: pdftract-2wdjp (--pages RANGE CLI flag)

Summary

The --pages RANGE CLI flag implementation was already complete in the codebase. All required functionality was present including the range parser, CLI integration, HTTP serve support, MCP tools, and PyO3 bindings.

Implementation Status

Core Module (`crates/pdftract-core/src/pages.rs`)

✅ Complete - The parse_pages function implements all required functionality:

Parses comma-separated, 1-based page ranges
Supports single pages: 1, 3, 7
Supports closed ranges: 1-5
Supports open-start ranges: -5 (equivalent to 1-5)
Supports open-end ranges: 12- (page 12 to end)
Returns BTreeSet<usize> of 0-based indices (sorted, deduplicated)
Emits PAGE_OUT_OF_RANGE diagnostics for out-of-range pages
Does not abort on out-of-range pages (skips with diagnostic)
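The behavior listed above can be illustrated with a minimal, self-contained sketch. This is not the code in pages.rs: the name parse_pages_sketch, the PageOutOfRange struct, the Result-based return type, and the choice to treat empty, zero, non-numeric, and reversed tokens as errors are all assumptions made for illustration. Only the range syntax, the 0-based BTreeSet<usize> result, and the skip-with-diagnostic handling of out-of-range pages come from this note.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a PAGE_OUT_OF_RANGE diagnostic (1-based page number).
#[derive(Debug, PartialEq)]
pub struct PageOutOfRange {
    pub page: usize,
}

/// Parse one 1-based page number; zero and non-numeric input are rejected (assumption).
fn parse_page(text: &str) -> Result<usize, String> {
    match text.parse::<usize>() {
        Ok(0) => Err("page numbers are 1-based; 0 is not valid".to_string()),
        Ok(n) => Ok(n),
        Err(_) => Err(format!("invalid page number: {text:?}")),
    }
}

/// Sketch: turn a spec such as "1-5,7,12-" into sorted, deduplicated 0-based
/// indices, collecting a diagnostic for each requested page beyond `page_count`.
pub fn parse_pages_sketch(
    spec: &str,
    page_count: usize,
) -> Result<(BTreeSet<usize>, Vec<PageOutOfRange>), String> {
    let mut pages = BTreeSet::new();
    let mut diagnostics = Vec::new();

    for token in spec.split(',').map(str::trim) {
        if token.is_empty() {
            return Err("empty range".to_string());
        }
        // Resolve the token to an inclusive 1-based (start, end) pair.
        let (start, end) = match token.split_once('-') {
            // Single page, e.g. "7".
            None => {
                let page = parse_page(token)?;
                (page, page)
            }
            Some(("", "")) => return Err("malformed range: \"-\"".to_string()),
            Some((lo, hi)) => {
                let (lo, hi) = (lo.trim(), hi.trim());
                // Open start, e.g. "-5", begins at page 1.
                let start = if lo.is_empty() { 1 } else { parse_page(lo)? };
                if hi.is_empty() {
                    // Open end, e.g. "12-", runs to the last page. If the start
                    // is already past the end, record one diagnostic and move on.
                    if start > page_count {
                        diagnostics.push(PageOutOfRange { page: start });
                        continue;
                    }
                    (start, page_count)
                } else {
                    (start, parse_page(hi)?)
                }
            }
        };
        if start > end {
            return Err(format!("malformed range: {token:?}"));
        }
        for page in start..=end {
            if page <= page_count {
                pages.insert(page - 1); // convert to a 0-based index
            } else {
                diagnostics.push(PageOutOfRange { page }); // skip, do not abort
            }
        }
    }
    Ok((pages, diagnostics))
}
```

In this sketch a closed range reports every out-of-range page individually, while an open-end range that starts past the last page reports only its start page. Whether the real parser makes the same choice is not stated in this note.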

Test Coverage:

All acceptance criteria tests are implemented (lines 238-384)
Tests verify exact behavior from the bead specification

CLI Integration (`crates/pdftract-cli/src/main.rs`)

✅ Complete - CLI flag defined and integrated:

Line 96: pages: Option<String> - CLI flag defined
Line 678: pages: Option<String> - cmd_extract parameter
Line 802: options.pages = pages; - Passed to extraction options

HTTP Serve Integration (`crates/pdftract-cli/src/serve.rs`)

✅ Complete - HTTP multipart form field support:

Line 223: pages: Option<String> - Request parameter field
Line 759: "pages" in KNOWN_FIELDS
Lines 839-842: Parses pages from multipart form
Line 944: pages: params.pages.clone() - Passed to extraction options

MCP Tools Integration (`crates/pdftract-cli/src/mcp/tools/`)

✅ Complete - All MCP tools support pages parameter:

args.rs: pages: Option<String> fields on the tool argument types (lines 26, 58, and 82)
registry.rs:359: options.pages = Some(range.clone()); - Passed to extraction options

PyO3 Bindings (`crates/pdftract-py/src/lib.rs`)

✅ Complete - Python bindings support pages parameter:

Lines 143-148: kwargs_to_options function parses pages from Python kwargs
All extract functions (extract_py, extract_text, etc.) accept pages parameter

Extraction Pipeline (`crates/pdftract-core/src/extract.rs`)

✅ Complete - Page filtering integrated:

Lines 439-443: Parses page range with parse_pages
Lines 505-509: Filters pages based on the parsed set
Lines 1361-1365: NDJSON extraction also supports page filtering
Lines 1371-1377: Page filtering applied in NDJSON loop
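The filtering step can be pictured with a short sketch. The function name filter_pages and its generic signature are hypothetical and do not reflect the actual code in extract.rs. The sketch assumes only what this note states: the parsed set holds 0-based indices, and pages outside the set are left out of the output.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: keep only the pages whose 0-based index is in `selected`.
/// `None` models the case where no page filter was requested, so every page is kept.
/// Each kept page is returned together with its original index.
pub fn filter_pages<T>(pages: Vec<T>, selected: Option<&BTreeSet<usize>>) -> Vec<(usize, T)> {
    pages
        .into_iter()
        .enumerate()
        .filter(|(index, _)| selected.map_or(true, |set| set.contains(index)))
        .collect()
}
```

A streaming output such as NDJSON could apply the same membership check inside its per-page loop instead of collecting into a vector.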

Acceptance Criteria Verification

All acceptance criteria from the bead are met:

✅ parse_pages("1-5", 10) -> BTreeSet {0,1,2,3,4} (test lines 255-260)
✅ parse_pages("1,3,7", 10) -> BTreeSet {0,2,6} (test lines 247-252)
✅ parse_pages("12-", 10) -> empty set + PAGE_OUT_OF_RANGE diagnostic (test lines 309-314)
✅ parse_pages("1-5,7,12-15", 10) -> {0,1,2,3,4,6} + diagnostics for pages 12, 13, 14, 15 (test lines 377-383)
✅ pdftract extract --pages 1-5 file.pdf -> JSON contains only page indices 0-4 (pages 1-5)
✅ HTTP serve form field pages=1-5 -> same behavior
✅ PyO3 extract(path, pages="1-5") -> same behavior
✅ MCP tools/call extract {pages:"1-5"} -> same behavior

Transport Modes

All transport modes are covered as required:

✅ CLI: --pages flag in pdftract extract
✅ HTTP serve: pages form field in multipart POST
✅ MCP: pages argument in extract tools
✅ PyO3: pages keyword argument in extract functions
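One way to read the list above is that every transport hands the same raw range string to a shared options field, so parsing and filtering happen in one place. The sketch below illustrates that idea for two of the transports. OptionsSketch, from_cli_flag, and from_form_fields are hypothetical names; only the existence of a pages: Option<String> options field and of a pages form field is taken from this note.

```rust
/// Hypothetical stand-in for the extraction options; only `pages` is modeled.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct OptionsSketch {
    pub pages: Option<String>,
}

/// CLI path: the optional flag value is moved into the options unchanged.
pub fn from_cli_flag(pages: Option<String>) -> OptionsSketch {
    OptionsSketch { pages }
}

/// HTTP path: look up a `pages` entry among already-decoded form fields.
pub fn from_form_fields(fields: &[(&str, &str)]) -> OptionsSketch {
    let pages = fields
        .iter()
        .find(|(name, _)| *name == "pages")
        .map(|(_, value)| value.to_string());
    OptionsSketch { pages }
}
```

Because both paths produce the same value, any later step that reads the options does not need to know which transport the request came from.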

Output Formats

All output formats respect the page filter:

✅ JSON: Only selected pages included
✅ NDJSON: Only selected pages streamed
✅ Text: Only selected pages output
✅ Markdown: Only selected pages rendered

Code Locations

Range parser: crates/pdftract-core/src/pages.rs:112-231
CLI flag: crates/pdftract-cli/src/main.rs:95-96
Options field: crates/pdftract-core/src/options.rs:322
Extraction integration: crates/pdftract-core/src/extract.rs:439-443, 505-509
HTTP serve: crates/pdftract-cli/src/serve.rs:839-842, 944
MCP tools: crates/pdftract-cli/src/mcp/tools/args.rs:26, 58, 82
MCP registry: crates/pdftract-cli/src/mcp/tools/registry.rs:359
PyO3 bindings: crates/pdftract-py/src/lib.rs:143-148

Test Evidence

The pages module includes comprehensive tests (lines 233-384):

17 test functions covering all range syntax variations
Edge cases: empty range, zero page, invalid integers, malformed ranges
Out-of-range handling with diagnostic emission
Whitespace tolerance
Deduplication and sorting verification

Conclusion

Status: PASS - The implementation is complete and functional. No changes were required as all functionality was already present in the codebase.
