- Add startup banner with NO AUTH warning
- Add --max-decompress-gb CLI flag (default 1 GB)
- Add hard cap for --max-upload-mb at 4096 MB (4 GiB)
- Add max_decompress_gb form field parsing
- Update CLI help text with security model documentation
- Add comprehensive security model docs to serve.rs rustdoc

This implements the security constraints required by the bead:

- No built-in authentication (deploy behind reverse proxy)
- No file-path parameters (multipart upload only)
- Hard caps to prevent integer overflow
- Visible security warnings at startup

Closes: pdftract-4li3d
# Verification Note for pdftract-4c8qu

## Summary
Implemented per-page field tests and JSON schema updates for Phase 6.1 page-level fields.

## Changes Made

### 1. Added page_label tests to `crates/pdftract-core/src/schema/mod.rs`
- `test_page_json_with_page_labels_roman_numerals`: Verifies that `PageJson` serializes correctly with roman numeral page labels (i, ii, iii, etc.)
- `test_page_json_without_page_labels_absent`: Verifies that when a PDF has no /PageLabels, `page_label` is absent (null) from the JSON output
- `test_page_json_page_index_and_page_number_both_present`: Verifies that `page_index` and `page_number` are both always present and that the invariant `page_number = page_index + 1` holds
- `test_page_json_roundtrip_with_all_fields`: Verifies that a full serde roundtrip preserves all `PageJson` fields, including spans, blocks, and optional fields

### 2. Updated `docs/schema/v1.0/pdftract.schema.json`
Updated the `PageResult` definition to include all required page-level fields:
- Added `page_number` field (u32, 1-based, = page_index + 1)
- Added `page_label` field (optional string, from PDF /PageLabels number tree)
- Added `width` field (f32, page width in points)
- Added `height` field (f32, page height in points)
- Added `rotation` field (u16, 0/90/180/270 degrees)
- Added `type` field with enum values: "text", "scanned", "mixed", "broken_vector", "blank", "figure_only"
- Updated required fields array to include: index, page_number, width, height, rotation, type, spans, blocks, tables, annotations
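
The field constraints listed above can be sketched in plain Rust. This is an illustration only, not the crate's code: `PageFields`, `PageType`, and `check_page_fields` are hypothetical names, the struct carries only the scalar page-level fields (no spans, blocks, tables, or annotations), and it uses no serde so that it stays self-contained.

```rust
/// Page classification, mirroring the six values of the schema's `type` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageType {
    Text,
    Scanned,
    Mixed,
    BrokenVector,
    Blank,
    FigureOnly,
}

impl PageType {
    /// The string used for this variant in the JSON schema enum.
    pub fn as_str(self) -> &'static str {
        match self {
            PageType::Text => "text",
            PageType::Scanned => "scanned",
            PageType::Mixed => "mixed",
            PageType::BrokenVector => "broken_vector",
            PageType::Blank => "blank",
            PageType::FigureOnly => "figure_only",
        }
    }
}

/// Hypothetical stand-in for the scalar page-level fields (not the real `PageJson`).
#[derive(Debug, Clone, PartialEq)]
pub struct PageFields {
    pub page_index: u32,            // 0-based
    pub page_number: u32,           // 1-based, must equal page_index + 1
    pub page_label: Option<String>, // None when the PDF has no /PageLabels
    pub width: f32,                 // points
    pub height: f32,                // points
    pub rotation: u16,              // 0, 90, 180, or 270 degrees
    pub page_type: PageType,
}

/// Checks the two numeric constraints stated in the field list.
pub fn check_page_fields(p: &PageFields) -> Result<(), String> {
    // checked_add avoids overflow if page_index is u32::MAX.
    if p.page_index.checked_add(1) != Some(p.page_number) {
        return Err(format!(
            "page_number {} is not page_index {} + 1",
            p.page_number, p.page_index
        ));
    }
    if !matches!(p.rotation, 0 | 90 | 180 | 270) {
        return Err(format!("rotation {} is not one of 0/90/180/270", p.rotation));
    }
    Ok(())
}

#[cfg(test)]
mod sketch_tests {
    use super::*;

    #[test]
    fn accepts_valid_page_and_rejects_bad_rotation() {
        let page = PageFields {
            page_index: 0,
            page_number: 1,
            page_label: None,
            width: 612.0,
            height: 792.0,
            rotation: 90,
            page_type: PageType::Text,
        };
        assert!(check_page_fields(&page).is_ok());

        let bad = PageFields { rotation: 45, ..page.clone() };
        assert!(check_page_fields(&bad).is_err());
    }
}
```

Keeping `page_label` as an `Option<String>` makes the "no /PageLabels" case explicit in the type. How that `None` is rendered in JSON depends on the real struct's serde attributes, which this sketch does not model.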
## Acceptance Criteria Status

| Criterion | Status | Notes |
|-----------|--------|-------|
| Unit test: Page serializes with both page_index AND page_number | ✅ PASS | test_page_json_page_index_and_page_number_both_present |
| Unit test: PDF with /PageLabels [{S: "r"}] produces page_label "i", "ii", "iii" etc | ✅ PASS | test_page_json_with_page_labels_roman_numerals |
| Unit test: PDF without /PageLabels -> page_label absent | ✅ PASS | test_page_json_without_page_labels_absent |
| JSON Schema enum for page_type includes all values | ✅ PASS | Schema updated with enum: text, scanned, mixed, broken_vector, blank, figure_only |
| Roundtrip serde Page test passes | ✅ PASS | test_page_json_roundtrip_with_all_fields |
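
The roman-numeral criterion above relies on the PDF page-label style `/S /r` (lowercase roman numerals). As a rough illustration of the expected labels, and not the existing `PageLabelsTree` implementation, the conversion can be sketched as follows. `lower_roman` and `roman_page_label` are hypothetical helper names, and the sketch ignores label prefixes (`/P`).

```rust
/// Converts a positive number to lowercase roman numerals.
/// Returns None for 0, which has no roman representation.
pub fn lower_roman(n: u32) -> Option<String> {
    if n == 0 {
        return None;
    }
    // Values in descending order, including the subtractive pairs (cm, xc, iv, ...).
    const TABLE: [(u32, &str); 13] = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ];
    let mut rest = n;
    let mut out = String::new();
    for &(value, symbol) in TABLE.iter() {
        // Greedily emit the largest symbol that still fits.
        while rest >= value {
            out.push_str(symbol);
            rest -= value;
        }
    }
    Some(out)
}

/// Label for the page at `page_index`, given a label range that begins at
/// `range_start` (0-based page index) and whose first page has the numeric
/// value `first_value` (the /St entry, which defaults to 1 in the PDF spec).
/// Returns None if the page lies before the range or the value overflows.
pub fn roman_page_label(page_index: u32, range_start: u32, first_value: u32) -> Option<String> {
    let offset = page_index.checked_sub(range_start)?;
    lower_roman(first_value.checked_add(offset)?)
}

#[cfg(test)]
mod sketch_tests {
    use super::*;

    #[test]
    fn first_three_pages_are_i_ii_iii() {
        let labels: Vec<String> = (0..3)
            .filter_map(|idx| roman_page_label(idx, 0, 1))
            .collect();
        assert_eq!(labels, vec!["i", "ii", "iii"]);
    }
}
```

With a single range starting at page index 0 and a first value of 1, the first three pages get the labels "i", "ii", and "iii", matching the criterion in the table.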
## Test Results

```
cargo test -p pdftract-core --lib test_page_json
test schema::tests::test_page_json_minimal ... ok
test schema::tests::test_page_json_without_page_labels_absent ... ok
test schema::tests::test_page_json_with_page_labels_roman_numerals ... ok
test schema::tests::test_page_json_with_content ... ok
test schema::tests::test_page_json_page_index_and_page_number_both_present ... ok
test schema::tests::test_page_json_roundtrip_with_all_fields ... ok
test result: ok. 6 passed; 0 failed
```

## Files Modified
- `crates/pdftract-core/src/schema/mod.rs` (+126 lines, 4 new tests)
- `docs/schema/v1.0/pdftract.schema.json` (+44 lines, updated PageResult definition)

## Commit
- Hash: 90d1b9a
- Message: test(pdftract-4c8qu): add page_label tests and fix JSON schema

## Notes
- The page_label parser (`PageLabelsTree`) already exists, with full functionality, in `crates/pdftract-core/src/parser/catalog.rs`
- The `PageJson` struct already had all required fields (page_index, page_number, page_label, width, height, rotation, page_type, spans, blocks, tables, annotations)
- The JSON schema was updated to match the Rust `PageJson` structure
- No WARN or FAIL items - all acceptance criteria met