jedarden c7acac5d1f feat(pdftract-4li3d): implement security constraints for serve mode

- Add startup banner with NO AUTH warning
- Add --max-decompress-gb CLI flag (default 1 GB)
- Add hard cap for --max-upload-mb at 4096 MB (4 GiB)
- Add max_decompress_gb form field parsing
- Update CLI help text with security model documentation
- Add comprehensive security model docs to serve.rs rustdoc

This implements the security constraints required by the bead:
- No built-in authentication (deploy behind reverse proxy)
- No file-path parameters (multipart upload only)
- Hard caps to prevent integer overflow
- Visible security warnings at startup

Closes: pdftract-4li3d

2026-05-26 18:47:51 -04:00

3.4 KiB

Raw Blame History

Verification Note for pdftract-4c8qu

Summary

Implemented per-page field tests and JSON schema updates for Phase 6.1 page-level fields.

Changes Made

1. Added page_label tests to `crates/pdftract-core/src/schema/mod.rs`

test_page_json_with_page_labels_roman_numerals: Verifies that PageJson correctly serializes with roman numeral page labels (i, ii, iii, etc)
test_page_json_without_page_labels_absent: Verifies that when a PDF has no /PageLabels, page_label is absent (null) from JSON output
test_page_json_page_index_and_page_number_both_present: Verifies that both page_index and page_number are always present and page_number = page_index + 1 invariant holds
test_page_json_roundtrip_with_all_fields: Verifies full roundtrip serde preservation of all PageJson fields including spans, blocks, and optional fields

2. Updated `docs/schema/v1.0/pdftract.schema.json`

Updated the PageResult definition to include all required page-level fields:

Added page_number field (u32, 1-based, = page_index + 1)
Added page_label field (optional string, from PDF /PageLabels number tree)
Added width field (f32, page width in points)
Added height field (f32, page height in points)
Added rotation field (u16, 0/90/180/270 degrees)
Added type field with enum values: "text", "scanned", "mixed", "broken_vector", "blank", "figure_only"
Updated required fields array to include: index, page_number, width, height, rotation, type, spans, blocks, tables, annotations

Acceptance Criteria Status

Criterion	Status	Notes
Unit test: Page serializes with both page_index AND page_number	✅ PASS	test_page_json_page_index_and_page_number_both_present
Unit test: PDF with /PageLabels [{S: "r"}] produces page_label "i", "ii", "iii" etc	✅ PASS	test_page_json_with_page_labels_roman_numerals
Unit test: PDF without /PageLabels -> page_label absent	✅ PASS	test_page_json_without_page_labels_absent
JSON Schema enum for page_type includes all values	✅ PASS	Schema updated with enum: text, scanned, mixed, broken_vector, blank, figure_only
Roundtrip serde Page test passes	✅ PASS	test_page_json_roundtrip_with_all_fields

Test Results

cargo test -p pdftract-core --lib test_page_json
test schema::tests::test_page_json_minimal ... ok
test schema::tests::test_page_json_without_page_labels_absent ... ok
test schema::tests::test_page_json_with_page_labels_roman_numerals ... ok
test schema::tests::test_page_json_with_content ... ok
test schema::tests::test_page_json_page_index_and_page_number_both_present ... ok
test schema::tests::test_page_json_roundtrip_with_all_fields ... ok
test result: ok. 6 passed; 0 failed

Files Modified

crates/pdftract-core/src/schema/mod.rs (+126 lines, 4 new tests)
docs/schema/v1.0/pdftract.schema.json (+44 lines, updated PageResult definition)

Commit

Hash: 90d1b9a
Message: test(pdftract-4c8qu): add page_label tests and fix JSON schema

Notes

The page_label parser (PageLabelsTree) already exists in crates/pdftract-core/src/parser/catalog.rs with full functionality
PageJson struct already had all required fields (page_index, page_number, page_label, width, height, rotation, page_type, spans, blocks, tables, annotations)
JSON schema was updated to match the Rust PageJson structure
No WARN or FAIL items - all acceptance criteria met

3.4 KiB Raw Blame History

Verification Note for pdftract-4c8qu

Summary

Changes Made

1. Added page_label tests to crates/pdftract-core/src/schema/mod.rs

2. Updated docs/schema/v1.0/pdftract.schema.json

Acceptance Criteria Status

Test Results

Files Modified

Commit

Notes

3.4 KiB

Raw Blame History

1. Added page_label tests to `crates/pdftract-core/src/schema/mod.rs`

2. Updated `docs/schema/v1.0/pdftract.schema.json`