pdftract

grep-jsonl.schema.json	feat(pdftract-5ls35): implement JSON-Lines output sink for grep	2026-05-25 02:05:17 -04:00
pdftract.schema.json	test(pdftract-4c8qu): add page_label tests and fix JSON schema	2026-05-25 14:43:31 -04:00

jedarden 90d1b9a83d test(pdftract-4c8qu): add page_label tests and fix JSON schema

- Add test_page_json_with_page_labels_roman_numerals: verifies page_label
  serialization with roman numeral values (i, ii, iii, etc)
- Add test_page_json_without_page_labels_absent: verifies page_label is
  absent (null) when PDF has no /PageLabels
- Add test_page_json_page_index_and_page_number_both_present: verifies
  both page_index and page_number are always present and page_number = page_index + 1
- Add test_page_json_roundtrip_with_all_fields: verifies full roundtrip
  serde preservation of all PageJson fields
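The labeling behavior these tests check can be sketched roughly as follows. This is a minimal illustration, not the pdftract implementation: `to_lower_roman`, `page_label`, and `page_number` are hypothetical names, and the sketch assumes a single /PageLabels range that starts at page index 0 with the lowercase Roman style and the default start value of 1.

```rust
/// Convert a positive number to a lowercase Roman numeral.
/// Returns None for 0, which has no Roman representation.
fn to_lower_roman(mut n: u32) -> Option<String> {
    if n == 0 {
        return None;
    }
    // Value/symbol pairs in descending order, including subtractive forms.
    const TABLE: [(u32, &str); 13] = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ];
    let mut out = String::new();
    for &(value, symbol) in TABLE.iter() {
        // Greedily emit the largest symbol that still fits.
        while n >= value {
            out.push_str(symbol);
            n -= value;
        }
    }
    Some(out)
}

/// Hypothetical helper: the label for a page, or None when the document
/// has no /PageLabels entry (assumption: one lowercase-Roman range from page 0).
fn page_label(page_index: u32, has_page_labels: bool) -> Option<String> {
    if !has_page_labels {
        return None;
    }
    to_lower_roman(page_index + 1)
}

/// The 1-based page number is always the 0-based index plus one.
fn page_number(page_index: u32) -> u32 {
    page_index + 1
}
```

Under these assumptions, the first three pages of a labeled document get "i", "ii", and "iii", while an unlabeled document yields `None` for every page.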

- Update docs/schema/v1.0/pdftract.schema.json PageResult definition:
  - Add page_number field (1-based, = page_index + 1)
  - Add page_label field (optional, from /PageLabels number tree)
  - Add width and height fields (page geometry in points)
  - Add rotation field (0, 90, 180, 270 degrees)
  - Add type field with enum: text, scanned, mixed, broken_vector, blank, figure_only
  - Update required fields to include all page-level fields
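The enum and rotation constraints listed above could be mirrored in Rust along these lines. This is a sketch only: `PageType`, `parse`, and `is_valid_rotation` are hypothetical names chosen for illustration, and the actual crate may well rely on serde attributes instead of hand-written string mapping.

```rust
/// Page classification values, matching the schema enum strings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageType {
    Text,
    Scanned,
    Mixed,
    BrokenVector,
    Blank,
    FigureOnly,
}

impl PageType {
    /// Every variant, used for lookup and exhaustive checks.
    const ALL: [PageType; 6] = [
        PageType::Text,
        PageType::Scanned,
        PageType::Mixed,
        PageType::BrokenVector,
        PageType::Blank,
        PageType::FigureOnly,
    ];

    /// The snake_case string used for this variant in the JSON output.
    fn as_str(self) -> &'static str {
        match self {
            PageType::Text => "text",
            PageType::Scanned => "scanned",
            PageType::Mixed => "mixed",
            PageType::BrokenVector => "broken_vector",
            PageType::Blank => "blank",
            PageType::FigureOnly => "figure_only",
        }
    }

    /// Parse a schema string back into a variant; None for unknown values.
    fn parse(s: &str) -> Option<PageType> {
        Self::ALL.iter().copied().find(|t| t.as_str() == s)
    }
}

/// The schema allows only quarter-turn rotations, in degrees.
fn is_valid_rotation(degrees: u32) -> bool {
    matches!(degrees, 0 | 90 | 180 | 270)
}
```

Keeping the string mapping in one `match` means a new variant fails to compile until it is given a schema string, which helps the code and the schema enum stay in step.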

Acceptance criteria:
✅ Page serializes with both page_index AND page_number
✅ PDF with /PageLabels [{S: "r"}] produces page_label "i", "ii", "iii" etc
✅ PDF without /PageLabels -> page_label absent
✅ JSON Schema enum for page_type includes all values
✅ Roundtrip serde Page test passes

Closes: pdftract-4c8qu

2026-05-25 14:43:31 -04:00
