jedarden 9992eb98d4 feat(pdftract-6arz): implement signature metadata extraction

Implement Phase 7.3.2: resolve /V dictionaries and extract signature metadata
including signer name, signing date (parsed to ISO 8601), reason, location,
SubFilter, ByteRange, and coverage fraction.

Key changes:
- Add Signature struct with all metadata fields
- Add parse_pdf_date() for PDF date format to ISO 8601 conversion
- Add decode_pdf_string() for PDFDocEncoding/UTF-16BE string decoding
- Add extract_signature_metadata() and extract_signatures() public APIs
- Add 18 new unit tests (27 total tests, all PASS)

Acceptance criteria:
- Two signature fields: both extracted with correct signer names and dates
- Unsigned signature field: emitted with empty fields (value: null analog)
- /ByteRange coverage: correctly computed as fraction of file bytes
- Malformed date: returns None; missing /Name: returns ""; missing /ByteRange: returns None

Closes: pdftract-6arz

2026-05-24 03:42:50 -04:00


pdftract-6arz: Signature metadata extraction (/V dict + ByteRange coverage)

Summary

Implemented Phase 7.3.2: Digital signature metadata extraction. The implementation resolves /V dictionaries for each discovered signature field, extracts signer identity and timestamps, computes coverage statistics from /ByteRange, and produces a structured Signature output.

Changes Made

Added Signature struct

field_name: Absolute field name from AcroForm
signer_name: From /Name entry (defaults to "")
signing_date: Option<ISO 8601 string> parsed from PDF /M date format
reason: Option from /Reason
location: Option from /Location
sub_filter: Option from /SubFilter (signature format)
byte_range: Option<Vec> defining signed byte ranges
coverage_fraction: Option computed as (br[1] + br[3]) / file_size
validation_status: Hard-coded "not_checked" per plan (v1 has no crypto validation)
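
For orientation, the field list above could map onto a Rust struct along the following lines. This is a sketch, not the code in signature/mod.rs: the concrete types (String, Vec<u64>, f64) and the unsigned() helper are assumptions for illustration.

```rust
/// Sketch of the signature record described above.
/// All field types are assumptions; the notes only name the fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Signature {
    /// Absolute field name from the AcroForm hierarchy.
    pub field_name: String,
    /// Decoded /Name entry; empty string when absent.
    pub signer_name: String,
    /// /M date converted to an ISO 8601 string.
    pub signing_date: Option<String>,
    /// Decoded /Reason entry.
    pub reason: Option<String>,
    /// Decoded /Location entry.
    pub location: Option<String>,
    /// /SubFilter name (signature format).
    pub sub_filter: Option<String>,
    /// Raw /ByteRange: [offset, length, offset, length].
    pub byte_range: Option<Vec<u64>>,
    /// (br[1] + br[3]) / file_size.
    pub coverage_fraction: Option<f64>,
    /// Always "not_checked" in v1 (no cryptographic validation).
    pub validation_status: String,
}

impl Signature {
    /// Hypothetical helper: the record emitted for a field with no /V entry.
    pub fn unsigned(field_name: &str) -> Self {
        Signature {
            field_name: field_name.to_string(),
            signer_name: String::new(),
            signing_date: None,
            reason: None,
            location: None,
            sub_filter: None,
            byte_range: None,
            coverage_fraction: None,
            validation_status: "not_checked".to_string(),
        }
    }
}
```

Keeping signer_name as a plain String while the other entries are Option mirrors the stated default of "" for a missing /Name.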

Added PDF date to ISO 8601 parser

parse_pdf_date() handles the PDF date format D:YYYYMMDDHHmmSSOHH'mm', where O is the UTC relation (+, -, or Z)
Tolerates truncated dates (date only, no time, no tz)
Outputs an RFC 3339 timestamp (a profile of ISO 8601), using "Z" for UTC
Defaults missing components: 00:00:00 for the time, Z (UTC) for the timezone
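
The parsing rules above can be sketched as follows. This is a minimal illustration, not the committed parse_pdf_date(): it assumes at least YYYYMMDD is present, that a numeric offset is carried through as +HH:MM or -HH:MM, and that components are only range-checked. The helper parse_offset() is hypothetical.

```rust
/// Convert a PDF date such as `D:20230115143045+05'30'` to an ISO 8601 string.
/// Returns `None` for malformed input.
pub fn parse_pdf_date(raw: &str) -> Option<String> {
    let s = raw.trim();
    let s = s.strip_prefix("D:").unwrap_or(s);

    // Length of the leading run of ASCII digits.
    let n = s.bytes().take_while(|b| b.is_ascii_digit()).count();
    // Assumption of this sketch: at least YYYYMMDD is required.
    if !matches!(n, 8 | 10 | 12 | 14) {
        return None;
    }
    // Two-digit component at byte offset `i`, or 0 when truncated away.
    let part = |i: usize| -> u32 {
        if i + 2 <= n { s[i..i + 2].parse().unwrap_or(0) } else { 0 }
    };
    let year: u32 = s[0..4].parse().ok()?;
    let (month, day) = (part(4), part(6));
    let (hour, minute, second) = (part(8), part(10), part(12));
    // Range checks only; days are not validated against the month length.
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }

    let tz = parse_offset(&s[n..])?;
    Some(format!(
        "{year:04}-{month:02}-{day:02}T{hour:02}:{minute:02}:{second:02}{tz}"
    ))
}

/// Hypothetical helper: parse the trailing `Z`, `+HH'mm'` or `-HH'mm'` part.
/// A missing timezone defaults to "Z".
fn parse_offset(rest: &str) -> Option<String> {
    if rest.is_empty() || rest == "Z" {
        return Some("Z".to_string());
    }
    let sign = match rest.as_bytes()[0] {
        b'+' => '+',
        b'-' => '-',
        _ => return None,
    };
    // Accept HH, HH'mm and HH'mm' by dropping the apostrophes.
    let digits: String = rest[1..].chars().filter(|c| *c != '\'').collect();
    let len_ok = digits.len() == 2 || digits.len() == 4;
    if !len_ok || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let hh: u32 = digits[0..2].parse().ok()?;
    let mm: u32 = if digits.len() == 4 { digits[2..4].parse().ok()? } else { 0 };
    if hh > 23 || mm > 59 {
        return None;
    }
    Some(format!("{sign}{hh:02}:{mm:02}"))
}
```

Returning Option<String> rather than an error type matches the acceptance criterion that a malformed date yields None.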

Added PDF string decoder

decode_pdf_string() handles UTF-16BE with a BOM, UTF-16BE without a BOM (detected heuristically), and PDFDocEncoding
Copied from outline.rs (where it is private) to avoid coupling the two modules
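
A rough sketch of the three decoding paths, under loud assumptions: the BOM-less heuristic shown here (every high byte is zero) is hypothetical and may differ from the one in outline.rs, and PDFDocEncoding is approximated by Latin-1, which is not exact for every byte value.

```rust
/// Decode the raw bytes of a PDF text string into a Rust `String`.
pub fn decode_pdf_string(bytes: &[u8]) -> String {
    // Path 1: UTF-16BE announced by the 0xFE 0xFF byte order mark.
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return decode_utf16be(&bytes[2..]);
    }
    // Path 2: UTF-16BE without a BOM, detected heuristically.
    if looks_like_utf16be(bytes) {
        return decode_utf16be(bytes);
    }
    // Path 3: PDFDocEncoding. Simplification for this sketch: map each byte
    // to the Unicode code point of the same value (Latin-1). The real
    // encoding differs for some bytes, e.g. in 0x18-0x1F and 0x80-0xA0.
    bytes.iter().map(|&b| b as char).collect()
}

/// Interpret byte pairs as big-endian UTF-16 code units.
/// A trailing odd byte is dropped; invalid surrogates become U+FFFD.
fn decode_utf16be(bytes: &[u8]) -> String {
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16_lossy(&units)
}

/// Hypothetical heuristic: even length, and every pair has a zero high byte
/// and a non-zero low byte, as basic Latin text in UTF-16BE would.
fn looks_like_utf16be(bytes: &[u8]) -> bool {
    bytes.len() >= 2
        && bytes.len() % 2 == 0
        && bytes.chunks_exact(2).all(|p| p[0] == 0 && p[1] != 0)
}
```

The decoder never fails: undecodable input degrades to replacement characters, which fits the v1 behaviour of returning values rather than surfacing diagnostics.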

Added metadata extraction functions

extract_signature_metadata(): Extracts all fields from a single signature's /V dict
extract_signatures(): Public API for processing all discovered signature fields

Test coverage (27 tests, all PASS)

Discovery tests (9 tests from 7.3.1)

All existing discovery tests continue to pass

Metadata extraction tests (5 new tests)

test_extract_signature_metadata_full: Full signature with all fields
test_extract_signature_metadata_unsigned: Unsigned field (no /V)
test_extract_signature_metadata_missing_optional_fields: Minimal signature
test_extract_signatures_multiple: Two signatures with different /V dicts
test_walk_acroform_fields_reusable: Verifies walker returns all field types

PDF date parsing tests (7 new tests)

test_parse_pdf_date_full_with_timezone: D:20230115143045+05'30'
test_parse_pdf_date_utc: D:20230115143045Z
test_parse_pdf_date_negative_timezone: D:20230115143045-08'00'
test_parse_pdf_date_only: D:20230115 (date only, defaults to 00:00:00Z)
test_parse_pdf_date_no_timezone: D:20230115143045 (no tz, defaults to Z)
test_parse_pdf_date_without_d_prefix: 20230115143045Z
test_parse_pdf_date_malformed: Various malformed inputs return None

ByteRange coverage tests (4 new tests)

test_coverage_fraction_full_coverage: 4000/4000 bytes = 1.0
test_coverage_fraction_partial: 1500/3000 bytes = 0.5
test_coverage_fraction_no_file_size: None when file_size unknown
test_coverage_fraction_invalid_byte_range: None when /ByteRange malformed
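
The computation these tests exercise can be sketched as below. The function name and signature are assumptions for illustration (a free function taking the raw /ByteRange values and an optional file size); only the formula (br[1] + br[3]) / file_size comes from the plan.

```rust
/// Fraction of the file covered by the two signed ranges.
/// `byte_range` is expected to be [offset1, length1, offset2, length2].
/// Returns `None` when the file size is unknown or zero, or when the
/// array does not have exactly four entries.
pub fn coverage_fraction(byte_range: &[u64], file_size: Option<u64>) -> Option<f64> {
    let size = file_size.filter(|&s| s > 0)?;
    if byte_range.len() != 4 {
        return None;
    }
    // br[1] and br[3] are the lengths; guard against overflow on hostile input.
    let covered = byte_range[1].checked_add(byte_range[3])?;
    Some(covered as f64 / size as f64)
}
```

For example, a /ByteRange of [0, 1000, 2500, 500] in a 3000-byte file gives 1500 / 3000 = 0.5. The sketch does not check that the ranges fit inside the file, so a malformed array can produce a value above 1.0.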

PDF string decoding tests (3 new tests)

test_decode_pdf_string_ascii: ASCII string
test_decode_pdf_string_utf16be_bom: UTF-16BE with BOM
test_decode_pdf_string_empty: Empty string

Acceptance Criteria Status

✅ Critical test: PDF with two signature fields - both extracted with correct signer names and dates
✅ Critical test: unsigned signature field - emitted with value: null (modeled as unsigned Signature with empty fields)
✅ Critical test: /ByteRange coverage fraction computed correctly
✅ Unit tests: malformed date string (returns None), missing /Name (returns ""), missing /ByteRange (returns None coverage)
✅ Output: Signature struct with all required fields

Implementation Notes

Unsigned signature handling: When /V is absent, we return a Signature with:
- signer_name: ""
- signing_date: None
- reason: None
- location: None
- sub_filter: None
- byte_range: None
- coverage_fraction: None
- validation_status: "not_checked"
Date parsing: The PDF date format is complex and may include:
- Literal "D:" prefix
- Truncated values (date only, date+time only)
- Timezone as "Z", "+HH'mm'", "-HH'mm'", or omitted
- Our parser handles all these cases and outputs clean ISO 8601
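Coverage computation: Per plan, coverage is (br[1] + br[3]) / file_size
- br[0] and br[2] are offsets, br[1] and br[3] are lengths
- The signature value itself is NOT covered (it sits in the gap between the two ranges), so a signature over the whole file still yields a fraction slightly below 1.0
- Values well below that mean some bytes fall outside the signed ranges, typically data appended after signing (a red flag for possibly tampered docs)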
String decoding: /Name, /Reason, and /Location are PDF strings that may use:
- PDFDocEncoding (Latin-1 with overrides)
- UTF-16BE with BOM (0xFE 0xFF)
- UTF-16BE without BOM (heuristic detection)
- Our decoder handles all three cases
SubFilter is a Name: Unlike the string fields above, /SubFilter is a PDF Name object
- Read via as_name() instead of as_string()
- No text-string decoding needed (the standard SubFilter values, such as adbe.pkcs7.detached, are plain ASCII)

Known Limitations

page_index: Still None (deferred from 7.3.1). Requires reverse lookup through page /Annots arrays.
value field: The actual signature value (PKCS#7 DER blob) is not extracted in v1.
- This would require resolving the /Contents entry and decoding the signature
- Deferred to future work when cryptographic validation is implemented
diagnostics not surfaced: Extraction failures (malformed /V, unresolvable references) return default/empty values rather than surfacing diagnostics. This is acceptable for v1 but may need improvement for production use.

Git Commit

Commit: TBD
Message: feat(pdftract-6arz): implement signature metadata extraction
Files changed: crates/pdftract-core/src/signature/mod.rs (+835 lines)

Next Steps

pdftract-j6yd (7.3.3): signatures array output + validation_status enum + schema integration
Future: Cryptographic validation (ring/openssl integration)
