Implement TH-07 security test validating that the PDF password ingress
channels do not disclose the password via the process argument list.
Test cases:
- --password VALUE rejected with exit 64 without opt-in
- --password VALUE with PDFTRACT_INSECURE_CLI_PASSWORD=1 proceeds with warning
- --password-stdin works correctly
- PDFTRACT_PASSWORD env var works correctly
- Password leaks in /proc/<pid>/cmdline under opt-in (proving the vulnerability)
- Password does NOT leak with --password-stdin or env var
Closes: pdftract-43jxa
Add comprehensive security test suite for TH-03 (plan line 874) verifying
MCP server requires authentication on non-loopback binds.
Test coverage:
- IPv4/IPv6 all-addresses bind requires token (exit 78)
- Loopback addresses (127.0.0.1, ::1, localhost) exempt from auth
- Token auth via PDFTRACT_MCP_TOKEN env var and --auth-token-file
- Atomic failure verification (no listener during failure window)
- Exit code specificity (EX_CONFIG=78, not just any non-zero)
- Parallel bind attempts all fail securely
File: crates/pdftract-core/tests/TH-03-mcp-no-auth.rs (529 lines, 11 tests)
Verification note: notes/pdftract-5m3hp.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement Tf, Td, TD, Tm, T* operators for Phase 3.1 text state.
- Add TSTAR_ZERO_LEADING, FONT_RESOURCE_NOT_FOUND, FONT_SIZE_ZERO_OR_NEGATIVE diagnostics
- Add move_text, move_text_set_leading, set_text_matrix, next_line, set_font methods to GraphicsState
- Refactor execute_with_do to use gstate.text_matrix instead of local TextMatrix
- Implement Tf with ResourceStack font resolution and size clamping
- Implement Td/TD/Tm/T* operators with correct matrix semantics
- Add acceptance criteria tests for all operators
Per PDF spec:
- Td: text_line_matrix = translate(tx, ty) * text_line_matrix
- TD: same as Td, plus sets leading = -ty
- Tm: overwrites both text_matrix and text_line_matrix (does not accumulate)
- T*: equivalent to Td 0 -leading
- Tf: resolves font name from ResourceStack, clamps size <= 0 to 1.0
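The matrix rules above can be sketched as follows. This is a minimal illustration, not the code in this commit: `TextState`, `mul`, and `set_font_size` are hypothetical names, and font lookup through ResourceStack is left out.

```rust
/// A PDF affine matrix [a b c d e f] in the row-vector convention.
type Matrix = [f64; 6];

const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Returns `m x n`: apply `m` first, then `n`.
fn mul(m: Matrix, n: Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

/// Hypothetical stand-in for the text fields of GraphicsState.
#[derive(Debug, Clone, PartialEq)]
struct TextState {
    text_matrix: Matrix,
    text_line_matrix: Matrix,
    leading: f64,
    font_size: f64,
}

impl TextState {
    fn new() -> Self {
        Self { text_matrix: IDENTITY, text_line_matrix: IDENTITY, leading: 0.0, font_size: 1.0 }
    }

    /// Td: translate the line matrix, then reset the text matrix to it.
    fn move_text(&mut self, tx: f64, ty: f64) {
        self.text_line_matrix = mul([1.0, 0.0, 0.0, 1.0, tx, ty], self.text_line_matrix);
        self.text_matrix = self.text_line_matrix;
    }

    /// TD: same as Td, but also sets leading to -ty.
    fn move_text_set_leading(&mut self, tx: f64, ty: f64) {
        self.leading = -ty;
        self.move_text(tx, ty);
    }

    /// Tm: overwrite both matrices; nothing accumulates.
    fn set_text_matrix(&mut self, m: Matrix) {
        self.text_matrix = m;
        self.text_line_matrix = m;
    }

    /// T*: equivalent to `0 -leading Td`.
    fn next_line(&mut self) {
        self.move_text(0.0, -self.leading);
    }

    /// Size half of Tf: clamp sizes <= 0 to 1.0.
    fn set_font_size(&mut self, size: f64) {
        self.font_size = if size <= 0.0 { 1.0 } else { size };
    }
}
```

Calling `move_text_set_leading(10.0, -12.0)` followed by `next_line()` moves the line origin from (10, -12) to (10, -24), which is the behaviour the T* rule describes.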
Closes: pdftract-4x0y
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that
derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via
the schemars crate.
Changes:
- Add stable key sorting (sort_keys_recursive) for byte-identical output
- Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json
- Set title to "pdftract Output v1.0"
- Add cargo alias `gen-schema` for convenient invocation
- Emit schema to docs/schema/v1.0/pdftract.schema.json
The schema is generated from the Rust types with schemars derives, ensuring
the JSON schema is always in sync with the source types.
Acceptance criteria:
- cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json
- Generated schema validates against JSON Schema Draft 2020-12
- Schema $id is the stable URL
- Title is "pdftract Output v1.0"
- Stable ordering: regenerating twice produces byte-identical output
- All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.)
Note: page_type and confidence_source enums are not yet implemented in the
Rust types (marked as TODO in schema/mod.rs). These will be added by sibling
beads pdftract-1ob and pdftract-1f8we respectively.
Closes: pdftract-5nv9h
Implement repair_hyphenation() that detects and repairs end-of-line
hyphenation within blocks. Joins hyphenated words across line breaks
when the hyphen is at the column right edge and the continuation
starts with a lowercase letter.
Key features:
- Detects hyphens: -, ‐ (U+2010), ‑ (U+2011), soft hyphen (U+00AD)
- Right-edge detection: span bbox.x1 within 5% of column width
- Lowercase continuation check to avoid joining sentences
- Column-aware: only joins spans in same column
- Cleans up empty spans/lines after repair
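A rough sketch of the join decision described above. The function names and the flat argument list are illustrative only (the real code works on the HasBBox and HyphenableSpan traits), and "within 5% of column width" is read here as "x1 lies within 0.05 * column width of the column's right edge", which is an assumption.

```rust
/// Hyphen characters treated as end-of-line hyphens.
const HYPHENS: [char; 4] = ['-', '\u{2010}', '\u{2011}', '\u{00AD}'];

/// Decides whether the last span of a line should be joined with the
/// first span of the next line in the same column.
fn should_join(prev: &str, prev_x1: f64, col_x0: f64, col_x1: f64, next: &str) -> bool {
    let ends_with_hyphen = prev.chars().last().map_or(false, |c| HYPHENS.contains(&c));
    // Assumed reading of the 5% rule: tolerance relative to column width.
    let tolerance = 0.05 * (col_x1 - col_x0);
    let at_right_edge = (col_x1 - prev_x1).abs() <= tolerance;
    // A lowercase continuation suggests a split word, not a new sentence.
    let lowercase_next = next.chars().next().map_or(false, |c| c.is_lowercase());
    ends_with_hyphen && at_right_edge && lowercase_next
}

/// Drops the trailing hyphen from `prev` and appends `next`.
fn join_hyphenated(prev: &str, next: &str) -> String {
    let mut joined = prev.to_string();
    joined.pop();
    joined.push_str(next);
    joined
}
```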
Adds HasBBox and HyphenableSpan traits for flexible span types.
Includes 9 comprehensive tests covering all acceptance criteria.
Fixes pre-existing test cases in schema module (missing column field).
Closes: pdftract-5o6hx
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implemented Phase 7.6.3: extract non-link annotations with subtype-specific
fields including:
- TextMarkup (Highlight/Squiggly/StrikeOut/Underline) with /QuadPoints
- Stamp with /Name icon
- FreeText with /DA default appearance
- Text (sticky notes) with /Open, /State, /StateModel
- Ink with /InkList stroke paths
- Line with /L endpoints
- Polygon/PolyLine with /Vertices
- FileAttachment with /FS filespec reference
- Other (Circle, Square, Caret, Redact, etc.) with no extra fields
Added AnnotationSpecific enum to capture subtype-specific extras while
preserving the stable AnnotationCommon struct. Unknown subtypes are emitted
as Other without diagnostics (future: emit unhandled_annotation_subtype).
Comprehensive unit tests for all subtypes including edge cases.
Fixed pre-existing borrow issue in content_stream.rs.
Closes: pdftract-3r77
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the creation of pdftract-sdk-node-publish.yaml, npm-token ExternalSecret,
and the cascade enablement. WARN: npm token and SDK repo must be created before
first publish run.
Bead: pdftract-62x5c
- Add TAGGED_PDF_STRUCT_TREE_DEFERRED diagnostic emission for tagged PDFs
- Set reading_order_algorithm to xy_cut for all PDFs in v0.1.0-v0.3.0
- Diagnostic emitted once per document (not per page)
- Add tests for tagged and untagged PDF behavior
- Phase 7.1 will replace with real StructTree traversal
Closes: pdftract-5tvv1
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add Phase 4.7 BrokenVector escalation: when a page classified as Vector
has readability score < 0.5, escalate to BrokenVector and route to Phase 5.5 OCR.
Changes:
- Add PageClass::can_escalate_to_broken_vector() method
- Add apply_broken_vector_escalation() function with cfg(ocr) gating
- Add 13 comprehensive tests covering all escalation scenarios
Closes: pdftract-5v1l9
Implement the q (push) and Q (pop) operators driving a Vec<GraphicsState>
save stack with the PDF spec's 64-level depth limit.
Changes:
- Changed MAX_GSTATE_DEPTH from 32 to 64 per PDF spec section 8.4
- Added gstate_overflow_logged flag to emit overflow diagnostic only once per page
- Q at depth 0 is a no-op that emits GSTATE_STACK_UNDERFLOW diagnostic
Acceptance criteria (all PASS):
- 64 nested q calls succeed; 65th emits diagnostic
- 64 q + 64 Q restores to initial state
- Q at depth 0 is a no-op (no panic)
- 1000 paired q...Q operations succeed (depth never exceeds 1)
- Diagnostic emitted exactly once per page even after multiple overflows
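The save-stack behaviour can be sketched like this. `Interpreter`, the single-field `GraphicsState`, and the overflow diagnostic name `GSTATE_STACK_OVERFLOW` are placeholders (the message above only names the underflow code), and dropping the 65th push is an assumption about how overflow is handled.

```rust
const MAX_GSTATE_DEPTH: usize = 64;

/// Placeholder graphics state; the real struct carries many more fields.
#[derive(Debug, Clone, PartialEq, Default)]
struct GraphicsState {
    line_width: f64,
}

#[derive(Default)]
struct Interpreter {
    current: GraphicsState,
    stack: Vec<GraphicsState>,
    gstate_overflow_logged: bool,
    diagnostics: Vec<&'static str>,
}

impl Interpreter {
    /// q: save the current state unless the depth limit is reached.
    fn push(&mut self) {
        if self.stack.len() >= MAX_GSTATE_DEPTH {
            // Report overflow once per page, then ignore further pushes.
            if !self.gstate_overflow_logged {
                self.diagnostics.push("GSTATE_STACK_OVERFLOW");
                self.gstate_overflow_logged = true;
            }
            return;
        }
        self.stack.push(self.current.clone());
    }

    /// Q: restore the saved state; at depth 0 this is a no-op plus a diagnostic.
    fn pop(&mut self) {
        match self.stack.pop() {
            Some(saved) => self.current = saved,
            None => self.diagnostics.push("GSTATE_STACK_UNDERFLOW"),
        }
    }
}
```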
Closes: pdftract-1os1
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 7.6.2: Enhanced link annotation extraction for URI hyperlinks and
internal destination links. Added support for explicit destination arrays,
named destination resolution via /Catalog /Dests and /Catalog /Names /Dests
name trees, JavaScript action diagnostics, and link-without-target handling.
Key changes:
- Added FitType enum with all 8 PDF fit types (XYZ, Fit, FitH, FitV, FitR, FitB, FitBH, FitBV)
- Added DestArray struct for explicit destinations with page_index and fit fields
- Enhanced LinkAnnotation with dest_array field for explicit destinations
- Implemented name tree walking for /Catalog /Names /Dests resolution
- Added JavaScript action handling with diagnostic truncation (>100 chars)
- Added link-without-target diagnostic when /A and /Dest are both absent
- Updated dispatch_annotations signature to pass dests_dict and names_dests_ref
Acceptance criteria:
- Critical test: 5 URI hyperlinks appear in document links (link annotation emitted)
- Critical test: Named destination /Dest /SectionTwo -> dest: "SectionTwo"
- Unit tests: Explicit /Dest array (XYZ fit), /Dest as string-name, /JavaScript action
- Unit tests: Missing target diagnostic, all FitType variants
- Public Link { uri, dest, dest_array, page_index, rect } emitted per link
- /Dest resolution falls back gracefully when unresolved
Closes: pdftract-4zcj
- Add ResourceStack for nested resource scope management
- Add ExecutionContext for cycle/depth detection in form XObject recursion
- Add execute_with_do() function with full graphics state support (q/Q/cm/Do)
- Add ImageXObject type for recording encountered images
- Add comprehensive tests for ResourceStack, ExecutionContext, and Do operator
Per Phase 3.3 (plan.md:1579-1593):
- Form XObject lookup via ResourceStack
- /Matrix application to CTM
- Cycle detection (STRUCT_XOBJECT_CYCLE)
- Depth limiting (STRUCT_DEPTH_EXCEEDED, max 20)
- Image XObject recording without glyph production
Acceptance criteria:
- ResourceStack shadowing: form resources shadow parent resources
- Cycle detection: duplicate XObject ID triggers STRUCT_XOBJECT_CYCLE
- Depth limit: 20-level max, triggers STRUCT_DEPTH_EXCEEDED
- Image XObjects: recorded with CTM-transformed bbox, no glyphs
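One possible shape for the cycle and depth checks, assuming XObjects are identified by a numeric object ID. `EnterError`, `enter`, and `leave` are illustrative names rather than the actual ExecutionContext API.

```rust
use std::collections::HashSet;

const MAX_FORM_DEPTH: usize = 20;

#[derive(Debug, PartialEq)]
enum EnterError {
    /// Would map to STRUCT_XOBJECT_CYCLE.
    Cycle,
    /// Would map to STRUCT_DEPTH_EXCEEDED.
    DepthExceeded,
}

#[derive(Default)]
struct ExecutionContext {
    active: Vec<u32>,
    on_stack: HashSet<u32>,
}

impl ExecutionContext {
    /// Called before executing a form XObject's content stream.
    fn enter(&mut self, xobject_id: u32) -> Result<(), EnterError> {
        if self.on_stack.contains(&xobject_id) {
            return Err(EnterError::Cycle);
        }
        if self.active.len() >= MAX_FORM_DEPTH {
            return Err(EnterError::DepthExceeded);
        }
        self.active.push(xobject_id);
        self.on_stack.insert(xobject_id);
        Ok(())
    }

    /// Called after the form XObject finishes executing.
    fn leave(&mut self) {
        if let Some(id) = self.active.pop() {
            self.on_stack.remove(&id);
        }
    }
}
```

Tracking only the IDs currently on the stack means the same form can be drawn twice in sequence without being flagged as a cycle.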
Closes: pdftract-62uon
Implement Phase 7.6.1: Per-page /Annots walker + subtype dispatch.
Creates the annotation module with:
- AnnotationCommon struct with shared fields (subtype, rect, contents,
  author, modified date, color, opacity, flags, name_id, subject)
- dispatch_annotations() function that walks /Annots arrays and
  dispatches by /Subtype:
  - /Link → link extractor (7.6.2 placeholder)
  - /Widget → skipped (handled by forms 7.4)
  - /Popup → skipped (companion subtype)
  - Others → annotation extractor (7.6.3 placeholder)
- PDF date parser (D:YYYYMMDDHHmmSSOHH'mm' → ISO 8601)
- Dereference loop detection via visited set
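The D:YYYYMMDDHHmmSSOHH'mm' conversion could look roughly like the sketch below. `pdf_date_to_iso8601` is a hypothetical name; defaulting missing fields to 01 or 00 and treating a missing timezone as UTC are assumptions made for this illustration.

```rust
/// Converts a PDF date string to ISO 8601, or returns None if malformed.
fn pdf_date_to_iso8601(raw: &str) -> Option<String> {
    let s = raw.strip_prefix("D:").unwrap_or(raw);
    // Leading digits hold the date and time; the remainder is the timezone.
    let digits_end = s.find(|c: char| !c.is_ascii_digit()).unwrap_or(s.len());
    let (digits, tz) = s.split_at(digits_end);
    if digits.len() < 4 || digits.len() > 14 || digits.len() % 2 != 0 {
        return None;
    }
    // Fields after the year are optional and fall back to a default.
    let field = |start: usize, len: usize, default: u32| -> Option<u32> {
        match digits.get(start..start + len) {
            Some(part) => part.parse().ok(),
            None => Some(default),
        }
    };
    let (year, month, day) = (field(0, 4, 0)?, field(4, 2, 1)?, field(6, 2, 1)?);
    let (hour, minute, second) = (field(8, 2, 0)?, field(10, 2, 0)?, field(12, 2, 0)?);
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    if hour > 23 || minute > 59 || second > 59 {
        return None;
    }
    let offset = match tz.chars().next() {
        // Assumption: no timezone suffix is reported as UTC.
        None | Some('Z') => "Z".to_string(),
        Some(sign @ ('+' | '-')) => {
            let rest: String = tz[1..].chars().filter(|c| c.is_ascii_digit()).collect();
            let hh = rest.get(0..2)?;
            let mm = rest.get(2..4).unwrap_or("00");
            format!("{sign}{hh}:{mm}")
        }
        Some(_) => return None,
    };
    Some(format!(
        "{year:04}-{month:02}-{day:02}T{hour:02}:{minute:02}:{second:02}{offset}"
    ))
}
```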
Acceptance criteria PASS:
- Unit tests for mixed annotation subtypes
- AnnotationCommon decoding for all non-skipped annotations
- Date parsing with ISO 8601 output
- Empty /Annots handling without diagnostics
- Public API returns (Vec<LinkAnnotation>, Vec<Annotation>)
Closes: pdftract-46qa
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add 9 built-in classification profile definitions as YAML files bundled
via include_str! for the document type classifier (Phase 5.6).
- Create profiles/builtin/classification/{invoice,receipt,contract,scientific_paper,slide_deck,form,bank_statement,legal_filing,book_chapter}.yaml
- Implement load_builtins() in profiles module with profiles feature gate
- Each profile uses MatchPredicate schema with text patterns, structural signals, page counts
- Add comprehensive unit tests for profile loading and feature gate
Closes: pdftract-5sdd
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 7.4.5 implementation: Wire combined Vec<(String, FormFieldValue)> from
combiner into document-level /form_fields JSON output with tagged union schema.
- Add FormFieldJson, FormFieldTypeJson, FormFieldValueJson, ChoiceValueJson to schema
- Add form_fields: Vec<FormFieldJson> to ExtractionResult (always emitted, empty when none)
- Implement acro_field_to_value() converter for Phase 7.4.2 type-specific extraction
- Wire form field extraction in extract_pdf(): walk AcroForm, extract XFA, combine with XFA-wins
- Add convert_form_field_to_json() helper for FormFieldValue → FormFieldJson conversion
- Update docs/schema/v1.0/pdftract.schema.json with form_fields $defs and required field
- Add form_fields_to_markdown() to markdown module for Form Fields footer table
Schema shape: /form_fields is array of {name, type, value, default?, page_index?, rect?,
required, read_only, multiline?, max_length?, options?, multi_select?, selected?,
state_name?, pushbutton?, radio?}. Type field is tagged enum: "text", "button", "choice",
"signature". Value field varies by type (string|boolean|string|array|uint|null).
Closes: pdftract-5qca
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements 7.5.2: Filespec dictionary and EF stream decoder for PDF
embedded file attachments. Extracts filename (/UF preferred over /F),
description, MIME type, size, dates, and MD5 checksum from Filespec
dictionaries and decodes the embedded stream data.
Key additions:
- AttachmentBuilder struct with all attachment metadata fields
- extract_one() function for resolving Filespec and decoding EF stream
- PDF string decoding (UTF-16BE BOM, UTF-16BE without BOM, PDFDocEncoding)
- PDF date to ISO 8601 parsing (reused from signature module)
- 50 MB size limit enforcement with truncation flag
- Support for all Phase 1 stream filters (FlateDecode, LZWDecode, etc.)
Closes: pdftract-3lir
All 4 child beads verified closed (pdftract-1w5u1, pdftract-4q8cq, pdftract-4sky1, pdftract-653ah).
Doctor subcommand fully functional with:
- Module structure: checks/, output/ submodules
- Exit code policy: 0 for OK/WARN, 1 for FAIL
- JSON output via --json flag
- Features listing via --features flag
- catch_unwind protection for all checks
- Runbook integration at docs/operations/manual-platform-smoke.md
- 12 unit tests passing
Closes: pdftract-2um5s
- Created docs/operations/manual-platform-smoke.md with comprehensive
smoke test runbook for KU-12 quarterly manual platform testing
- Added troubleshooting table covering all 14 doctor checks
- Cross-referenced runbook from installation.md and quickstart.md
- Added CI gate test (doctor_runbook_coverage.rs) to verify
troubleshooting table completeness
Acceptance criteria:
✓ Step 1: pdftract doctor as first section in runbook
✓ Troubleshooting table covers all FAIL-capable checks
✓ installation.md mentions pdftract doctor with runbook link
✓ quickstart.md uses pdftract doctor as first example command
✓ CI gate parses runbook and asserts all checks are present
✓ mdBook build succeeds
✓ No broken internal links
Closes: pdftract-653ah
- Add STREAM_INVALID_CCITT diagnostic code for missing/invalid /Columns
- Modify CCITTFaxDecoder to use default /Columns (1728) when missing
- Emit STREAM_INVALID_CCITT diagnostic when /Columns is missing
- Emit OCR_CCITT_UNSUPPORTED diagnostic when full-render and libtiff unavailable
- Add unit tests for CCITT decoder parameter parsing and passthrough
Acceptance criteria:
- CCITT stream with full-render + libtiff → pass-through, no diagnostic
- CCITT stream WITHOUT full-render → OCR_CCITT_UNSUPPORTED diagnostic
- /K=-1 /Columns=2480 /BlackIs1=true → all 3 params recorded on ParsedCCITTParams
- Missing /Columns → STREAM_INVALID_CCITT diagnostic + default width 1728
- Round-trip test with CCITT fixture data
Closes: pdftract-66ykq
Add docs/integrations/mcp-clients.md with copy-paste-ready configuration
snippets for Claude Desktop, Cursor, Continue, and a custom SDK template.
Each section includes:
- Per-OS config file locations
- JSON/YAML snippets
- Validation steps
- Minimum client version verified
Also includes:
- Multi-client HTTP mode setup
- TH-03 compliance note (auth required for public binds)
- Troubleshooting for common failure modes
- Cross-references to sdk-invocation.md, KU-5, OQ-07
Closes: pdftract-3om3
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Update CODE_OF_CONDUCT.md to official Contributor Covenant v2.1 text
- Change enforcement contact from security@jedarden.com to community@jedarden.com
- Add links to CODE_OF_CONDUCT.md from all issue templates
- Add Code of Conduct link to README Contributing section
Satisfies GitHub Community Standards requirement for CODE_OF_CONDUCT.md
linked from issue templates and README.
Refs: pdftract-4618
Signed-off-by: jedarden <github@jedarden.com>
Implements Phase 6.6.2 atomic file write infrastructure with temp-file-and-rename
pattern. File-backed outputs now write to a temporary file and only rename to the
target path on successful commit. If the writer is dropped without committing, the
temporary file is automatically removed.
Key changes:
- New AtomicFileWriter module with temp file generation (pid + random suffix)
- CLI extract command gains --output option (default: "-" for stdout)
- All formats (json, text, markdown) write through AtomicFileWriter
- Drop safety: temp files cleaned up on panic or early return
- Unit tests verify commit, drop cleanup, and concurrent write scenarios
Acceptance criteria:
- ✓ Critical test: panic mid-extraction → no partial output files
- ✓ Successful extraction: temp file renamed to target
- ✓ Concurrent extractions: no collision (random suffix)
- ✓ Drop cleanup: orphaned temp files removed
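The temp-file-and-rename pattern can be sketched with the standard library alone. This is an illustration, not the module added here: the struct layout is assumed, and the sub-second clock value stands in for the random suffix to avoid pulling in a crate.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::time::{SystemTime, UNIX_EPOCH};

struct AtomicFileWriter {
    file: Option<File>,
    temp_path: PathBuf,
    target_path: PathBuf,
    committed: bool,
}

impl AtomicFileWriter {
    /// Creates `<target>.<pid>.<suffix>.tmp` next to the target path.
    fn create(target: &Path) -> io::Result<Self> {
        // Stand-in for a random suffix (assumption for this sketch).
        let suffix = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.subsec_nanos())
            .unwrap_or(0);
        let name = target.file_name().ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "target has no file name")
        })?;
        let mut temp_name = name.to_os_string();
        temp_name.push(format!(".{}.{:08x}.tmp", std::process::id(), suffix));
        let temp_path = target.with_file_name(temp_name);
        let file = File::create(&temp_path)?;
        Ok(Self { file: Some(file), temp_path, target_path: target.to_path_buf(), committed: false })
    }

    /// Flushes to disk and renames the temp file onto the target.
    fn commit(mut self) -> io::Result<()> {
        if let Some(file) = self.file.take() {
            file.sync_all()?;
        }
        fs::rename(&self.temp_path, &self.target_path)?;
        self.committed = true;
        Ok(())
    }
}

impl Write for AtomicFileWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        match self.file.as_mut() {
            Some(f) => f.write(buf),
            None => Err(io::Error::new(io::ErrorKind::Other, "writer already closed")),
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        match self.file.as_mut() {
            Some(f) => f.flush(),
            None => Ok(()),
        }
    }
}

impl Drop for AtomicFileWriter {
    /// Removes the temp file if the writer was never committed,
    /// including when unwinding from a panic.
    fn drop(&mut self) {
        if !self.committed {
            drop(self.file.take());
            let _ = fs::remove_file(&self.temp_path);
        }
    }
}
```

Because `commit` consumes the writer, any early return or panic before it runs goes through `Drop` with `committed == false`, so the target path never sees partial output.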
Closes: pdftract-68wfa
Adds curved arrows between consecutive blocks in reading order with
numeric labels. Each arrow is a quadratic bezier curve whose control
point sits 10pt below the segment midpoint. Output is limited to 50
arrows to prevent visual clutter.
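For illustration, the arrow geometry might be produced like this. The function names, the `reading-order` class, and the use of block centers as endpoints are assumptions; SVG's y axis points down, so "10pt downward" is modelled as y + 10.

```rust
const MAX_ARROWS: usize = 50;

/// Builds one arrow: a quadratic bezier path plus a numeric label.
fn reading_order_arrow(from: (f64, f64), to: (f64, f64), label: usize) -> String {
    let (mx, my) = ((from.0 + to.0) / 2.0, (from.1 + to.1) / 2.0);
    // Control point: midpoint shifted 10pt down in SVG coordinates.
    let (cx, cy) = (mx, my + 10.0);
    let path = format!(
        "<path class=\"reading-order\" fill=\"none\" d=\"M {:.1} {:.1} Q {:.1} {:.1} {:.1} {:.1}\"/>",
        from.0, from.1, cx, cy, to.0, to.1
    );
    let text = format!(
        "<text class=\"reading-order\" x=\"{:.1}\" y=\"{:.1}\">{}</text>",
        mx, my, label
    );
    format!("{path}{text}")
}

/// Emits arrows between consecutive points, capped at MAX_ARROWS.
fn render_arrows(centers: &[(f64, f64)]) -> Vec<String> {
    centers
        .windows(2)
        .take(MAX_ARROWS)
        .enumerate()
        .map(|(i, pair)| reading_order_arrow(pair[0], pair[1], i + 1))
        .collect()
}
```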
- Add render_reading_order function returning SVG path and text elements
- Include data-* attributes for tooltip consumption
- Add comprehensive unit tests (10/10 passing)
- Export reading_order module from inspect/render/mod.rs
Acceptance criteria:
- Helper compiles and produces valid SVG output ✅
- Layer is independently toggleable via CSS class ✅
- data-* attrs populated ✅
- Unit tests pass ✅
Closes: pdftract-6559n
Implement the DCTDecode (JPEG) passthrough filter with marker validation
and /ColorTransform metadata parsing.
Changes:
- Add StreamInvalidJpeg diagnostic code for missing SOI/EOI markers
- Implement DCTDecoder struct with:
  - SOI (0xFFD8) marker validation
  - EOI (0xFFD9) marker validation
  - /ColorTransform parameter parsing
  - Raw byte passthrough with bomb limit enforcement
- Replace PassthroughDecoder with DCTDecoder in get_decoder()
- Add comprehensive test coverage (6 test cases)
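The marker check itself is small. A sketch under assumptions: `JpegMarkerReport` and `check_jpeg_markers` are hypothetical names, and the bomb limit and /ColorTransform parsing are omitted.

```rust
/// Result of inspecting the first and last two bytes of a DCT stream.
#[derive(Debug, PartialEq)]
struct JpegMarkerReport {
    has_soi: bool,
    has_eoi: bool,
}

/// Checks for SOI (0xFFD8) at the start and EOI (0xFFD9) at the end.
/// The data itself is passed through unchanged either way.
fn check_jpeg_markers(data: &[u8]) -> JpegMarkerReport {
    JpegMarkerReport {
        has_soi: data.starts_with(&[0xFF, 0xD8]),
        has_eoi: data.ends_with(&[0xFF, 0xD9]),
    }
}
```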
The decoder validates JPEG markers but passes through data even when
markers are missing (INV-8 error recovery). Diagnostics are emitted
for missing markers but currently dropped due to trait limitations
(future enhancement will add diagnostics buffer to StreamDecoder).
Closes: pdftract-66dd8
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add button field value extraction distinguishing pushbutton, checkbox,
and radio button types via /Ff flags. Extracts selected state and
appearance state name (/Yes, /Off, custom).
- New module: forms/value_button.rs with ButtonKind enum and ButtonValue
- Updated FormFieldValue::Button variant with kind and state_name fields
- 15 unit tests covering all button types and edge cases
- Fixed CCITTFaxDecoder test syntax blocking test execution
Closes: pdftract-66pgk
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add render_confidence_heatmap() function that creates per-glyph
translucent colored cells representing extraction confidence.
Color coding:
- Red (#ef4444): confidence < 0.5 (low)
- Yellow (#eab308): 0.5 <= confidence < 0.8 (medium)
- Green (#22c55e): confidence >= 0.8 (high)
- Gray (#94a3b8): no confidence value (direct extraction)
Each cell includes data-* attributes (data-char, data-confidence,
data-span-index) for tooltip consumption by the frontend inspector
(Phase 7.9.6).
Implementation approximates per-glyph positions using span bbox
and character count, since the JSON schema only has span-level
confidence.
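A sketch of the colour thresholds and the even-width glyph approximation described above. `confidence_color` and `glyph_cells` are illustrative names, and mapping NaN to gray is an assumption not stated in this message.

```rust
/// Maps a span confidence to the heatmap fill colour.
fn confidence_color(confidence: Option<f64>) -> &'static str {
    match confidence {
        None => "#94a3b8",
        // Assumption: a NaN confidence is treated like a missing value.
        Some(c) if c.is_nan() => "#94a3b8",
        Some(c) if c < 0.5 => "#ef4444",
        Some(c) if c < 0.8 => "#eab308",
        Some(_) => "#22c55e",
    }
}

/// Splits a span's horizontal extent evenly across its characters,
/// returning (char, cell_x0, cell_x1) for each glyph.
fn glyph_cells(x0: f64, x1: f64, text: &str) -> Vec<(char, f64, f64)> {
    let count = text.chars().count();
    if count == 0 {
        return Vec::new();
    }
    let width = (x1 - x0) / count as f64;
    text.chars()
        .enumerate()
        .map(|(i, ch)| {
            let left = x0 + width * i as f64;
            (ch, left, left + width)
        })
        .collect()
}
```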
All unit tests pass. CSS class "heatmap-cell" enables frontend
toggling (Phase 7.9.3).
Closes: pdftract-67p2c
Implements Phase 5.6.3: FeatureSignals extraction computed during Phase 4 assembly.
- Added profiles/signals.rs module with PageSignalAccumulator and extract_feature_signals()
- Predefined text patterns: currency symbols, ISO dates, INVOICE, WHEREAS, Abstract, References, page numbers, bullets, math operators
- Per-page signal extraction: text content, fonts, table count, heading depth, glyph density
- Document-level aggregation: page count, font diversity, presence flags (signature field, form field, math operators, bullet lists, footer page numbers)
- All regex patterns compiled once via OnceLock for performance
- 23 unit tests covering all functionality
Closes: pdftract-49cn
Add two PDF/A fixtures for testing assisted-OCR (BrokenVector path):
- Aligned fixture with correctly-positioned invisible text layer
- Misaligned fixture with text layer offset by (10pt, 5pt)
Extend ci/wer-gate.sh with WER validation for BrokenVector fixtures.
Acceptance criteria:
- Two BrokenVector fixtures committed (both 1.5 KB, well under 200 KB limit)
- ci/wer-gate.sh extended with new fixture invocations
- WER delta tests skip gracefully when the OCR environment is unavailable
Closes: pdftract-48ea
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement cluster_spans_into_lines for Phase 4.2 line formation.
Groups spans into lines by baseline proximity using threshold 0.5 * median_font_size.
- Add HasFontSize trait for types with font_size
- Implement cluster_spans_into_lines function
  - Compute baseline for each span
  - Sort by baseline ASC
  - Sweep and cluster within threshold
  - Emit Line per cluster
  - Sort spans by x0 within each line
- Add finalize_line_cluster helper
- Export new items from layout module
Tests: All 11 acceptance criteria tests pass
- Spans baselines 100, 100.5, 105 with median 12: one line
- Spans baselines 100, 110 with median 12: two lines
- Superscript stays on same line as base text
- Empty input produces empty output
- Threshold is 0.5 * median_font_size (not hardcoded)
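The sweep can be sketched as below. The `Span` struct here is a stand-in for the HasFontSize-based types; comparing each span against the previous baseline in the cluster and taking the upper median for even counts are both assumptions.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Span {
    x0: f64,
    baseline: f64,
    font_size: f64,
}

/// Median font size (upper median when the count is even).
fn median_font_size(spans: &[Span]) -> f64 {
    let mut sizes: Vec<f64> = spans.iter().map(|s| s.font_size).collect();
    sizes.sort_by(|a, b| a.total_cmp(b));
    sizes[sizes.len() / 2]
}

/// Groups spans into lines by baseline proximity.
fn cluster_spans_into_lines(mut spans: Vec<Span>) -> Vec<Vec<Span>> {
    if spans.is_empty() {
        return Vec::new();
    }
    let threshold = 0.5 * median_font_size(&spans);
    spans.sort_by(|a, b| a.baseline.total_cmp(&b.baseline));

    let mut lines: Vec<Vec<Span>> = Vec::new();
    for span in spans {
        // Join the current line if the baseline gap stays within threshold.
        let same_line = lines
            .last()
            .and_then(|line| line.last())
            .map_or(false, |prev| span.baseline - prev.baseline <= threshold);
        if same_line {
            lines.last_mut().unwrap().push(span);
        } else {
            lines.push(vec![span]);
        }
    }
    // Left-to-right order inside each line.
    for line in &mut lines {
        line.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    }
    lines
}
```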
Closes: pdftract-6bwq4
Implement Phase 5.3.2a: histogram-based contrast normalization for OCR
preprocessing. The algorithm stretches the input gray value range (from
1st to 99th percentile) to the full [0, 255] output range, improving
downstream binarization effectiveness.
Key implementation details:
- 256-bin histogram computation for percentile calculation
- 1st/99th percentile robustness against hot pixels and artifacts
- In-place mutation for performance (no double allocation)
- Proper error handling for uniform images and invalid dimensions
- Overflow-safe arithmetic using i32 intermediate values
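A minimal sketch of the percentile stretch on an 8-bit grayscale buffer. The function signature and the exact percentile definition (smallest gray value whose cumulative count reaches the target) are assumptions; the error names follow the ones mentioned in this message.

```rust
#[derive(Debug, PartialEq)]
enum ContrastError {
    InvalidDimensions,
    UniformImage,
}

/// Smallest gray value whose cumulative count reaches `pct` percent.
fn percentile(hist: &[u32; 256], total: u64, pct: u64) -> u8 {
    let target = ((total * pct + 99) / 100).max(1);
    let mut cumulative = 0u64;
    for (value, &count) in hist.iter().enumerate() {
        cumulative += count as u64;
        if cumulative >= target {
            return value as u8;
        }
    }
    255
}

/// Stretches the 1st..99th percentile range to [0, 255] in place.
fn normalize_contrast(pixels: &mut [u8], width: usize, height: usize) -> Result<(), ContrastError> {
    if width == 0 || height == 0 || pixels.len() != width * height {
        return Err(ContrastError::InvalidDimensions);
    }
    let mut hist = [0u32; 256];
    for &p in pixels.iter() {
        hist[p as usize] += 1;
    }
    let total = pixels.len() as u64;
    let lo = percentile(&hist, total, 1) as i32;
    let hi = percentile(&hist, total, 99) as i32;
    if hi <= lo {
        return Err(ContrastError::UniformImage);
    }
    let range = hi - lo;
    for p in pixels.iter_mut() {
        // i32 intermediates avoid u8 overflow; outliers clamp to the ends.
        let stretched = (*p as i32 - lo) * 255 / range;
        *p = stretched.clamp(0, 255) as u8;
    }
    Ok(())
}
```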
Acceptance criteria:
- Image with [50, 200] range → stretched to [0, 255]
- Hot pixel robustness: single 0/255 pixels handled correctly
- Uniform image → early return with UniformImage error
- Invalid dimensions (zero width/height) → InvalidDimensions error
- Full performance: < 50 ms for 8 MP images
Closes: pdftract-6dki1
Implement Phase 7.4.4: AcroForm + XFA field combiner with XFA-wins
precedence. This enables pdftract to handle hybrid PDF forms that
contain both AcroForm and XFA representations.
- Add FormFieldValue enum with Text, Button, Choice, Signature variants
- Add ChoiceValue enum for single/multiple choice selections
- Implement combine() function that merges AcroForm and XFA fields
with XFA values taking precedence on collision
- Implement XFA boolean string conversion ("true"/"false"/"1"/"0")
to Button selected state
- Preserve AcroForm type hints when XFA provides the value
- Emit diagnostics for field name collisions
- Sort output alphabetically by field name
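The XFA-wins merge can be sketched as follows. The two-variant `FormFieldValue`, the raw-string XFA input, and the diagnostic text are simplifications invented for this example; the real enum also has Choice and Signature variants.

```rust
use std::collections::BTreeMap;

/// Reduced form of the field value enum, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum FormFieldValue {
    Text(String),
    Button { selected: bool },
}

/// Interprets XFA boolean strings.
fn xfa_bool(raw: &str) -> Option<bool> {
    match raw {
        "true" | "1" => Some(true),
        "false" | "0" => Some(false),
        _ => None,
    }
}

/// Merges AcroForm and XFA fields; XFA values win on name collision.
/// Returns the fields sorted by name plus collision diagnostics.
fn combine(
    acro: Vec<(String, FormFieldValue)>,
    xfa: Vec<(String, String)>,
) -> (Vec<(String, FormFieldValue)>, Vec<String>) {
    let mut merged: BTreeMap<String, FormFieldValue> = acro.into_iter().collect();
    let mut diagnostics = Vec::new();
    for (name, raw) in xfa {
        // Keep the AcroForm type hint: a Button stays a Button when the
        // XFA value parses as a boolean.
        let value = match (merged.get(&name), xfa_bool(&raw)) {
            (Some(FormFieldValue::Button { .. }), Some(selected)) => {
                FormFieldValue::Button { selected }
            }
            _ => FormFieldValue::Text(raw),
        };
        if merged.insert(name.clone(), value).is_some() {
            diagnostics.push(format!("field name collision: {name}"));
        }
    }
    // BTreeMap iteration yields keys in sorted order.
    (merged.into_iter().collect(), diagnostics)
}
```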
Closes: pdftract-2qum
Implement Phase 4.4 code block classification for detecting indented
monospace code blocks.
Features:
- is_monospace_font_name: Check font name for monospace indicators
(mono, courier, code, fixed, console - case-insensitive)
- is_fixed_pitch_flag: Check FontDescriptor bit 0 (FixedPitch)
- classify_code: Classify block as code if all spans monospace AND
indented ≥ 2em from column baseline
- classify_page_code_blocks: Post-processing pass to upgrade paragraph
blocks to code kind
Acceptance criteria:
- All-Courier, indented 24pt, font_size 12pt (2em=24): Code ✓
- All-monospace, not indented: NOT Code ✓
- Mixed serif+monospace: NOT Code ✓
- One serif span at end: NOT Code ✓
- FixedPitch flag set, no "Mono" in name: STILL Code ✓
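The classification rule can be sketched like this. `SpanFont` and the flat `classify_code` signature are hypothetical; the keyword list, the FixedPitch bit, and the 2em indent rule come from the description above.

```rust
/// Minimal per-span font info needed for the check (illustrative).
struct SpanFont<'a> {
    font_name: &'a str,
    descriptor_flags: u32,
}

/// Case-insensitive match on common monospace name fragments.
fn is_monospace_font_name(name: &str) -> bool {
    let lower = name.to_lowercase();
    ["mono", "courier", "code", "fixed", "console"]
        .iter()
        .any(|keyword| lower.contains(*keyword))
}

/// FontDescriptor /Flags bit 0 is FixedPitch.
fn is_fixed_pitch_flag(flags: u32) -> bool {
    flags & 1 != 0
}

/// A block is code when every span is monospace and the block is
/// indented at least 2em from the column baseline.
fn classify_code(spans: &[SpanFont], indent_pt: f64, font_size_pt: f64) -> bool {
    let all_monospace = !spans.is_empty()
        && spans.iter().all(|s| {
            is_monospace_font_name(s.font_name) || is_fixed_pitch_flag(s.descriptor_flags)
        });
    all_monospace && indent_pt >= 2.0 * font_size_pt
}
```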
Closes: pdftract-8n270
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>