Add verification note documenting the JSON-RPC 2.0 framing implementation,
with all acceptance criteria passing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add hand-rolled JSON-RPC 2.0 implementation for MCP server transports.
Module: crates/pdftract-cli/src/mcp/framing/
- Id enum with Number/String/Null variants preserving JSON type
- Request, Response, Notification, ErrorObject structs
- BatchMessage for batch request handling
- Strict jsonrpc version validation (must be "2.0")
- All spec-defined error codes: the five predefined codes (-32700, -32600, -32601, -32602, -32603) plus the reserved server-error range (-32099..-32000)
- Constructor helpers for common patterns
Acceptance criteria verified:
- Round-trip serialization/deserialization
- ID type preservation (number/string/null)
- Parse error responses with null id
- Method not found error construction
- Notification detection (no id field)
- Batch request handling
- Rejection of invalid jsonrpc versions
- Empty batch rejection
16 unit tests covering all spec requirements.
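For illustration only, a minimal std-only sketch of how an id type can keep
the JSON type it arrived with, and how a parse-error response carries a null
id. The `to_json`, `escape`, `is_server_error`, and `error_response` helpers
are hypothetical names for this sketch, not the module's actual API, and the
string escaping is deliberately simplified.

```rust
/// JSON-RPC 2.0 request id. Distinct variants let a response echo the id
/// with the same JSON type the client sent (7 and "7" stay different).
#[derive(Debug, Clone, PartialEq)]
pub enum Id {
    Number(i64),
    String(String),
    Null,
}

/// Simplified escaping for this sketch: handles only `\` and `"`.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

impl Id {
    /// Render the id as a JSON value.
    pub fn to_json(&self) -> String {
        match self {
            Id::Number(n) => n.to_string(),
            Id::String(s) => format!("\"{}\"", escape(s)),
            Id::Null => "null".to_string(),
        }
    }
}

// Predefined error codes from the JSON-RPC 2.0 specification.
pub const PARSE_ERROR: i32 = -32700;
pub const INVALID_REQUEST: i32 = -32600;
pub const METHOD_NOT_FOUND: i32 = -32601;
pub const INVALID_PARAMS: i32 = -32602;
pub const INTERNAL_ERROR: i32 = -32603;

/// True for codes in the range reserved for implementation-defined server errors.
pub fn is_server_error(code: i32) -> bool {
    (-32099..=-32000).contains(&code)
}

/// Build an error response as a JSON string (hypothetical helper).
pub fn error_response(id: &Id, code: i32, message: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","error":{{"code":{},"message":"{}"}},"id":{}}}"#,
        code,
        escape(message),
        id.to_json()
    )
}
```

Under these assumptions, `error_response(&Id::Null, PARSE_ERROR, "Parse error")`
yields a response whose `id` is JSON `null`, which is what the spec asks for
when the request id could not be determined.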
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add compute-sha256sums step to pdftract-ci publish-if-tag that produces
an aggregate SHA256SUMS file covering all distributed artifacts: binary
archives, Python wheels, sdist, and CycloneDX SBOM.
Key changes:
- Glob-based artifact collection (tar.gz, zip, whl, cdx.json)
- Deterministic sorting with LC_ALL=C sort -k 2 for reproducibility
- Local verification via sha256sum --check before publishing
- Dynamic artifact upload array instead of hardcoded EXPECTED_ARTIFACTS
- SBOM added as optional input artifact
The SHA256SUMS file format matches GNU coreutils sha256sum output, so a
single cosign verify-blob over SHA256SUMS, followed by sha256sum --check,
verifies every artifact.
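The workflow step itself is shell. As a rough sketch of the same idea, the
code below shows why byte-wise sorting by file name gives a reproducible
manifest. `Entry` and `render_sha256sums` are hypothetical names, and the
digests are assumed to be computed elsewhere.

```rust
/// One manifest entry: a hex digest (computed elsewhere) and a file name.
pub struct Entry {
    pub digest_hex: String,
    pub file_name: String,
}

/// Render entries in sha256sum text format, sorted by file name bytes.
/// Byte-wise ordering is locale-independent, in the spirit of
/// `LC_ALL=C sort -k 2`.
pub fn render_sha256sums(mut entries: Vec<Entry>) -> String {
    entries.sort_by(|a, b| a.file_name.as_bytes().cmp(b.file_name.as_bytes()));
    let mut out = String::new();
    for e in &entries {
        // sha256sum line layout: digest, two spaces, file name.
        out.push_str(&e.digest_hex);
        out.push_str("  ");
        out.push_str(&e.file_name);
        out.push('\n');
    }
    out
}
```

Because the ordering depends only on the bytes of the file names, two runs
over the same artifact set produce an identical file, and therefore an
identical signature input.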
References:
- Plan line 3369: SHA256SUMS aggregate
- Plan line 3419: sign-blob of SHA256SUMS
- Plan line 3460: one cosign verify-blob umbrella
Co-Authored-By: Claude Code <noreply@anthropic.com>
- Wire generate-provenance and verify-provenance steps into workflow DAG
- Update publish-if-tag to upload multiple.intoto.jsonl to GitHub Release
- Fix provenance reproducibility by using SOURCE_DATE_EPOCH from git commit
- Docker images already have cosign attest --type slsaprovenance
Acceptance criteria:
- PASS: generate-provenance step wired into DAG
- PASS: provenance uploaded to GitHub Release
- PASS: Docker image cosign attest already implemented
- WARN: Full slsa-verifier verification requires OIDC issuer registration
- PASS: Provenance is reproducible using git commit timestamp
- PASS: Automated smoke test validates JSON structure
Refs: pdftract-3gk5, plan line 3415 (Signing and Provenance)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Complete implementation of the Pdftract NuGet package as a subprocess-
based SDK with async-first design using System.Diagnostics.Process and
System.Text.Json.
Implementation:
- All 9 contract methods (ExtractAsync, ExtractTextAsync, etc.) with sync
  wrappers in Pdftract.Sync.cs
- 8 exception types inheriting from PdftractException base class
- Source discriminated union (PathSource, UrlSource, BytesSource) with
  FromPath, FromUrl, FromUri, FromBytes factory methods
- C# record types for all models (Document, Page, Metadata, etc.)
- ExtractOptions, SearchOptions, HashOptions with PascalCase properties
- Source-generated JSON serialization via JsonContext for Native AOT
- IAsyncEnumerable streaming for NDJSON outputs
- CancellationToken propagation to Process.Kill(entireProcessTree: true)
Bug fixes:
- Fixed ArgumentList handling (the whole List was being added as a single element)
- Added source.Dispose() cleanup for BytesSource temporary files
- Added cleanup for VerifyReceiptAsync temporary receipt file
- Added process.EnableRaisingEvents for proper event handling
- Fixed output capture to include newlines between lines
- Changed to source-generated JSON (JsonContext) instead of reflection
Acceptance criteria:
- All 9 methods exposed as both async and sync variants
- All 8 exception classes inherit from PdftractException
- Models as C# records
- Supports net8.0 and net9.0
- CancellationToken terminates subprocess
Files modified:
- pdftract-dotnet/src/Pdftract/Pdftract.cs
- pdftract-dotnet/src/Pdftract/Pdftract.Sync.cs
- pdftract-dotnet/src/Pdftract/Source/Source.cs
- pdftract-dotnet/src/Pdftract/Models/Document.cs
- pdftract-dotnet/src/Pdftract/Models/JsonContext.cs
- pdftract-dotnet/tests/Pdftract.Tests/ConformanceTests.cs
- pdftract-dotnet/README.md
- pdftract-dotnet/notes/pdftract-1w22d.md
Co-Authored-By: Claude Code <noreply@anthropic.com>
Add verification note confirming that per-page Resource dictionary
inheritance is complete and all acceptance criteria are met.
The implementation in resources.rs and pages.rs provides:
- Per-namespace merging (Font, XObject, ExtGState, ColorSpace, etc.)
- Per-key last-write-wins semantics
- Arc sharing for memory efficiency when pages lack /Resources
- Support for inline ColorSpace arrays
All 10 resource-related tests pass, including:
- 3-level inheritance test
- Per-key override test
- Arc sharing test
- ColorSpace inline array test
- Empty root /Resources test
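A simplified sketch of the merge semantics described above, assuming a
resource dictionary can be modelled as namespace -> key -> value with string
values. The `ResourceDict` alias and `inherit` function here are illustrative
stand-ins, not the types in resources.rs.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified model: namespace ("Font") -> key ("F1") -> value.
pub type ResourceDict = HashMap<String, HashMap<String, String>>;

/// Compute the effective resources of a node from its parent's effective
/// resources and the node's own /Resources entry, if any.
pub fn inherit(parent: &Arc<ResourceDict>, own: Option<&ResourceDict>) -> Arc<ResourceDict> {
    match own {
        // No /Resources on this node: share the parent's allocation.
        None => Arc::clone(parent),
        Some(own) => {
            let mut merged = (**parent).clone();
            for (namespace, entries) in own {
                let slot = merged.entry(namespace.clone()).or_default();
                for (key, value) in entries {
                    // Per-key last-write-wins: the child overrides the parent.
                    slot.insert(key.clone(), value.clone());
                }
            }
            Arc::new(merged)
        }
    }
}
```

Merging per key rather than per namespace means a page that redefines only
/F1 still sees the parent's /F2, and the `None` branch is what makes an
`Arc::ptr_eq` check meaningful.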
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enhanced the `detect_linearization` function to avoid false matches when
extracting keys from the linearization dictionary. The previous implementation
could incorrectly match "/L" within "/Linearized" or "/H" within other keys.
Changes:
- Added loop-based search in extract_number helper to skip substring matches
- Added similar substring-aware logic for /H (hint stream) parsing
- Added new diagnostic codes for /Prev chain error handling
- Added comprehensive verification note
Acceptance criteria PASS:
- Non-linearized files return None
- Valid linearized dict detected correctly
- File size mismatch (incremental update) invalidates linearization
- No /H entry returns None for hint_stream_offset
- Random bytes never panic (proptest)
- Forward scan disabled for linearized files
- INV-8 maintained (no panics on arbitrary input)
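To illustrate the substring problem, here is a small sketch of a loop-based
key search that accepts a match only when the key is followed by whitespace
or a delimiter. It reuses the name `extract_number` for readability, but the
signature and the `ends_name` helper are assumptions; the real helper may
differ, and this version handles only unsigned integers.

```rust
/// True for bytes that terminate a PDF name token (whitespace or delimiter).
fn ends_name(b: u8) -> bool {
    matches!(
        b,
        b' ' | b'\t' | b'\r' | b'\n' | 0x0C | 0x00
            | b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%'
    )
}

/// Find `key` (for example b"/L") as a whole name and parse the unsigned
/// integer that follows it. Matches inside longer names are skipped.
pub fn extract_number(dict: &[u8], key: &[u8]) -> Option<u64> {
    if key.is_empty() {
        return None;
    }
    let mut from = 0;
    while from + key.len() <= dict.len() {
        // Position of the next raw occurrence, or give up if there is none.
        let pos = from + dict[from..].windows(key.len()).position(|w| w == key)?;
        let after = pos + key.len();
        // Accept only when the key ends here, so "/L" does not match "/Linearized".
        if dict.get(after).map_or(false, |&b| ends_name(b)) {
            let rest = &dict[after..];
            let start = rest.iter().position(|b| !b.is_ascii_whitespace())?;
            let digits: Vec<u8> = rest[start..]
                .iter()
                .copied()
                .take_while(|b| b.is_ascii_digit())
                .collect();
            return std::str::from_utf8(&digits).ok()?.parse().ok();
        }
        // Substring hit: resume the search one byte later.
        from = pos + 1;
    }
    None
}
```

With the input `<< /Linearized 1 /L 1234 >>`, the first raw hit for `/L` is
inside `/Linearized` and is skipped; the second is followed by a space and
yields 1234.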
Co-Authored-By: Claude Code <noreply@anthropic.com>
The 512 MiB DEFAULT_MAX_DECOMPRESS_BYTES change was implemented in
commit e94f2ab (fix(bf-49wmw)). This note documents the verification.
Co-Authored-By: Claude Code <noreply@anthropic.com>
The hybrid xref handler (merge_hybrid) was already implemented. This adds
a property-based test to verify it handles random combinations of traditional
and stream entries without panicking.
Changes:
- Added proptest_merge_hybrid_no_panic to proptest_tests module
- Tests random entry sets using prop::collection::hash_map
- Covers all entry types (InUse, Free, Compressed)
- Verification note confirms all acceptance criteria PASS
Test results: 9/9 merge_hybrid tests pass
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fixed compilation error in xref.rs where u64 literal 0x5DEECE66D was used
with u32 state, causing overflow. Changed state to u64 for proper Java
Random algorithm behavior.
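For context, a sketch of the java.util.Random linear congruential generator
with a u64 state. The multiplier 0x5DEECE66D needs 35 bits, which is why a
u32 state cannot hold the arithmetic. `JavaRandom` is a hypothetical name for
this sketch and is not necessarily how the test helper in xref.rs is written.

```rust
const MULTIPLIER: u64 = 0x5DEECE66D;
const INCREMENT: u64 = 0xB;
/// The generator keeps 48 bits of state.
const MASK: u64 = (1 << 48) - 1;

pub struct JavaRandom {
    state: u64,
}

impl JavaRandom {
    /// Seed scrambling as done by java.util.Random's constructor.
    pub fn new(seed: i64) -> Self {
        Self { state: (seed as u64 ^ MULTIPLIER) & MASK }
    }

    /// Advance the state and return its top `bits` bits (1..=32).
    fn next(&mut self, bits: u32) -> i32 {
        self.state = self.state.wrapping_mul(MULTIPLIER).wrapping_add(INCREMENT) & MASK;
        (self.state >> (48 - bits)) as i32
    }

    /// Equivalent of Java's nextInt().
    pub fn next_int(&mut self) -> i32 {
        self.next(32)
    }
}
```

Usage: `JavaRandom::new(0).next_int()` returns -1155484576, the same first
value Java produces for `new Random(0).nextInt()`.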
The OCG /OCProperties parsing implementation was already complete and
all tests pass. See notes/pdftract-2a6rk.md for verification.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements merge_hybrid() and is_hybrid_trailer() for hybrid PDF files.
Hybrid files have both a traditional xref table at startxref and a
supplementary xref stream pointed to by /XRefStm in the trailer.
Per the PDF spec, the traditional table is authoritative for objects it
covers; the stream's type-2 entries fill gaps not covered by the
traditional table.
Key behaviors:
- Traditional entries override stream entries for same object numbers
- Stream-only type-2 entries are added as gap fill
- Free/InUse conflicts emit STRUCT_HYBRID_CONFLICT diagnostic
- Merged trailer has /XRefStm key removed
- Result XrefSection has is_hybrid: true set
Acceptance criteria:
- Critical test: traditional entries override stream entries (PASS)
- Gap fill: stream-only type-2 entries added (PASS)
- Free/InUse conflict: diagnostic emitted (PASS)
- Non-hybrid trailer: is_hybrid_trailer returns false (PASS)
- proptest: no panics with random combinations (PASS)
- INV-8 maintained: no panics in library code (PASS)
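The precedence rules above can be sketched as follows. This is a simplified
model, not the crate's merge_hybrid(): the `Entry` enum and the return shape
(merged table plus conflicting object numbers) are assumptions, trailer
handling is omitted, and stream-only entries that are not type 2 are simply
ignored here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Entry {
    Free,
    InUse { offset: u64 },
    /// Type-2 entry: object stored inside an object stream.
    Compressed { stream_obj: u32, index: u32 },
}

/// Merge a traditional xref table with the /XRefStm stream entries.
/// Returns the merged table and the object numbers with Free/InUse conflicts.
pub fn merge_hybrid(
    traditional: &BTreeMap<u32, Entry>,
    stream: &BTreeMap<u32, Entry>,
) -> (BTreeMap<u32, Entry>, Vec<u32>) {
    let mut merged = traditional.clone();
    let mut conflicts = Vec::new();
    for (&obj, &entry) in stream {
        match traditional.get(&obj) {
            // Traditional entry wins; only record a Free/InUse disagreement.
            Some(&existing) => {
                let free_vs_in_use = matches!(
                    (existing, entry),
                    (Entry::Free, Entry::InUse { .. }) | (Entry::InUse { .. }, Entry::Free)
                );
                if free_vs_in_use {
                    conflicts.push(obj);
                }
            }
            // Gap fill: only stream-only type-2 entries are added.
            None => {
                if matches!(entry, Entry::Compressed { .. }) {
                    merged.insert(obj, entry);
                }
            }
        }
    }
    (merged, conflicts)
}
```

A caller would turn each returned object number into a
STRUCT_HYBRID_CONFLICT diagnostic rather than failing the parse.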
Co-Authored-By: Claude Code <noreply@anthropic.com>
Add a Repository Layout subsection: SDK source lives at root-level pdftract-<lang>/
in this monorepo (single source of truth), generated via pdftract sdk codegen and
published to language registries from here. Retire the legacy standalone repos.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Consolidate the .NET, Java, and Node SDKs into root-level pdftract-<lang>/
directories (matching the already-tracked pdftract-go/), per the decision to
make the generated SDKs first-class monorepo members rather than separate repos.
Content imported from the standalone ~/pdftract-<lang> repos (build artifacts
excluded). Removes the broken empty-git nested clones that were polluting the
working tree.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 12 synthetic malformed fixtures (generate_test_corpus.py output, tracked in
PROVENANCE.md) existed only as untracked files and were swept by a cleanup stash,
breaking the provenance pre-commit hook for all commits. Restore from stash and
commit them as tracked files so they cannot be lost again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a Memory targets table as a first-class acceptance criterion alongside
Accuracy/Speed/Weight, with a hard per-document peak-RSS ceiling that must not
scale with input/payload. Promote OOM-safety to a Tier-1 hard gate. Reconcile
the contradictory 2 GB max_decompress_bytes default to the research-backed 512 MB
(root cause of an observed multi-GB OOM via the unbounded PNG-predictor pre-alloc
under rayon page parallelism).
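As an illustration of the kind of bound involved, the sketch below caps
decoder output by reading through a limited adapter instead of pre-allocating
from a declared size. `read_bounded` is a hypothetical helper; the constant
name is borrowed from the codebase and uses the 512 MiB value, but this is
not the actual stream decoder.

```rust
use std::io::{self, Read};

/// Default ceiling on decompressed bytes per stream (512 MiB).
pub const DEFAULT_MAX_DECOMPRESS_BYTES: u64 = 512 * 1024 * 1024;

/// Read everything from `reader`, failing once more than `max_bytes` have
/// been produced. The buffer grows with actual output, never from a size
/// declared in the (untrusted) input.
pub fn read_bounded<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    // Allow one extra byte so an over-limit stream can be detected.
    let read = reader.take(max_bytes.saturating_add(1)).read_to_end(&mut out)?;
    if read as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decompressed size exceeds limit",
        ));
    }
    Ok(out)
}
```

Peak memory for one stream is then bounded by the limit plus one byte of
payload, regardless of what the stream dictionary claims, which matters when
several pages decode in parallel.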
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Change resolve function signature from Fn(ObjRef) -> Option<PdfObject>
  to Fn(ObjRef) -> Option<PdfStream> for type safety
- Fix caching: load_object_stream now properly populates cache
- Fix error propagation for /Extends chains (CircularRef, DepthExceeded)
- Fix test data: add whitespace between embedded objects for lexer
- Fix compilation error in test_truncated_objstm_body
All 16 objstm tests now pass.
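A sketch of how an /Extends chain walk can surface both error cases instead
of swallowing them. The error variant names follow the commit message;
`ExtendsError`, `extends_chain`, the `extends_of` lookup closure, and the use
of bare object numbers are assumptions made for this example.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum ExtendsError {
    /// The chain revisited this object number.
    CircularRef(u32),
    /// The chain is longer than the configured maximum.
    DepthExceeded,
}

/// Follow /Extends links starting at `start`. `extends_of` returns the
/// object number a stream extends, or None at the end of the chain.
pub fn extends_chain(
    start: u32,
    extends_of: impl Fn(u32) -> Option<u32>,
    max_depth: usize,
) -> Result<Vec<u32>, ExtendsError> {
    let mut chain = vec![start];
    let mut seen = HashSet::from([start]);
    let mut current = start;
    while let Some(next) = extends_of(current) {
        // `insert` returns false when the value was already present.
        if !seen.insert(next) {
            return Err(ExtendsError::CircularRef(next));
        }
        if chain.len() >= max_depth {
            return Err(ExtendsError::DepthExceeded);
        }
        chain.push(next);
        current = next;
    }
    Ok(chain)
}
```

Returning a Result here lets the caller propagate the failure with `?`, so a
cyclic or over-long chain is reported rather than treated as a missing
object.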
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Verifies that the per-page Resource dictionary inheritance implementation
is complete and correct. All acceptance criteria are met:
- 3-level resource inheritance test passes
- Per-key override test passes
- /Resources missing on page inherits parent's
- Arc<ResourceDict> sharing verified with Arc::ptr_eq
- ColorSpace inline-array test passes
- Empty root /Resources propagates correctly
- INV-8 maintained (all fuzz tests pass)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add documentation for the fix that removed diagnostic emission for
unknown keywords, complementing the earlier keyword fallback fix.
Co-Authored-By: Claude Code <noreply@anthropic.com>
The lexer should not emit diagnostics for unknown keywords because:
1. Many valid keywords (trailer, xref, etc.) are not in the initial dispatch table
2. The object parser is responsible for validating keywords against known operators
3. Emitting diagnostics here causes false positives for valid PDF constructs
This change aligns with the task requirement that unknown keywords emit
Token::Keyword without a diagnostic, letting the object parser handle
STRUCT_UNKNOWN_KEYWORD if needed.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Fixed incorrect fallback behavior in keyword lexer functions. Four
functions (lex_e_keyword, lex_o_keyword, lex_r_keyword, lex_n_keyword)
were incorrectly calling lex_name() instead of lex_keyword() when
keywords didn't match.
When a PDF contains an unrecognized word starting with e/o/n/R
(e.g., "endob" instead of "endobj"), the lexer should fall back to
generic keyword parsing (Token::Keyword(bytes)), not name parsing.
Names always start with /, so calling lex_name() on input without
a leading / would incorrectly skip the first byte.
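The difference between the two fallbacks can be shown with a toy lexer. This
is a sketch under simplifying assumptions: the `Token` enum, the `EndObj`
variant, the `is_regular` helper, and the function signatures are
illustrative and do not mirror the real lexer.

```rust
#[derive(Debug, PartialEq)]
pub enum Token<'a> {
    EndObj,
    Keyword(&'a [u8]),
    Name(&'a [u8]),
}

/// Regular characters are everything except PDF whitespace and delimiters.
fn is_regular(b: u8) -> bool {
    !matches!(
        b,
        b' ' | b'\t' | b'\r' | b'\n' | 0x0C | 0x00
            | b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%'
    )
}

/// Generic keyword: consume regular characters starting at the first byte.
pub fn lex_keyword(input: &[u8]) -> (Token<'_>, usize) {
    let len = input.iter().take_while(|&&b| is_regular(b)).count();
    (Token::Keyword(&input[..len]), len)
}

/// Name: assumes the first byte is '/', skips it, then consumes regular bytes.
pub fn lex_name(input: &[u8]) -> (Token<'_>, usize) {
    let body = input.get(1..).unwrap_or(&[]);
    let len = body.iter().take_while(|&&b| is_regular(b)).count();
    (Token::Name(&body[..len]), input.len().min(1) + len)
}

/// Words starting with 'e': recognise "endobj", otherwise fall back to the
/// generic keyword path, never to lex_name.
pub fn lex_e_keyword(input: &[u8]) -> (Token<'_>, usize) {
    let (token, len) = lex_keyword(input);
    if &input[..len] == b"endobj".as_slice() {
        (Token::EndObj, len)
    } else {
        (token, len)
    }
}
```

Feeding `endob 1` to `lex_name` produces a Name token containing `ndob`,
silently dropping the `e`, whereas `lex_e_keyword` produces
`Keyword(b"endob")` with all five bytes intact.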
References:
- Bead: pdftract-5upi
- Notes: notes/pdftract-5upi.md
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents that CycloneDX SBOM generation is fully implemented
in the Argo Workflows (declarative-config). The workflows:
- Generate pdftract-vX.Y.Z.cdx.json using cargo-cyclonedx
- Validate schema with cyclonedx-cli validate
- Attest to Docker images via cosign attest --type cyclonedx
- Attach to GitHub Release as an asset
- Include in SHA256SUMS aggregate
Acceptance criteria: 5 PASS, 1 WARN (grype test requires release)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the completed implementation of pdftract-release-cascade
WorkflowTemplate and pdftract-tag-trigger Argo Events Sensor.
Acceptance criteria:
- PASS: All infrastructure files committed in declarative-config
- WARN: Runtime verification deferred (kubectl not available in env)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the enhancements made to cosign keyless signing:
- Projected service account token with sigstore audience
- Explicit OIDC issuer URL configuration
- Improved digest extraction with fallback strategies
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make diagnostics module visible to fingerprint module and fix
hash_page_geometry signature to match usage.
Changes:
- Add `pub mod diagnostics;` to lib.rs for module visibility
- Modify hash_page_geometry to create diagnostics internally
The canonicalize module already has complete implementation:
- canonicalize_f64: banker's rounding to 4dp for geometry
- normalize_content_stream: whitespace normalization via lexer
- serialize_dict_canonical: sorted-key dict serialization
- hash_resource_dict_canonical: order-independent resource hashing
Verification: notes/pdftract-154mz.md
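For readers unfamiliar with the canonicalization ideas listed above, a rough
sketch of two of them: half-to-even rounding at four decimal places and
sorted-key dictionary serialization. The function names match the list, but
the bodies, the `round_half_even` helper, and the string-pair signature are
assumptions and are likely simpler than the real module.

```rust
/// Round to the nearest integer, sending exact halves to the even neighbour.
pub fn round_half_even(x: f64) -> f64 {
    let floor = x.floor();
    let diff = x - floor;
    if diff > 0.5 {
        floor + 1.0
    } else if diff < 0.5 {
        floor
    } else if floor % 2.0 == 0.0 {
        floor
    } else {
        floor + 1.0
    }
}

/// Round a geometry value to 4 decimal places with banker's rounding.
pub fn canonicalize_f64(x: f64) -> f64 {
    let rounded = round_half_even(x * 10_000.0) / 10_000.0;
    // Collapse -0.0 into 0.0 so both serialize identically.
    if rounded == 0.0 { 0.0 } else { rounded }
}

/// Serialize key/value pairs with keys in byte order, so insertion order
/// in the source dictionary does not affect the output.
pub fn serialize_dict_canonical(entries: &[(&str, &str)]) -> String {
    let mut sorted: Vec<&(&str, &str)> = entries.iter().collect();
    sorted.sort_by(|a, b| a.0.as_bytes().cmp(b.0.as_bytes()));
    let mut out = String::from("<<");
    for (key, value) in sorted {
        out.push_str(&format!(" /{} {}", key, value));
    }
    out.push_str(" >>");
    out
}
```

Two dictionaries that differ only in key order then serialize to the same
bytes, which is the property an order-independent hash relies on.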
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the WorkflowTemplate creation for mdBook → Cloudflare Pages CI.
Template committed to declarative-config 4fe4947.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the implementation of the pdftract-github-release
WorkflowTemplate, including artifact taxonomy, release notes
generation, and acceptance criteria status.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add per-PR property tests and nightly fuzz job infrastructure:
CI Changes (declarative-config):
- pdftract-ci.yaml: Add proptest step to test-matrix
  - New test-proptest template with configurable case count
  - Sets PROPTEST_SEED for reproducibility
  - Runs 10,000 cases per module within 1 CPU-hour budget
- pdftract-nightly-fuzz.yaml: Sync fuzz workflow
  - CronWorkflow runs daily at 0400 UTC
  - 5 fuzz targets with address sanitizer
  - Seed corpus from malformed fixtures
Existing Infrastructure (Already in Place):
- Proptest suites for lexer, object_parser, xref, stream, cmap_parser
- Fuzz targets for all 5 modules
- proptest-regressions/ with README
- Seed corpus in fuzz/corpus/
Verification:
- Added tests/proptest-panic-verification.rs
- Proptest infrastructure correctly structured
- Will catch deliberate panics within budget
Closes: pdftract-33v
Documents the implementation of the pdftract-crates-publish WorkflowTemplate
in jedarden/declarative-config.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OCG implementation was already complete in ocg.rs. All 20 tests pass:
- BaseState parsing (ON/OFF/Unchanged)
- /ON and /OFF array override handling
- OCMD policy preservation (AllOn, AnyOn, AllOff, AnyOff)
- INV-8 compliance verified via proptests
Phase 3 will consume OcProperties via is_visible() to suppress
glyphs in /OC /OCGRef BDC blocks when the referenced OCG is OFF.
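A sketch of how a visibility query over the default configuration might look.
The field layout of `OcProperties`, the use of bare object numbers for OCGs,
the treatment of Unchanged as visible, and the choice to let the /OFF array
win over /ON are all assumptions for illustration. OCMD policies are not
modelled.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BaseState {
    On,
    Off,
    Unchanged,
}

/// Simplified default-configuration state for optional content groups.
pub struct OcProperties {
    pub base_state: BaseState,
    /// Groups listed in the /ON array.
    pub on: HashSet<u32>,
    /// Groups listed in the /OFF array.
    pub off: HashSet<u32>,
}

impl OcProperties {
    /// Whether content tagged with the given OCG should be emitted.
    pub fn is_visible(&self, ocg: u32) -> bool {
        // Explicit array membership overrides the base state.
        if self.off.contains(&ocg) {
            return false;
        }
        if self.on.contains(&ocg) {
            return true;
        }
        match self.base_state {
            BaseState::Off => false,
            // Assumption: with no prior state to keep, Unchanged means visible.
            BaseState::On | BaseState::Unchanged => true,
        }
    }
}
```

An extraction pass could call `is_visible` when entering a marked-content
block and skip glyph emission while it returns false.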
Co-Authored-By: Claude Code <noreply@anthropic.com>
Add details about the BytesSource cleanup bug fix and clarify that the
contract defines 7 error kinds, not 8 as initially stated in the task.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add source Source parameter to invoke, invokeJSON, invokeString, invokeStream
- Change BytesSource from []byte type to struct with data and tmpPath fields
- Add proper cleanup of temporary files after subprocess execution
- Fix source parameter pass-through in Extract, ExtractText, ExtractMarkdown, GetMetadata, Hash, Classify
This ensures BytesSource temporary files are cleaned up after use, so they no
longer accumulate on disk. The BytesSource now creates a temp file on demand and
cleans it up automatically via defer in the invoke methods.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>