Complete implementation of the Pdftract NuGet package as a subprocess-based SDK with async-first design using System.Diagnostics.Process and System.Text.Json.

Implementation:
- All 9 contract methods (ExtractAsync, ExtractTextAsync, etc.) with sync wrappers in Pdftract.Sync.cs
- 8 exception types inheriting from PdftractException base class
- Source discriminated union (PathSource, UrlSource, BytesSource) with FromPath, FromUrl, FromUri, FromBytes factory methods
- C# record types for all models (Document, Page, Metadata, etc.)
- ExtractOptions, SearchOptions, HashOptions with PascalCase properties
- Source-generated JSON serialization via JsonContext for Native AOT
- IAsyncEnumerable streaming for NDJSON outputs
- CancellationToken propagation to Process.Kill(entireProcessTree: true)

Bug fixes:
- Fixed ArgumentList handling (was adding List as single element)
- Added source.Dispose() cleanup for BytesSource temporary files
- Added cleanup for VerifyReceiptAsync temporary receipt file
- Added process.EnableRaisingEvents for proper event handling
- Fixed output capture to include newlines between lines
- Changed to source-generated JSON (JsonContext) instead of reflection

Acceptance criteria:
- All 9 methods exposed as both async and sync variants
- All 8 exception classes inherit from PdftractException
- Models as C# records
- Supports net8.0 and net9.0
- CancellationToken terminates subprocess

Files modified:
- pdftract-dotnet/src/Pdftract/Pdftract.cs
- pdftract-dotnet/src/Pdftract/Pdftract.Sync.cs
- pdftract-dotnet/src/Pdftract/Source/Source.cs
- pdftract-dotnet/src/Pdftract/Models/Document.cs
- pdftract-dotnet/src/Pdftract/Models/JsonContext.cs
- pdftract-dotnet/tests/Pdftract.Tests/ConformanceTests.cs
- pdftract-dotnet/README.md
- pdftract-dotnet/notes/pdftract-1w22d.md

Co-Authored-By: Claude Code <noreply@anthropic.com>
Implementation Notes for pdftract-1w22d: .NET SDK
Summary
Implemented the Pdftract NuGet package as a subprocess-based .NET SDK with async-first design using System.Diagnostics.Process and System.Text.Json. Fixed several bugs in the subprocess invocation and cleanup logic.
What Was Implemented
Project Structure
/home/coding/pdftract/pdftract-dotnet/
├── Pdftract.csproj                  # Solution-level project file
├── Pdftract.sln                     # Solution file
├── README.md                        # Package documentation
├── src/Pdftract/
│   ├── Pdftract.csproj              # Main project (net8.0 + net9.0)
│   ├── Pdftract.cs                  # Main client (9 async methods)
│   ├── Pdftract.Sync.cs             # Sync wrappers
│   ├── Options.cs                   # ExtractOptions, SearchOptions, HashOptions
│   ├── Models/                      # C# record types
│   │   ├── JsonContext.cs           # Source-generated JSON serialization context
│   │   ├── Document.cs              # Root extraction result
│   │   ├── Page.cs                  # Page with spans, blocks, dimensions
│   │   ├── Span.cs                  # Text span with font, bbox, confidence
│   │   ├── Block.cs                 # Structural block (paragraph, heading, etc.)
│   │   ├── Metadata.cs              # PDF metadata
│   │   ├── Match.cs                 # Search match result
│   │   ├── Fingerprint.cs           # Document hash
│   │   ├── Classification.cs        # Document classification
│   │   ├── Receipt.cs               # Receipt for verification
│   │   └── ReceiptInfo.cs           # Receipt verification result
│   ├── Codegen/
│   │   └── Errors.cs                # Exception hierarchy (8 exception types)
│   └── Source/
│       └── Source.cs                # Source discriminated union (PathSource, UrlSource, BytesSource)
└── tests/Pdftract.Tests/
    ├── Pdftract.Tests.csproj
    └── ConformanceTests.cs          # Conformance test runner
Implementation Details
9 Contract Methods (All Implemented)
- ExtractAsync → `Task<Document>` - JSON extraction
- ExtractTextAsync → `Task<string>` - Plain text
- ExtractMarkdownAsync → `Task<string>` - Markdown
- ExtractStreamAsync → `IAsyncEnumerable<Page>` - NDJSON streaming
- SearchAsync → `IAsyncEnumerable<Match>` - Pattern search
- GetMetadataAsync → `Task<Metadata>` - Metadata extraction
- HashAsync → `Task<Fingerprint>` - Document fingerprint
- ClassifyAsync → `Task<Classification>` - Document classification
- VerifyReceiptAsync → `Task<bool>` - Receipt verification
Plus sync variants (Extract, ExtractText, etc.) with SuppressMessage attributes
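The streaming methods and their sync counterparts can be pictured with a minimal sketch. It is not the code in Pdftract.cs: the `NdjsonSketch` class and both method names are hypothetical, and it yields raw `JsonElement` values instead of the SDK's `Page` or `Match` records so that the block stays self-contained. It assumes the CLI writes one JSON value per line.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class NdjsonSketch
{
    // Yields one parsed JSON value per non-empty line of NDJSON input.
    public static async IAsyncEnumerable<JsonElement> ReadAsync(
        TextReader reader,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        while (await reader.ReadLineAsync(cancellationToken) is { } line)
        {
            if (line.Length == 0) continue; // tolerate blank lines

            using var doc = JsonDocument.Parse(line);
            // Clone so the element stays valid after the JsonDocument is disposed.
            yield return doc.RootElement.Clone();
        }
    }

    // Blocking wrapper in the spirit of the sync variants: drains the
    // async stream on the calling thread via ToBlockingEnumerable (.NET 7+).
    public static List<JsonElement> Read(TextReader reader) =>
        ReadAsync(reader).ToBlockingEnumerable().ToList();
}
```

In a real client the `TextReader` would presumably be the subprocess's redirected standard output, so pages can be consumed as the CLI emits them.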
Key Design Decisions
- Async-first: All methods return `Task<T>` or `IAsyncEnumerable<T>`
- Sync wrappers: Provided with `SuppressMessage` attributes for discouraged use
- C# records: All model types are immutable records
- PascalCase properties: SDK exposes PascalCase, maps to/from snake_case JSON via JsonSourceGenerationOptions
- Discriminated union for Source: Abstract base `Source` with `PathSource`, `UrlSource`, `BytesSource`
- System.Text.Json: Built-in serializer, no Newtonsoft dependency
- Native AOT ready: No reflection-only paths, source-generated JSON contexts
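The PascalCase-to-snake_case mapping and the source-generated context can be sketched as follows. This is a reduced illustration, not the contents of Models/JsonContext.cs: `MetadataSketch`, `SketchJsonContext`, and the `title` / `page_count` fields are hypothetical stand-ins rather than names from the schema. `JsonKnownNamingPolicy.SnakeCaseLower` requires .NET 8 or later, which matches the stated target frameworks.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical record standing in for one of the SDK's model types.
public sealed record MetadataSketch(string? Title, int PageCount);

// The source generator fills in this partial class at build time, so
// (de)serialization needs no runtime reflection (Native AOT friendly).
[JsonSourceGenerationOptions(PropertyNamingPolicy = JsonKnownNamingPolicy.SnakeCaseLower)]
[JsonSerializable(typeof(MetadataSketch))]
public partial class SketchJsonContext : JsonSerializerContext
{
}

public static class JsonSketch
{
    // Deserializes snake_case JSON into the PascalCase record.
    public static MetadataSketch? Parse(string json) =>
        JsonSerializer.Deserialize(json, SketchJsonContext.Default.MetadataSketch);

    // Serializes the record back to snake_case JSON.
    public static string Write(MetadataSketch value) =>
        JsonSerializer.Serialize(value, SketchJsonContext.Default.MetadataSketch);
}
```

Passing the generated `JsonTypeInfo` explicitly, rather than a bare `JsonSerializerOptions`, is what keeps the call free of reflection-based fallbacks.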
Error Mapping
All 8 exception types implemented per contract:
| Exit Code | Exception |
|---|---|
| 0 | (no exception) |
| 2 | CorruptPdfException |
| 3 | EncryptionException |
| 4 | SourceUnreachableException |
| 5 | RemoteFetchInterruptedException |
| 6 | TlsException |
| 10 | ReceiptVerifyException |
| other | UnknownPdftractException (base) |
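The table above translates naturally into a switch expression. The sketch below is a guess at the shape, not a copy of Codegen/Errors.cs: the exception names and exit codes come from the table, while the constructor signature, the `ExitCode` property, the `ErrorMappingSketch.FromExitCode` helper, and the choice to derive `UnknownPdftractException` from `PdftractException` are assumptions. Primary constructors need C# 12, the default for net8.0.

```csharp
using System;

// Assumed base type; only the name comes from the notes.
public class PdftractException(string message, int exitCode) : Exception(message)
{
    public int ExitCode { get; } = exitCode;
}

public sealed class CorruptPdfException(string m, int c) : PdftractException(m, c);
public sealed class EncryptionException(string m, int c) : PdftractException(m, c);
public sealed class SourceUnreachableException(string m, int c) : PdftractException(m, c);
public sealed class RemoteFetchInterruptedException(string m, int c) : PdftractException(m, c);
public sealed class TlsException(string m, int c) : PdftractException(m, c);
public sealed class ReceiptVerifyException(string m, int c) : PdftractException(m, c);
public sealed class UnknownPdftractException(string m, int c) : PdftractException(m, c);

// Hypothetical helper, for illustration only.
public static class ErrorMappingSketch
{
    // Returns null for exit code 0, otherwise the exception to throw.
    public static PdftractException? FromExitCode(int exitCode, string stderr) => exitCode switch
    {
        0 => null,
        2 => new CorruptPdfException(stderr, exitCode),
        3 => new EncryptionException(stderr, exitCode),
        4 => new SourceUnreachableException(stderr, exitCode),
        5 => new RemoteFetchInterruptedException(stderr, exitCode),
        6 => new TlsException(stderr, exitCode),
        10 => new ReceiptVerifyException(stderr, exitCode),
        _ => new UnknownPdftractException(stderr, exitCode),
    };
}
```

Callers can then catch `PdftractException` for any CLI failure, or a specific subtype when they want to handle one case differently.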
Bug Fixes Made (2026-05-22)
- ArgumentList fix: Changed `ArgumentList = { args }` to properly iterate and add each argument individually
- BytesSource cleanup: Added `source?.Dispose()` in finally blocks to clean up temporary files
- VerifyReceiptAsync cleanup: Added finally block to delete temporary receipt file
- EnableRaisingEvents: Added `process.EnableRaisingEvents = true` for proper event handling
- Output newline handling: Changed `output.Append(e.Data)` to `output.AppendLine(e.Data)`
- FromUri method: Added `Source.FromUri(Uri)` overload as specified in requirements
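Two of the fixes above, adding arguments one by one and keeping line breaks in captured output, can be sketched in isolation. The helper names (`StartInfoSketch.Build`, `AppendOutputLine`) are hypothetical, and the settings other than `ArgumentList` are assumptions about a typical redirected subprocess rather than a transcript of Pdftract.cs.

```csharp
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;

// Hypothetical helper, for illustration only.
public static class StartInfoSketch
{
    public static ProcessStartInfo Build(string binaryPath, IEnumerable<string> args)
    {
        var psi = new ProcessStartInfo(binaryPath)
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false, // required when redirecting streams
            CreateNoWindow = true,
        };

        // ArgumentList is a get-only collection: add each argument separately
        // so every element reaches the child process as its own argv entry.
        foreach (var arg in args)
        {
            psi.ArgumentList.Add(arg);
        }

        return psi;
    }

    // OutputDataReceived delivers each line without its terminator (and null
    // at end of stream), so AppendLine restores the separator between lines.
    public static void AppendOutputLine(StringBuilder output, string? data)
    {
        if (data is not null)
        {
            output.AppendLine(data);
        }
    }
}
```

`AppendOutputLine` would typically be called from an `OutputDataReceived` handler, for example `process.OutputDataReceived += (_, e) => StartInfoSketch.AppendOutputLine(output, e.Data);`.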
Acceptance Criteria Status
| Criterion | Status | Notes |
|---|---|---|
| Package builds with `dotnet pack` | ⚠️ WARN | .NET SDK not installed on build server - needs verification on machine with dotnet CLI |
| All 9 methods exposed (async + sync) | ✅ PASS | Implemented in Pdftract.cs + Pdftract.Sync.cs |
| All 8 exception classes | ✅ PASS | Inherit from PdftractException base |
| Models as C# records | ✅ PASS | All types in Models/ are records |
| `dotnet test` runs conformance runner | ⚠️ WARN | Test project created, needs dotnet runtime to execute |
| CancellationToken support | ✅ PASS | Propagates to Process.Kill(entireProcessTree: true) on cancellation |
| Supports net8.0 and net9.0 | ✅ PASS | TargetFrameworks in .csproj |
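The CancellationToken criterion can be illustrated with a small sketch of one way to tie a token to `Process.Kill(entireProcessTree: true)`. The `CancellableRunSketch.RunAsync` helper is hypothetical, and the real client may structure this differently, for instance by registering a callback on the token instead of catching the cancellation.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class CancellableRunSketch
{
    // Starts the process and returns its exit code; if the token is cancelled
    // first, kills the process and its children, then rethrows the cancellation.
    public static async Task<int> RunAsync(ProcessStartInfo startInfo, CancellationToken cancellationToken)
    {
        using var process = new Process { StartInfo = startInfo, EnableRaisingEvents = true };
        process.Start();

        try
        {
            await process.WaitForExitAsync(cancellationToken);
            return process.ExitCode;
        }
        catch (OperationCanceledException)
        {
            try
            {
                if (!process.HasExited)
                {
                    process.Kill(entireProcessTree: true);
                }
            }
            catch (InvalidOperationException)
            {
                // The process exited between the HasExited check and Kill.
            }

            throw;
        }
    }
}
```

Killing the whole tree matters when the CLI spawns helpers of its own; without it, cancelling could leave orphaned child processes behind.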
PASS Items
- Complete implementation of 9 contract methods (async + sync variants)
- All 8 exception types with proper exit code mapping
- Source type discriminated union (PathSource, UrlSource, BytesSource) with FromPath, FromUrl, FromUri, FromBytes
- Options classes (ExtractOptions, SearchOptions, HashOptions) with PascalCase properties
- All model types as C# records with proper JSON serialization attributes
- JsonContext with source generation for Native AOT compatibility
- Async-first design with IAsyncEnumerable for streaming
- Sync wrapper methods for legacy compatibility with SuppressMessage attributes
- Conformance test project structure with xUnit
- README with comprehensive API documentation
- Solution file with both projects
- Bug fixes: subprocess invocation, cleanup, cancellation handling
WARN Items
- Build verification: .NET SDK not available on build server
  - Next step: Verify `dotnet build` and `dotnet pack` on machine with .NET SDK installed
- Test execution: Cannot run `dotnet test` without .NET runtime
  - Next step: Run conformance suite on machine with .NET SDK and pdftract binary installed
Files Modified/Created
Created Files
- `/home/coding/pdftract/pdftract-dotnet/src/Pdftract/Models/JsonContext.cs` - Source generation context
- `/home/coding/pdftract/pdftract-dotnet/src/Pdftract/Pdftract.Sync.cs` - Sync wrappers with ToBlockingEnumerable
Modified Files (2026-05-22)
- `/home/coding/pdftract/pdftract-dotnet/src/Pdftract/Pdftract.cs` - Fixed ArgumentList, cleanup, EnableRaisingEvents
- `/home/coding/pdftract/pdftract-dotnet/src/Pdftract/Source/Source.cs` - Added FromUri(Uri) overload
- `/home/coding/pdftract/pdftract-dotnet/tests/Pdftract.Tests/ConformanceTests.cs` - Added SourceFromUri test
- `/home/coding/pdftract/pdftract-dotnet/README.md` - Updated to include FromUri example
Existing Files (Previously Created)
- All model types (Document.cs, Page.cs, Span.cs, Block.cs, Metadata.cs, Match.cs, Fingerprint.cs, Classification.cs, Receipt.cs, ReceiptInfo.cs)
- Codegen/Errors.cs (8 exception types)
- Options.cs (ExtractOptions, SearchOptions, HashOptions)
- Project files and solution
Next Steps for Full Verification
1. On a machine with .NET SDK installed:

       cd /home/coding/pdftract/pdftract-dotnet
       dotnet build --configuration Release
       dotnet pack
       dotnet test

2. Verify binary resolution works with the pdftract CLI installed
3. Run conformance suite against real PDF fixtures from `/home/coding/pdftract/tests/sdk-conformance/fixtures/`
References
- Plan section: SDK Architecture / The Ten SDKs, line 3476
- Plan section: SDK Architecture / Per-SDK Release Channels, line 3573
- Plan section: SDK Acceptance Criteria, line 3587
- Contract: `/home/coding/pdftract/docs/conformance/sdk-contract.md`
- Schema: `/home/coding/pdftract/tests/sdk-conformance/schema.json`
- Conformance suite: `/home/coding/pdftract/tests/sdk-conformance/cases.json`