Implement the conformance test runner pattern that every SDK will
implement to validate against the shared test suite.
- Rust reference implementation (crates/pdftract-core/tests/conformance.rs)
* Full test suite loader and executor
* Comparison engine with min/max, string constraints, tolerances
* Skip logic for unsupported features and schema versions
* Report generation in JSON format
- CLI compare subcommand (crates/pdftract-cli/src/main.rs)
* pdftract compare - Compare actual vs expected with tolerances
* Cross-language comparison tool to avoid reimplementations
- Documentation (docs/conformance/sdk-contract.md)
* Complete pattern specification with pseudocode
* Per-language runner locations
* CI integration requirements
- Python reference stub (tests/python-conformance/test_conformance.py)
* Full pytest-based implementation following the pattern
Closes: pdftract-5omc
- book.toml with title, authors, build directory, edit-url-template
- src/SUMMARY.md with complete TOC for all planned sections
- src/introduction.md: what pdftract does and doesn't do (Non-Goals)
- src/installation.md: cargo, pip, Homebrew, Docker; KU-12 caveat verbatim
- src/quickstart.md: five-minute walkthrough with executable commands
- 39 draft placeholder files for CLI reference, schema, profiles, SDKs, advanced topics, troubleshooting, FAQ
mdbook build completes cleanly with zero warnings (linkcheck optional).
See notes/pdftract-1g87.md for verification details.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Verified the pdftract-ci WorkflowTemplate exists in declarative-config
and is correctly synced to the iad-ci cluster. All scaffolding
requirements met for Phase 0.1.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
HIGH:
- Add outline/bookmark traversal spec to Phase 1.4 (linked list walk, PDFDocEncoding vs UTF-16BE)
- Specify base64 encoding for attachment data field in Phase 7.5
- Move decompression limit to ExtractionOptions.max_decompress_bytes (universal, not serve-only);
add max_decompress_gb to CLI/Python/HTTP API surfaces
LOW:
- Split log+env_logger into two dep matrix rows for accurate crate count
- Add full_render to Python keyword args and HTTP form fields (with no-op note)
- Clarify v0.1.0 milestone: "all applicable" targets (OCR speed target excluded until v0.2.0)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New research document covering parallel extraction architecture:
rayon page-level parallelism, Arc<> shared xref/font/object-stream
caches, RwLock font cache design, Tesseract thread-local OCR pool,
semaphore memory budget, ordered NDJSON streaming slot array, and
catch_unwind error isolation per page.
Also adds docs/research-index.md: a 622-line navigable index of all
83 research documents grouped into 9 thematic categories, with a
"Start Here" reading path, per-phase implementation reading tables,
and an alphabetical lookup table covering every document.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two new research documents covering Japanese Ruby text and East Asian
typography (tagged/untagged furigana extraction, Kinsoku Shori spacing,
full-width normalization, tate-chu-yoko, CJK/Latin boundary detection,
ruby_text output field) and PDF/VT variable and transactional printing
(DPart hierarchy traversal, per-record extraction model, DPM metadata,
variable vs. static content classification, postal address extraction,
records array output schema).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two new research documents covering Southeast Asian script extraction
(Thai/Khmer/Myanmar/Lao/Tibetan/Ethiopic — cluster structure, no-space
word boundary policy for Thai/Lao, Zawgyi vs Unicode detection for
Myanmar, USE shaping, Tesseract fallback) and OpenType MATH table
exploitation for formula extraction (MathConstants for fraction/
subscript/radical layout, TeX OML/OMS/OMX encoding tables, MathML
output generation, GlyphAssembly reconstruction, alternative text
and MathJax XMP source recovery).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two new research documents covering the glyph-to-span-to-block assembly
pipeline (inter-operator merging, adaptive word gap threshold, column
detection, ligature bbox splitting, multi-granularity output) and
Unicode post-processing (NFC normalization, selective NFKC decomposition
for ligatures, PUA preservation, soft hyphen resolution, ZWJ/ZWNJ
handling, combining character reordering).
Also adds docs/plan/implementation-plan.md: the full 7-phase Rust
implementation roadmap covering core parser, font/encoding pipeline,
content stream processing, text assembly, OCR integration, API surface,
and advanced features — with crate selections, complexity ratings,
test strategy, and v0.1–v1.0 release milestones.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four new extraction research documents covering PDF article thread
traversal for multi-flow magazine layouts, resource dictionary
inheritance and ResourceStack semantics for nested Form XObjects,
document catalog and page tree structure (UserUnit, Contents array,
page inheritance), and hyperlink/named destination extraction with
QuadPoints anchor text and link density classification.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four new extraction research documents covering cross-reference table
and xref stream parsing with error recovery, PDF object model and lexer
correctness (all 8 types, string escapes, stream /Length recovery),
FontDescriptor fields and embedded font data (Type1/TrueType/CFF/OT),
and PDF/UA-2 / PDF 2.0 structure changes (MathML, NFC normalization,
new structure types, artifact classification improvements).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four new extraction research documents covering ICC profile and color
space luminance estimation for text visibility, precise text state
tracking and bounding box computation (Tc/Tw/Tz/TL, font units, TJ
kerning, baseline clustering), PDF/X prepress handling (OutputIntent,
TrimBox, spot colors, article threading), and a complete content stream
operator reference (BT/ET, Tj/TJ/'/", BI/ID/EI, BX/EX, marked content).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four new extraction research documents covering page label/PageLabels
number tree and outline/bookmark tree extraction, government form PDF
patterns (IRS, USCIS, court filings, classification markings), book and
publishing PDF structure (running heads, footnotes, index extraction),
and PDF stream filter pipeline (FlateDecode/LZW predictors, JBIG2 global
segments, CCITTFax, JPX, error boundaries).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four new extraction research documents covering PDF portfolio and
attachment enumeration (ZUGFeRD, PDF/A-3 AFRelationship), incremental
update structure and xref chaining, PDF/UA tagged PDF deep dive with
all 36 structure types and MCID mechanics, and JavaScript/AcroForm/XFA
field extraction without script execution.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four new extraction research documents covering color space and contrast
analysis for text visibility, medical/scientific document structure
(ICH E3, IMRaD, FDA labeling, eCTD), multilingual mixed-script extraction
with UBA bidi handling and CJK vertical text, and digital signature
metadata extraction with DocMDP integrity context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four new extraction research documents covering text rendering modes
(Tr 0-7 including invisible OCR layers), legal/financial document
extraction patterns, character-level confidence aggregation with output
schema, and PDF/E engineering document handling (CAD, GD&T, schematics).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three new extraction research documents covering subset font Unicode
recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and
proper vs. improper redaction detection with output schema.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- word-boundary-reconstruction: expected position formula with Tc/Tw/Tz,
TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold
strategies including adaptive histogram, multi-column gap discrimination
- scanned-vs-vector-page-classification: four-category taxonomy, fast
pre-checks, image coverage AABB computation, character density ratio,
validity rate, glyph bbox plausibility, region routing map, confidence
scoring with cost-aware OCR threshold
- pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP
pdfaid detection, Level B/U/A guarantee implications for extraction,
font embedding requirements, artifact tagging, PDF/A-3 embedded files,
PdfaLevel enum with per-level fast-path branching
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples
for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI
options, and implementation sequencing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Four research documents covering PDF spec fundamentals, font types and
encoding, glyph Unicode recovery, and tagged PDF structure/reading order.
SDK invocation notes with subprocess and HTTP examples for Python, Node.js,
Go, Ruby, Java, Rust, and Bash.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>