jedarden
006dfb286c
Add research: color visibility, medical/scientific, multilingual, digital signatures
Four new extraction research documents covering color space and contrast
analysis for text visibility, medical/scientific document structure
(ICH E3, IMRaD, FDA labeling, eCTD), multilingual mixed-script extraction
with UBA bidi handling and CJK vertical text, and digital signature
metadata extraction with DocMDP integrity context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:41:43 -04:00
jedarden
eac3235291
Add research: rendering modes, legal/financial patterns, confidence scoring, engineering docs
Four new extraction research documents covering text rendering modes
(Tr 0-7 including invisible OCR layers), legal/financial document
extraction patterns, character-level confidence aggregation with output
schema, and PDF/E engineering document handling (CAD, GD&T, schematics).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:35:48 -04:00
jedarden
8f8138a65e
Add research: font subsetting, LaTeX patterns, redaction detection
Three new extraction research documents covering subset font Unicode
recovery, pdfLaTeX/XeLaTeX encoding tables and two-column layout, and
proper vs. improper redaction detection with output schema.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:30:52 -04:00
jedarden
04b60a1cf7
Add three research documents: CJK encoding, pipeline synthesis, linearization
- cjk-and-asian-script-encoding: all six CJK encoding systems, Type 0
composite font pipeline, predefined CMap tables for Japan1/GB1/CNS1/Korea1,
Shift-JIS/GB18030/Big5 byte structure, missing ToUnicode recovery via
Adobe CID tables, full-width normalization, vertical text detection
- extraction-pipeline-overview: end-to-end 9-stage synthesis referencing
all 36 research documents; stages: file open, metadata, page classification,
content extraction (4 sub-paths), font pipeline, span assembly, normalization
and quality, supplementary content, output serialization; ASCII data-flow
diagram
- linearized-pdf-and-streaming: linearization dict keys, hint stream
bitfield tables, first-page xref lazy parsing, HTTP range request pattern,
staleness validation, incremental update interaction, NDJSON streaming,
partial file extraction, lazy PageIter API with rayon par_bridge
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:26:36 -04:00
jedarden
116db89c95
Add three research documents on routing and text reconstruction
- word-boundary-reconstruction: expected position formula with Tc/Tw/Tz,
TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold
strategies including adaptive histogram, multi-column gap discrimination
- scanned-vs-vector-page-classification: four-category taxonomy, fast
pre-checks, image coverage AABB computation, character density ratio,
validity rate, glyph bbox plausibility, region routing map, confidence
scoring with cost-aware OCR threshold
- pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP
pdfaid detection, Level B/U/A guarantee implications for extraction,
font embedding requirements, artifact tagging, PDF/A-3 embedded files,
PdfaLevel enum with per-level fast-path branching
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:22:08 -04:00
jedarden
9420964b73
Add three research documents on parser correctness fundamentals
- graphics-state-tracking: full q/Q stack, text state operators, color
space tracking, ExtGState keys, clip path management, CTM concatenation,
blend mode/soft mask visibility, Form XObject isolation, GraphicsState
Rust struct with is_text_visible implementation
- cmap-format-and-cid-encoding: CMap file structure, codespace range
scan grammar, bfchar/bfrange/cidchar/cidrange semantics, usecmap
inheritance with predefined CJK CMap inventory, mixed-length parsing
state machine, ToUnicode defect handling, Rust CMap struct design
- content-stream-concatenation: multi-stream concatenation with 0x0A
injection, continuous graphics state across boundaries, resource
inheritance page-tree walk, Form XObject and Type 3 resource isolation,
ResourceStack design, EI disambiguation in binary data, lazy decompression
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:16:41 -04:00
jedarden
f805e52fa3
Add four research documents focused on readable text production
- type3-font-extraction: CharProcs stream parsing, TeX/dvips naming
conventions, dHash shape fingerprinting, nested font stacks, OCR fallback
- watermark-and-background-separation: five PDF watermark mechanisms,
transparency tracking, cross-page repetition, WCAG contrast detection,
raster inpainting, diagonal watermark removal pipeline
- historical-and-degraded-document-extraction: eight degradation categories,
bleed-through removal, illumination correction, Sauvola binarization,
stroke reconstruction, Fraktur/long-s handling, confidence-gated output
- complex-layout-reading-order: baseline clustering, XY-cut, Docstrum,
RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering,
perplexity-based confidence with natural_order fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:13:10 -04:00
jedarden
31e715633d
Add four research documents on text quality and document-type handling
- text-readability-validation: character/word/entropy/perplexity checks,
symbol font detection, remediation decision tree, span quality metadata
- post-ocr-text-correction: error taxonomy, confusable tables, noisy channel
n-gram model, regex patterns, hyphenation, layout-based correction pipeline
- presentation-and-spreadsheet-pdfs: detection heuristics, slide structure,
bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries,
cell type inference, Rust output schema
- semantic-text-reconstruction: beam search n-gram reconstruction, NER
validation, domain lexicons, cross-span consistency, abbreviation expansion,
citation repair, coherence scoring, ReconstructedSpan output schema
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:07:30 -04:00
jedarden
a7673c906f
Add 12 research documents covering full PDF extraction surface
Infrastructure and parsing:
- raster-ocr-pipeline: trigger detection, preprocessing, Tesseract integration,
assisted OCR, HOCR alignment, multi-language, performance
- image-and-figure-extraction: XObjects, inline images, filter decoding,
color spaces, geometry, form XObjects, transparency, figure detection
- form-fields-and-annotations: AcroForm types, XFA, widget appearance
streams, rich text, annotation text, output schema
- pdf-encryption-and-security: R2-R6 key derivation, object-level
decryption, permission flags, RustCrypto implementation approach
- page-geometry-and-document-structure: page tree, all five page boxes,
rotation, coordinate inversion, page labels, outlines, named destinations
- optional-content-groups: OCG/OCMD visibility, usage dictionary, default
state resolution, content stream marking, multilingual layer patterns
- invisible-and-hidden-text: all 8 Tr modes, PDF/A invisible layer pattern,
white-on-white, zero-opacity, clipped text, color tracking
- malformed-pdf-repair-and-recovery: xref recovery, stream length repair,
syntax tolerance, partial extraction, structured warnings
Quality and metadata:
- xmp-and-document-metadata: /Info vs XMP, all namespaces, RDF/XML
parsing, conflict resolution, encrypted metadata, thumbnails
- embedded-files-and-portfolios: EmbeddedFile streams, Filespec,
AF relationships, Portfolio detection, ZUGFeRD/Factur-X, security
- performance-and-streaming-architecture: mmap, lazy loading, NDJSON
streaming, rayon parallelism, font caching, axum HTTP server
- benchmark-and-test-methodology: CER/WER/TEDS metrics, corpus
categories, reading order scoring, regression CI, public datasets
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 15:05:42 -04:00
jedarden
b805593973
Add six research documents covering output-side extraction topics
- table-structure-reconstruction: line detection, gap analysis, Hough
transform, graph-based cell reconstruction, merged cells, multi-page tables
- mathematical-expression-handling: five encoding cases, OpenType MATH table,
symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers
- language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi,
CJK vertical text, ligature normalization, whatlang/lingua integration
- document-classification-and-zone-labeling: margin heuristics, font
clustering, cross-page recurrence, footnote/caption/sidebar detection
- post-extraction-normalization: hyphen handling, ligature expansion,
paragraph reconstruction, Unicode normalization, pipeline ordering
- chunking-for-llm-consumption: semantic snapping, heading hierarchy,
sliding window overlap, table chunking strategies, token budget, late chunking
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:56:25 -04:00
jedarden
ef9c03095d
Add SDK architecture notes covering top 10 languages
Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples
for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI
options, and implementation sequencing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:51:25 -04:00
jedarden
f87579b100
Rewrite README to lead with capabilities, drop competitor references
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:46:33 -04:00
jedarden
c2870e6640
Add research docs and SDK invocation notes
Four research documents covering PDF spec fundamentals, font types and
encoding, glyph Unicode recovery, and tagged PDF structure/reading order.
SDK invocation notes with subprocess and HTTP examples for Python, Node.js,
Go, Ruby, Java, Rust, and Bash.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:33:34 -04:00
jedarden
4ae798c8b1
Initial repo scaffold with README and docs structure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 14:26:16 -04:00