jedarden/pdftract

10 commits 1 branch 0 tags 7.5 GiB

Author	SHA1	Message	Date
jedarden	116db89c95	Add three research documents on routing and text reconstruction - word-boundary-reconstruction: expected position formula with Tc/Tw/Tz, TJ kerning gap detection, Td/Tm jump analysis, four space-width threshold strategies including adaptive histogram, multi-column gap discrimination - scanned-vs-vector-page-classification: four-category taxonomy, fast pre-checks, image coverage AABB computation, character density ratio, validity rate, glyph bbox plausibility, region routing map, confidence scoring with cost-aware OCR threshold - pdfa-compliance-and-extraction: ISO 19005 part/level matrix, XMP pdfaid detection, Level B/U/A guarantee implications for extraction, font embedding requirements, artifact tagging, PDF/A-3 embedded files, PdfaLevel enum with per-level fast-path branching Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:22:08 -04:00
jedarden	9420964b73	Add three research documents on parser correctness fundamentals - graphics-state-tracking: full q/Q stack, text state operators, color space tracking, ExtGState keys, clip path management, CTM concatenation, blend mode/soft mask visibility, Form XObject isolation, GraphicsState Rust struct with is_text_visible implementation - cmap-format-and-cid-encoding: CMap file structure, codespace range scan grammar, bfchar/bfrange/cidchar/cidrange semantics, usecmap inheritance with predefined CJK CMap inventory, mixed-length parsing state machine, ToUnicode defect handling, Rust CMap struct design - content-stream-concatenation: multi-stream concatenation with 0x0A injection, continuous graphics state across boundaries, resource inheritance page-tree walk, Form XObject and Type 3 resource isolation, ResourceStack design, EI disambiguation in binary data, lazy decompression Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:16:41 -04:00
jedarden	f805e52fa3	Add four research documents focused on readable text production - type3-font-extraction: CharProcs stream parsing, TeX/dvips naming conventions, dHash shape fingerprinting, nested font stacks, OCR fallback - watermark-and-background-separation: five PDF watermark mechanisms, transparency tracking, cross-page repetition, WCAG contrast detection, raster inpainting, diagonal watermark removal pipeline - historical-and-degraded-document-extraction: eight degradation categories, bleed-through removal, illumination correction, Sauvola binarization, stroke reconstruction, Fraktur/long-s handling, confidence-gated output - complex-layout-reading-order: baseline clustering, XY-cut, Docstrum, RLSA smearing, mixed-layout detection, sidebar/inset/footnote ordering, perplexity-based confidence with natural_order fallback Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:13:10 -04:00
jedarden	31e715633d	Add four research documents on text quality and document-type handling - text-readability-validation: character/word/entropy/perplexity checks, symbol font detection, remediation decision tree, span quality metadata - post-ocr-text-correction: error taxonomy, confusable tables, noisy channel n-gram model, regex patterns, hyphenation, layout-based correction pipeline - presentation-and-spreadsheet-pdfs: detection heuristics, slide structure, bullet hierarchy, speaker notes, hairline grid detection, sheet boundaries, cell type inference, Rust output schema - semantic-text-reconstruction: beam search n-gram reconstruction, NER validation, domain lexicons, cross-span consistency, abbreviation expansion, citation repair, coherence scoring, ReconstructedSpan output schema Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:07:30 -04:00
jedarden	a7673c906f	Add 12 research documents covering full PDF extraction surface Infrastructure and parsing: - raster-ocr-pipeline: trigger detection, preprocessing, Tesseract integration, assisted OCR, HOCR alignment, multi-language, performance - image-and-figure-extraction: XObjects, inline images, filter decoding, color spaces, geometry, form XObjects, transparency, figure detection - form-fields-and-annotations: AcroForm types, XFA, widget appearance streams, rich text, annotation text, output schema - pdf-encryption-and-security: R2-R6 key derivation, object-level decryption, permission flags, RustCrypto implementation approach - page-geometry-and-document-structure: page tree, all five page boxes, rotation, coordinate inversion, page labels, outlines, named destinations - optional-content-groups: OCG/OCMD visibility, usage dictionary, default state resolution, content stream marking, multilingual layer patterns - invisible-and-hidden-text: all 8 Tr modes, PDF/A invisible layer pattern, white-on-white, zero-opacity, clipped text, color tracking - malformed-pdf-repair-and-recovery: xref recovery, stream length repair, syntax tolerance, partial extraction, structured warnings Quality and metadata: - xmp-and-document-metadata: /Info vs XMP, all namespaces, RDF/XML parsing, conflict resolution, encrypted metadata, thumbnails - embedded-files-and-portfolios: EmbeddedFile streams, Filespec, AF relationships, Portfolio detection, ZUGFeRD/Factur-X, security - performance-and-streaming-architecture: mmap, lazy loading, NDJSON streaming, rayon parallelism, font caching, axum HTTP server - benchmark-and-test-methodology: CER/WER/TEDS metrics, corpus categories, reading order scoring, regression CI, public datasets Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 15:05:42 -04:00
jedarden	b805593973	Add six research documents covering output-side extraction topics - table-structure-reconstruction: line detection, gap analysis, Hough transform, graph-based cell reconstruction, merged cells, multi-page tables - mathematical-expression-handling: five encoding cases, OpenType MATH table, symbol font recovery, spatial heuristics, LaTeX reconstruction, fallback tiers - language-detection-and-script-handling: UAX #24/#9, Arabic/Hebrew bidi, CJK vertical text, ligature normalization, whatlang/lingua integration - document-classification-and-zone-labeling: margin heuristics, font clustering, cross-page recurrence, footnote/caption/sidebar detection - post-extraction-normalization: hyphen handling, ligature expansion, paragraph reconstruction, Unicode normalization, pipeline ordering - chunking-for-llm-consumption: semantic snapping, heading hierarchy, sliding window overlap, table chunking strategies, token budget, late chunking Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:56:25 -04:00
jedarden	ef9c03095d	Add SDK architecture notes covering top 10 languages. Covers TypeScript, C#, C++, PHP, and Kotlin gaps with full code examples for both subprocess and HTTP tracks, NuGet RID packaging detail, PHP FFI options, and implementation sequencing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:51:25 -04:00
jedarden	f87579b100	Rewrite README to lead with capabilities, drop competitor references Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:46:33 -04:00
jedarden	c2870e6640	Add research docs and SDK invocation notes. Four research documents covering PDF spec fundamentals, font types and encoding, glyph Unicode recovery, and tagged PDF structure/reading order. SDK invocation notes with subprocess and HTTP examples for Python, Node.js, Go, Ruby, Java, Rust, and Bash. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:33:34 -04:00
jedarden	4ae798c8b1	Initial repo scaffold with README and docs structure Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 14:26:16 -04:00