The extract_markdown stub was calling extract_text instead of
extract_text_fn, causing a compilation error. This fixes the
function name to match the exported function from extract_text.rs.
This completes the extract_text PyO3 entry point implementation,
which was already present in extract_text.rs and lib.rs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The emit! macro expects diagnostic codes without the DiagCode:: prefix.
Changed three occurrences in codespace.rs:
- Line 281: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace
- Line 290: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace
- Line 412: DiagCode::CmapInvalidCodespace → CmapInvalidCodespace
This fixes compilation errors that prevented the codebase from building.
The --pages, --header, and URL credential parsing features are fully
implemented in pages.rs, header.rs, and url.rs modules with comprehensive
tests and integration in main.rs, grep/mod.rs, and hash.rs.
References: pdftract-25igv, notes/pdftract-25igv.md
Add explicit enum constraints to page_type, severity, and confidence_source
fields in the generated JSON Schema for better validation.
Changes:
- Modified xtask/src/bin/gen_schema.rs to add explicit enum constraints
during schema generation via add_enum_constraints() function
- page_type enum: ["text", "scanned", "mixed", "broken_vector", "blank", "figure_only"]
- severity enum: ["info", "warning", "error", "fatal"]
- confidence_source enum: ["native", "heuristic", "ocr"]
- Regenerated docs/schema/v1.0/pdftract.schema.json with enum constraints
- Added .github/workflows/schema-gen.yml CI workflow for schema validation
The CI workflow validates:
1. Generated schema matches committed file (fails on diff)
2. JSON syntax is valid
3. Schema structure is correct ($id, $schema, title, $defs)
4. Enum constraints are present and have correct values
This ensures schema changes are reviewable in PRs and forces
developers to commit the updated schema when type definitions change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement the xtask gen-schema binary at xtask/src/bin/gen_schema.rs that
derives JSON Schema Draft 2020-12 from the Rust ExtractionResult type via
the schemars crate.
Changes:
- Add stable key sorting (sort_keys_recursive) for byte-identical output
- Set $id to stable URL: https://pdftract.com/schema/v1.0/pdftract.schema.json
- Set title to "pdftract Output v1.0"
- Add cargo alias `gen-schema` for convenient invocation
- Emit schema to docs/schema/v1.0/pdftract.schema.json
The schema is generated from the Rust types with schemars derives, ensuring
the JSON schema is always in sync with the source types.
Acceptance criteria:
- cargo gen-schema regenerates docs/schema/v1.0/pdftract.schema.json
- Generated schema validates against JSON Schema Draft 2020-12
- Schema $id is the stable URL
- Title is "pdftract Output v1.0"
- Stable ordering: regenerating twice produces byte-identical output
- All expected types appear in $defs (BlockJson, SpanJson, PageResult, etc.)
Note: page_type and confidence_source enums are not yet implemented in the
Rust types (marked as TODO in schema/mod.rs). These will be added by sibling
beads pdftract-1ob and pdftract-1f8we respectively.
Closes: pdftract-5nv9h
Add cargo xtask gen-shape-db command that walks font directories,
rasterizes glyphs at 32x32 via fontdue, computes pHash, and outputs
build/glyph-shapes.json.
Implementation details:
- Fontdue integration for TrueType/OpenType font loading
- 32x32 bitmap rasterization with centering
- DCT-based pHash computation (32x32 DCT → 8x8 low-freq → median threshold)
- Character frequency data for collision resolution
- Deduplication by (phash, char) pairs
- Cross-character collision handling (keep higher-frequency char)
- Sorted output by pHash ascending
Artifacts:
- build/frequency.json: Character frequency rankings
- build/README.md: Command documentation and usage
Acceptance criteria:
- ✅ cargo xtask gen-shape-db --fonts <dir> produces valid JSON
- ✅ Deterministic output (byte-identical on same inputs)
- ✅ Fontdue integration and 32x32 rasterization
- ✅ pHash computation via DCT
- ⚠️ No system fonts for full integration test (documented)
Closes: pdftract-2aq0