pdftract/crates/pdftract-cli
jedarden 2f010c51fb feat(pdftract-206o6): implement scientific_paper profile with fixtures and tests
Author profiles/builtin/scientific_paper.yaml per Phase 7.10 YAML schema:
- Match predicates: text_contains (Abstract, References, doi:, arXiv:, Bibliography)
- Structural predicates: has_math, heading_depth, page_count
- Extraction tuning: xy_cut reading order for 2-column layout
- Fields: title, authors, abstract, doi, journal, publication_date, references
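As a rough illustration of how a text_contains match predicate over the markers listed above might behave, here is a minimal Rust sketch. It is not the profile engine's actual code: the `MARKERS` constant and the `matched_markers` function are hypothetical names, and case-sensitive substring matching is an assumption.

```rust
// Hypothetical marker list, copied from the text_contains predicates named above.
const MARKERS: [&str; 5] = ["Abstract", "References", "doi:", "arXiv:", "Bibliography"];

/// Hypothetical helper: returns the markers that occur in `text`,
/// in the order they are declared in `MARKERS`.
/// Assumption: matching is a plain, case-sensitive substring search.
fn matched_markers(text: &str) -> Vec<&'static str> {
    MARKERS
        .iter()
        .copied()
        .filter(|m| text.contains(*m))
        .collect()
}
```

A caller could treat a document as a candidate scientific paper when this list is non-empty, or require some minimum number of hits; which rule the real profile applies is not stated here.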

Add 5 fixtures covering diverse scientific paper types:
- arXiv preprint (CC-BY license)
- PLOS ONE journal article
- IEEE-style 2-column paper
- Nature-style single-column with sidebar
- ACM/IEEE conference proceedings

Add comprehensive regression tests in test_scientific_paper.rs:
- Profile schema validation
- Fixture structure verification
- Expected output consistency checks
- Match predicate validation
- Fixture diversity verification
- xy_cut reading order verification
- DOI regex format validation
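For context on the DOI format check, the sketch below validates the general `10.<registrant>/<suffix>` shape by hand, using only the standard library. It is an assumption-laden illustration, not the regex shipped in scientific_paper.yaml: the function name `looks_like_doi`, the 4 to 9 digit registrant length, and the restriction of the suffix to printable ASCII are all choices made for this example.

```rust
/// Hypothetical helper: returns true if `s` has the shape `10.<digits>/<suffix>`.
/// Assumptions: the registrant code is 4 to 9 ASCII digits, and the suffix is
/// non-empty printable ASCII with no whitespace.
fn looks_like_doi(s: &str) -> bool {
    // A DOI name starts with the directory indicator "10.".
    let rest = match s.strip_prefix("10.") {
        Some(rest) => rest,
        None => return false,
    };
    // Split at the first '/' into registrant code and suffix.
    let (registrant, suffix) = match rest.split_once('/') {
        Some(parts) => parts,
        None => return false,
    };
    let registrant_ok = (4..=9).contains(&registrant.len())
        && registrant.bytes().all(|b| b.is_ascii_digit());
    let suffix_ok = !suffix.is_empty() && suffix.chars().all(|c| c.is_ascii_graphic());
    registrant_ok && suffix_ok
}
```

This check expects a bare DOI; a value extracted with a leading `doi:` label would need that label stripped first.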

Co-Authored-By: Claude Code (claude-opus-4-7) <noreply@anthropic.com>
2026-05-27 20:19:10 -04:00
benches feat(pdftract-3h9xo): implement threads JSON output + schema integration 2026-05-25 13:40:15 -04:00
src feat(pdftract-4a3je): implement multipart parsing with PDF magic-byte validation 2026-05-27 20:19:10 -04:00
tests feat(pdftract-206o6): implement scientific_paper profile with fixtures and tests 2026-05-27 20:19:10 -04:00
build.rs docs(pdftract-32y9): finalize SDK architecture note with workspace layout, cross-compile matrix, and KU-12 alignment 2026-05-24 06:38:23 -04:00
Cargo.toml feat(pdftract-3b1mk): implement TH-09 inspector XSS test with CSP headers 2026-05-26 20:38:21 -04:00
pdftract-cli.cdx.json feat(pdftract-67tm8): implement MCP stdio transport with integration tests 2026-05-23 00:16:42 -04:00