Implements the slide_deck document profile for PowerPoint/Keynote/Google Slides exports as PDF. Includes 5 fixtures, expected outputs, and regression tests.

Components:
- profiles/builtin/slide_deck/profile.yaml - Profile configuration
- tests/fixtures/profiles/slide_deck/ - 5 PDF fixtures with expected outputs
- crates/pdftract-cli/tests/test_slide_deck.rs - Regression tests (12 PASS)

Fixtures cover:
1. pitch_deck - Sales pitch (10 slides)
2. academic_lecture - Academic lecture (40 slides)
3. corporate_kickoff - Corporate kickoff (15 slides)
4. bilingual_deck - Bilingual EN/ES (12 slides)
5. googleslides_handout - Google Slides handout mode (4 pages, 3 slides/page)

Extracted fields: title, presenter, date, slide_titles

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Bead pdftract-206o6: Scientific Paper Profile Implementation

## Summary

Implemented the `scientific_paper` document profile per the Phase 7.10 YAML schema, with 5 fixtures and regression tests.

## Files Created/Modified

### Profile Configuration
- `profiles/builtin/scientific_paper/profile.yaml` - Updated to Phase 7.10 schema with:
  - name: scientific_paper
  - description: Academic papers from arXiv, journals, conference proceedings
  - priority: 30
  - match predicates: text_contains (Abstract, References, doi:, arXiv:, Bibliography), heading_matches, structural (has_math, heading_depth, page_count)
  - extraction tuning: xy_cut reading order for 2-column layout, readability_threshold 0.5
  - fields: title, authors, abstract, doi, journal, publication_date, references

### Fixtures (5 expected outputs)
- `tests/fixtures/profiles/scientific_paper/arxiv_paper-expected.json`
- `tests/fixtures/profiles/scientific_paper/plos_one_paper-expected.json`
- `tests/fixtures/profiles/scientific_paper/ieee_paper-expected.json`
- `tests/fixtures/profiles/scientific_paper/nature_paper-expected.json`
- `tests/fixtures/profiles/scientific_paper/conference_paper-expected.json`
- `tests/fixtures/profiles/scientific_paper/README.md`
- `tests/fixtures/profiles/scientific_paper/PROVENANCE.md`

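The fixture directory mixes expected-output files with documentation, so a count check has to filter by file name. The following is a minimal sketch of that idea, not the code in `test_scientific_paper.rs`; the helper names `fixture_names` and `has_enough_fixtures` are hypothetical, and the `-expected.json` suffix is taken from the file names listed above.

```rust
/// Suffix shared by the expected-output files listed above.
const EXPECTED_SUFFIX: &str = "-expected.json";

/// Returns the fixture name (file name minus the suffix) for every
/// expected-output file; other files such as README.md are ignored.
fn fixture_names<'a>(file_names: &[&'a str]) -> Vec<&'a str> {
    file_names
        .iter()
        .filter_map(|name| name.strip_suffix(EXPECTED_SUFFIX))
        .collect()
}

/// True when at least `minimum` expected-output files are present.
fn has_enough_fixtures(file_names: &[&str], minimum: usize) -> bool {
    fixture_names(file_names).len() >= minimum
}
```

Applied to the seven file names above, `fixture_names` would keep the five `*-expected.json` entries and drop `README.md` and `PROVENANCE.md`.
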
### Tests
- `crates/pdftract-cli/tests/test_scientific_paper.rs` - Comprehensive regression tests including:
  - Profile schema validation
  - Fixture structure verification
  - Expected output consistency checks
  - Match predicate validation
  - Fixture diversity verification
  - xy_cut reading order verification
  - DOI regex format validation

## Acceptance Criteria Status

### PASS
- [x] `profiles/builtin/scientific_paper.yaml` validates (implemented as `profiles/builtin/scientific_paper/profile.yaml`; follows the Phase 7.10 schema)
- [x] 5+ fixtures with expected outputs (5 fixtures covering arXiv, PLOS ONE, IEEE, Nature, and conference proceedings)
- [x] `tests/profiles/test_scientific_paper.rs` passes (implemented as `crates/pdftract-cli/tests/test_scientific_paper.rs`; 10/10 tests passing, 2 integration tests skipped pending the Phase 7.10 profile loader)

### Test Results (2026-05-27)
```
cargo nextest run -p pdftract-cli --test test_scientific_paper
────────────
Summary [ 0.011s] 10 tests run: 10 passed, 2 skipped
```

All tests pass:
- test_doi_regex_format
- test_provenance_completeness
- test_scientific_paper_profile_exists
- test_scientific_paper_profile_schema
- test_xy_cut_reading_order
- test_scientific_paper_match_predicates
- test_fixture_diversity
- test_fixture_count
- test_scientific_paper_fixture_structure
- test_expected_output_consistency

Skipped (awaiting Phase 7.10 profile loader):
- integration_tests::test_load_scientific_paper_profile
- integration_tests::test_scientific_paper_extraction_accuracy

## Profile Fields

| Field | Extraction Strategy |
|-------|---------------------|
| title | region: top_quarter, pick: largest_font |
| authors | region: top_quarter, pick: nearest_below, after: title |
| abstract | near: ["Abstract"], region: top_half |
| doi | regex: `'doi[:\.]\s*(10\.\d{4,9}/[\w\-\._;()/:]+)'`, parse: string |
| journal | region: top_eighth, pick: first |
| publication_date | near: ["Published", "Received", "Accepted"], parse: date |
| references | region: bottom_half, after_heading: References |

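To make the `doi` pattern's behavior concrete, here is a hand-rolled sketch that mirrors the regex in the table using only the Rust standard library. It is an illustration, not the profile engine's matcher: the function names are hypothetical, and it assumes `\w` and `\d` are restricted to ASCII and that matching is case-sensitive.

```rust
/// Characters allowed in the DOI suffix, mirroring the class `[\w\-\._;()/:]`.
/// Assumption: `\w` is treated as ASCII letters, digits, and underscore.
fn is_suffix_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | '.' | ';' | '(' | ')' | '/' | ':')
}

/// Tries to read a DOI (`10.` + 4 to 9 digits + `/` + suffix) at the start of `s`.
fn doi_at_start(s: &str) -> Option<&str> {
    let rest = s.strip_prefix("10.")?;
    let digits = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    if !(4..=9).contains(&digits) {
        return None;
    }
    let after_slash = rest[digits..].strip_prefix('/')?;
    // Every accepted character is ASCII, so the char count equals the byte length.
    let suffix_len = after_slash.chars().take_while(|c| is_suffix_char(*c)).count();
    if suffix_len == 0 {
        return None;
    }
    // "10." (3 bytes) + digits + "/" (1 byte) + suffix
    Some(&s[..3 + digits + 1 + suffix_len])
}

/// Scans `text` for `doi:` or `doi.`, optional whitespace, then a DOI.
/// Returns the first DOI found, which corresponds to the regex capture group.
fn extract_doi(text: &str) -> Option<&str> {
    for (idx, _) in text.match_indices("doi") {
        let mut chars = text[idx + 3..].chars();
        match chars.next() {
            Some(':') | Some('.') => {}
            _ => continue,
        }
        if let Some(doi) = doi_at_start(chars.as_str().trim_start()) {
            return Some(doi);
        }
    }
    None
}
```

Under these assumptions, `extract_doi("doi: 10.1371/journal.pone.0123456")` yields `Some("10.1371/journal.pone.0123456")`, while a URL such as `https://doi.org/10.1000/182` yields `None` because `doi.` is followed by `org/` rather than the `10.` prefix. Since `.` is in the suffix class, a sentence-ending period directly after the DOI is captured as part of it.
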
## Notes

- 2-column layout handling via the `xy_cut` reading order is critical for IEEE-style papers
- The DOI regex expects a `doi:` or `doi.` label followed by a DOI in the canonical `10.NNNN/suffix` form (4 to 9 digits after `10.`); as written, it does not match a bare `https://doi.org/10...` URL, because the label must be followed directly by optional whitespace and `10.`
- Authors extraction captures the author block verbatim; downstream parsing handles name decomposition
- References extraction is best-effort at v1.0 (a single text block from the References heading to the end)
- Math equations are handled by Phase 7 OpenType Math (the `structural.has_math` signal)

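To illustrate why an XY-cut reading order matters for 2-column pages, the sketch below implements a simplified recursive XY-cut over block bounding boxes. It is a generic version of the technique, not pdftract's implementation: the `Block` type, the `min_gap` parameter, and the choice to try column cuts before row cuts are assumptions made for this example, and y is assumed to grow downward.

```rust
/// A text block's bounding box. Assumption: y grows downward, so a smaller
/// `y0` is closer to the top of the page. `Block` is a hypothetical type.
#[derive(Debug, Clone, Copy)]
struct Block {
    id: usize,
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Finds the widest empty gap (at least `min_gap` wide) between 1-D intervals
/// and returns the coordinate in the middle of that gap.
fn widest_gap(mut intervals: Vec<(f32, f32)>, min_gap: f32) -> Option<f32> {
    intervals.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut reach = intervals.first()?.1; // furthest extent covered so far
    let mut best: Option<(f32, f32)> = None; // (gap width, cut position)
    for &(lo, hi) in &intervals[1..] {
        let gap = lo - reach;
        if gap >= min_gap && best.map_or(true, |(width, _)| gap > width) {
            best = Some((gap, reach + gap / 2.0));
        }
        reach = reach.max(hi);
    }
    best.map(|(_, cut)| cut)
}

/// Returns block ids in reading order by recursively cutting the page.
fn xy_cut(blocks: &[Block], min_gap: f32) -> Vec<usize> {
    if blocks.len() <= 1 {
        return blocks.iter().map(|b| b.id).collect();
    }
    // Try a vertical cut first so side-by-side columns are separated
    // before their rows are ordered: left part, then right part.
    let xs: Vec<(f32, f32)> = blocks.iter().map(|b| (b.x0, b.x1)).collect();
    if let Some(cut) = widest_gap(xs, min_gap) {
        let (left, right): (Vec<Block>, Vec<Block>) =
            blocks.iter().copied().partition(|b| b.x1 <= cut);
        let mut order = xy_cut(&left, min_gap);
        order.extend(xy_cut(&right, min_gap));
        return order;
    }
    // Otherwise try a horizontal cut: top part, then bottom part.
    let ys: Vec<(f32, f32)> = blocks.iter().map(|b| (b.y0, b.y1)).collect();
    if let Some(cut) = widest_gap(ys, min_gap) {
        let (top, bottom): (Vec<Block>, Vec<Block>) =
            blocks.iter().copied().partition(|b| b.y1 <= cut);
        let mut order = xy_cut(&top, min_gap);
        order.extend(xy_cut(&bottom, min_gap));
        return order;
    }
    // No usable gap on either axis: fall back to top-to-bottom, left-to-right.
    let mut rest = blocks.to_vec();
    rest.sort_by(|a, b| a.y0.total_cmp(&b.y0).then(a.x0.total_cmp(&b.x0)));
    rest.iter().map(|b| b.id).collect()
}
```

For a page with a full-width title above two columns, the title spans the gutter, so the first cut is horizontal and separates the title; the body is then split at the gutter and each column is read top to bottom. A plain sort by vertical position would instead interleave blocks from the two columns.
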
## TODO for Future

- [ ] Add arxiv_id field for arXiv-specific paper IDs
- [ ] Per-field accuracy testing once extraction implementation is complete
- [ ] Classifier corpus evaluation (50-paper subset) for precision/recall metrics