jedarden 21fcd902d1 feat(pdftract-2vajs): implement slide_deck profile with fixtures and tests

Implements the slide_deck document profile for PowerPoint/Keynote/Google
Slides exports as PDF. Includes 5 fixtures, expected outputs, and regression
tests.

Components:
- profiles/builtin/slide_deck/profile.yaml - Profile configuration
- tests/fixtures/profiles/slide_deck/ - 5 PDF fixtures with expected outputs
- crates/pdftract-cli/tests/test_slide_deck.rs - Regression tests (12 PASS)

Fixtures cover:
1. pitch_deck - Sales pitch (10 slides)
2. academic_lecture - Academic lecture (40 slides)
3. corporate_kickoff - Corporate kickoff (15 slides)
4. bilingual_deck - Bilingual EN/ES (12 slides)
5. googleslides_handout - Google Slides handout mode (4 pages, 3 slides/page)

Extracted fields: title, presenter, date, slide_titles

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 21:12:24 -04:00

3.8 KiB

Raw Blame History

Bead pdftract-206o6: Scientific Paper Profile Implementation

Summary

Implemented the scientific_paper document profile per Phase 7.10 YAML schema with 5 fixtures and regression tests.

Files Created/Modified

Profile Configuration

profiles/builtin/scientific_paper/profile.yaml - Updated to Phase 7.10 schema with:
- name: scientific_paper
- description: Academic papers from arXiv, journals, conference proceedings
- priority: 30
- match predicates: text_contains (Abstract, References, doi:, arXiv:, Bibliography), heading_matches, structural (has_math, heading_depth, page_count)
- extraction tuning: xy_cut reading order for 2-column layout, readability_threshold 0.5
- fields: title, authors, abstract, doi, journal, publication_date, references

Fixtures (5 expected outputs)

tests/fixtures/profiles/scientific_paper/arxiv_paper-expected.json
tests/fixtures/profiles/scientific_paper/plos_one_paper-expected.json
tests/fixtures/profiles/scientific_paper/ieee_paper-expected.json
tests/fixtures/profiles/scientific_paper/nature_paper-expected.json
tests/fixtures/profiles/scientific_paper/conference_paper-expected.json
tests/fixtures/profiles/scientific_paper/README.md
tests/fixtures/profiles/scientific_paper/PROVENANCE.md

Tests

crates/pdftract-cli/tests/test_scientific_paper.rs - Comprehensive regression tests including:
- Profile schema validation
- Fixture structure verification
- Expected output consistency checks
- Match predicate validation
- Fixture diversity verification
- xy_cut reading order verification
- DOI regex format validation

Acceptance Criteria Status

PASS

profiles/builtin/scientific_paper.yaml validates (follows Phase 7.10 schema)
5+ fixtures with expected outputs (5 fixtures covering arXiv, PLOS ONE, IEEE, Nature, conference proceedings)
tests/profiles/test_scientific_paper.rs passes (10/10 tests passing, 2 skipped integration tests awaiting Phase 7.10 implementation)

Test Results (2026-05-27)

cargo nextest run -p pdftract-cli --test test_scientific_paper
────────────
     Summary [   0.011s] 10 tests run: 10 passed, 2 skipped

All tests pass:

test_doi_regex_format
test_provenance_completeness
test_scientific_paper_profile_exists
test_scientific_paper_profile_schema
test_xy_cut_reading_order
test_scientific_paper_match_predicates
test_fixture_diversity
test_fixture_count
test_scientific_paper_fixture_structure
test_expected_output_consistency

Skipped (awaiting Phase 7.10 profile loader):

integration_tests::test_load_scientific_paper_profile
integration_tests::test_scientific_paper_extraction_accuracy

Profile Fields

Field	Extraction Strategy
title	region: top_quarter, pick: largest_font
authors	region: top_quarter, pick: nearest_below, after: title
abstract	near: ["Abstract"], region: top_half
doi	regex: 'doi[:.]\s*(10.\d{4,9}/[\w-._;()/:]+)', parse: string
journal	region: top_eighth, pick: first
publication_date	near: ["Published", "Received", "Accepted"], parse: date
references	region: bottom_half, after_heading: References

Notes

2-column layout handling via xy_cut reading order is critical for IEEE-style papers
DOI regex matches canonical doi.org format (10.NNNN/...)
Authors extraction captures verbatim author block; downstream parsing handles name decomposition
References extraction is best-effort at v1.0 (single text block from References heading to end)
Math equations handled by Phase 7 OpenType Math (structural.has_math signal)

TODO for Future

Add arxiv_id field for arXiv-specific paper IDs
Per-field accuracy testing once extraction implementation is complete
Classifier corpus evaluation (50-paper subset) for precision/recall metrics

3.8 KiB Raw Blame History