Implements the slide_deck document profile for PowerPoint/Keynote/Google Slides exports as PDF. Includes 5 fixtures, expected outputs, and regression tests.

Components:
- profiles/builtin/slide_deck/profile.yaml - Profile configuration
- tests/fixtures/profiles/slide_deck/ - 5 PDF fixtures with expected outputs
- crates/pdftract-cli/tests/test_slide_deck.rs - Regression tests (12 PASS)

Fixtures cover:
1. pitch_deck - Sales pitch (10 slides)
2. academic_lecture - Academic lecture (40 slides)
3. corporate_kickoff - Corporate kickoff (15 slides)
4. bilingual_deck - Bilingual EN/ES (12 slides)
5. googleslides_handout - Google Slides handout mode (4 pages, 3 slides/page)

Extracted fields: title, presenter, date, slide_titles

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Files in this directory:
- academic_lecture-expected.json
- academic_lecture.pdf
- bilingual_deck-expected.json
- bilingual_deck.pdf
- corporate_kickoff-expected.json
- corporate_kickoff.pdf
- googleslides_handout-expected.json
- googleslides_handout.pdf
- pitch_deck-expected.json
- pitch_deck.pdf
- PROVENANCE.md
- README.md
Slide Deck Profile Fixtures
This directory contains test fixtures for the slide deck document profile.
Fixture Types
- pitch_deck - Sales/product pitch deck (10 slides) with typical startup presentation structure
- academic_lecture - Academic lecture slides (40 slides) with technical content and Q&A slides
- corporate_kickoff - Corporate annual kickoff presentation (15 slides) with business metrics and roadmap
- bilingual_deck - Bilingual English/Spanish presentation (12 slides) testing multilingual extraction
- googleslides_handout - Google Slides handout mode export (3 slides per page, 4 pages total) testing multi-slide-per-page edge case
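Each fixture name above pairs with a PDF and an expected-output JSON file in this directory. As a minimal sketch of how a test might locate both files, assuming only the naming visible in the directory (the `FIXTURES` constant and the `fixture_paths` helper are hypothetical and not taken from test_slide_deck.rs):

```rust
use std::path::{Path, PathBuf};

/// Fixture names as listed in this README.
const FIXTURES: [&str; 5] = [
    "pitch_deck",
    "academic_lecture",
    "corporate_kickoff",
    "bilingual_deck",
    "googleslides_handout",
];

/// Hypothetical helper: returns (input PDF, expected JSON) for a fixture name,
/// following the `<name>.pdf` / `<name>-expected.json` naming in this directory.
fn fixture_paths(dir: &Path, name: &str) -> (PathBuf, PathBuf) {
    (
        dir.join(format!("{name}.pdf")),
        dir.join(format!("{name}-expected.json")),
    )
}
```

A regression test could iterate over `FIXTURES` and call `fixture_paths` once per name, so that adding a fixture only requires extending the list.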
Expected Output Format
Each fixture should have a corresponding *-expected.json file with the following structure:
{
  "metadata": {
    "document_type": "slide_deck",
    "document_type_confidence": 0.XX,
    "document_type_reasons": [...],
    "profile_name": "slide_deck",
    "profile_version": "1.0.0",
    "profile_fields": {
      "title": "...",
      "presenter": "...",
      "date": "YYYY-MM-DD",
      "slide_titles": [...]
    }
  }
}
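The structure above can be mirrored as plain Rust types. The following is a sketch only: the field types are assumptions (for example, that `document_type_reasons` is a list of strings, that confidence lies in 0.0 to 1.0, and that a field may be absent when extraction fails), and `check_invariants` and `is_iso_date` are hypothetical helpers rather than functions from pdftract.

```rust
/// Mirrors "profile_fields"; Option marks fields assumed to be absent on failure.
#[derive(Debug, Clone, PartialEq)]
struct ProfileFields {
    title: Option<String>,
    presenter: Option<String>,
    date: Option<String>, // expected as "YYYY-MM-DD"
    slide_titles: Vec<String>,
}

/// Mirrors the "metadata" object; types are assumptions, not the real schema.
#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    document_type: String,
    document_type_confidence: f64,
    document_type_reasons: Vec<String>,
    profile_name: String,
    profile_version: String,
    profile_fields: ProfileFields,
}

/// True when `s` has the shape DDDD-DD-DD (digits and dashes only; no calendar check).
fn is_iso_date(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 10
        && b.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            _ => c.is_ascii_digit(),
        })
}

/// Hypothetical sanity checks a test could run before comparing field values.
fn check_invariants(m: &Metadata) -> Result<(), String> {
    if m.document_type != "slide_deck" || m.profile_name != "slide_deck" {
        return Err("expected the slide_deck profile".to_string());
    }
    if !(0.0..=1.0).contains(&m.document_type_confidence) {
        return Err("confidence outside 0.0..=1.0".to_string());
    }
    if let Some(date) = &m.profile_fields.date {
        if !is_iso_date(date) {
            return Err(format!("date not in YYYY-MM-DD form: {date}"));
        }
    }
    Ok(())
}
```

Checking shape before values keeps failures easier to read: a malformed date is reported as such instead of as a generic mismatch against the expected file.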
Profile Fields
The slide deck profile extracts the following fields:
- title: Presentation title (region: middle_half, pick: largest_font)
- presenter: Presenter name (region: bottom_half, pick: largest_font)
- date: Presentation date (near: "Date", parse: date)
- slide_titles: Ordered list of slide titles (pick: largest_font, collected per page)
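To illustrate how a `region` filter combined with `pick: largest_font` could behave, here is a hedged sketch. The `Span` type, the normalized coordinate convention, and the region boundaries (middle_half as the vertical band from 0.25 to 0.75, bottom_half as 0.5 and below) are assumptions made for this example; the actual semantics are defined by profile.yaml and the extractor.

```rust
/// A piece of text on a page. `y_center` is normalized: 0.0 = top, 1.0 = bottom.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    text: String,
    font_size: f32,
    y_center: f32,
}

/// Vertical regions; the boundaries below are assumptions for illustration.
#[derive(Debug, Clone, Copy)]
enum Region {
    MiddleHalf,
    BottomHalf,
    WholePage,
}

impl Region {
    fn contains(self, y: f32) -> bool {
        match self {
            Region::MiddleHalf => (0.25..=0.75).contains(&y),
            Region::BottomHalf => y >= 0.5,
            Region::WholePage => true,
        }
    }
}

/// Returns the span with the largest font inside `region`, or None if the
/// region holds no spans. On ties, the last such span in input order wins.
fn pick_largest_font(spans: &[Span], region: Region) -> Option<&Span> {
    spans
        .iter()
        .filter(|s| region.contains(s.y_center))
        .max_by(|a, b| a.font_size.total_cmp(&b.font_size))
}
```

Under these assumptions, a large logo line at the top of a title slide is ignored for `title` because it falls outside the middle band, while `slide_titles` would apply the same pick over each page without a region filter.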
Known Limitations
- Multi-slide-per-page PDFs (handout mode): page_count no longer equals the slide count
- Slide titles are not extracted correctly from slides with image-based titles or icons
- Presenter extraction often fails when slides include logos or affiliation text containing names
- Non-English presentations may have reduced extraction accuracy
- Google Slides exports vary in structure depending on export settings
- Beamer (LaTeX) exports have very different structural signals from PowerPoint, Keynote, and Google Slides exports
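The handout-mode limitation comes down to simple arithmetic: the googleslides_handout fixture has 4 pages at 3 slides per page, so it can hold up to 12 slides while page_count reports 4. A small sketch of that relationship, using hypothetical helper names that are not part of pdftract:

```rust
/// Upper bound on the number of slides in a handout-mode export.
/// The last page may hold fewer slides, so this is a maximum, not an exact count.
fn max_slides(page_count: usize, slides_per_page: usize) -> usize {
    page_count * slides_per_page
}

/// Hypothetical guard: true only when the one-slide-per-page assumption holds,
/// i.e. the number of pages equals the number of slides.
fn page_count_matches_slides(page_count: usize, slide_count: usize) -> bool {
    page_count == slide_count
}
```

For the handout fixture, `max_slides(4, 3)` gives 12, and `page_count_matches_slides(4, 12)` is false, which is the condition under which per-page title collection no longer yields one title per slide.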
Provenance
All fixtures are sourced from synthetic templates created for testing purposes. See PROVENANCE.md for details on each fixture.