Implements the slide_deck document profile for PowerPoint/Keynote/Google Slides exports as PDF. Includes 5 fixtures, expected outputs, and regression tests.

Components:
- profiles/builtin/slide_deck/profile.yaml - Profile configuration
- tests/fixtures/profiles/slide_deck/ - 5 PDF fixtures with expected outputs
- crates/pdftract-cli/tests/test_slide_deck.rs - Regression tests (12 PASS)

Fixtures cover:
1. pitch_deck - Sales pitch (10 slides)
2. academic_lecture - Academic lecture (40 slides)
3. corporate_kickoff - Corporate kickoff (15 slides)
4. bilingual_deck - Bilingual EN/ES (12 slides)
5. googleslides_handout - Google Slides handout mode (4 pages, 3 slides/page)

Extracted fields: title, presenter, date, slide_titles

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

# Slide Deck Profile Fixtures

This directory contains test fixtures for the slide deck document profile.

## Fixture Types

1. **pitch_deck** - Sales/product pitch deck (10 slides) with a typical startup presentation structure
2. **academic_lecture** - Academic lecture slides (40 slides) with technical content and Q&A slides
3. **corporate_kickoff** - Corporate annual kickoff presentation (15 slides) with business metrics and a roadmap
4. **bilingual_deck** - Bilingual English/Spanish presentation (12 slides) testing multilingual extraction
5. **googleslides_handout** - Google Slides handout mode export (3 slides per page, 4 pages total) testing the multi-slide-per-page edge case

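The fixture list above can be summarised as a small table in code. This is a minimal sketch, not the project's test code: `FixtureInfo`, `FIXTURES`, `max_slides`, and `is_handout` are hypothetical names, and it assumes one slide per page for every fixture except the handout export. For the handout fixture, pages times slides per page is only an upper bound, since the last page may be partly empty.

```rust
/// Illustrative description of one fixture (hypothetical type, not from the project).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FixtureInfo {
    pub name: &'static str,
    /// Number of PDF pages in the fixture.
    pub pages: u32,
    /// Slides laid out on each page; 1 is assumed for normal exports.
    pub slides_per_page: u32,
}

impl FixtureInfo {
    /// Upper bound on the slide count: the last handout page may not be full.
    pub const fn max_slides(&self) -> u32 {
        self.pages * self.slides_per_page
    }

    /// True when page count and slide count can differ.
    pub const fn is_handout(&self) -> bool {
        self.slides_per_page > 1
    }
}

/// The five fixtures listed in this README.
pub const FIXTURES: [FixtureInfo; 5] = [
    FixtureInfo { name: "pitch_deck", pages: 10, slides_per_page: 1 },
    FixtureInfo { name: "academic_lecture", pages: 40, slides_per_page: 1 },
    FixtureInfo { name: "corporate_kickoff", pages: 15, slides_per_page: 1 },
    FixtureInfo { name: "bilingual_deck", pages: 12, slides_per_page: 1 },
    // Handout mode: 4 pages with 3 slides on each page.
    FixtureInfo { name: "googleslides_handout", pages: 4, slides_per_page: 3 },
];
```

A regression test could use `is_handout` to skip any assertion that compares the page count with the slide count.
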
## Expected Output Format

Each fixture should have a corresponding `*-expected.json` file with the following structure:

```json
{
  "metadata": {
    "document_type": "slide_deck",
    "document_type_confidence": 0.XX,
    "document_type_reasons": [...],
    "profile_name": "slide_deck",
    "profile_version": "1.0.0",
    "profile_fields": {
      "title": "...",
      "presenter": "...",
      "date": "YYYY-MM-DD",
      "slide_titles": [...]
    }
  }
}
```

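The naming convention and the `date` format above can be sketched with two small helpers. Both are hypothetical and use only the standard library: `expected_json_path` assumes that each fixture is a file named `<name>.pdf` and that the `*` in `*-expected.json` is that same name, which this README does not state explicitly; `has_iso_date_shape` only checks the `YYYY-MM-DD` layout, not whether the date is a valid calendar date.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps `dir/pitch_deck.pdf` to `dir/pitch_deck-expected.json`.
/// Returns `None` if the path has no file stem or the stem is not valid UTF-8.
pub fn expected_json_path(fixture_pdf: &Path) -> Option<PathBuf> {
    let stem = fixture_pdf.file_stem()?.to_str()?;
    Some(fixture_pdf.with_file_name(format!("{}-expected.json", stem)))
}

/// Checks that a string has the `YYYY-MM-DD` shape: ten ASCII characters,
/// hyphens at positions 4 and 7, digits everywhere else.
pub fn has_iso_date_shape(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.len() == 10
        && bytes.iter().enumerate().all(|(i, c)| match i {
            4 | 7 => *c == b'-',
            _ => c.is_ascii_digit(),
        })
}
```

A test harness could call `expected_json_path` for each fixture PDF and fail early when the expected file is missing, before comparing any fields.
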
## Profile Fields

The slide deck profile extracts the following fields:

- **title**: Presentation title (region: middle_half, pick: largest_font)
- **presenter**: Presenter name (region: bottom_half, pick: largest_font)
- **date**: Presentation date (near: "Date", parse: date)
- **slide_titles**: Ordered list of slide titles (pick: largest_font, collected per page)

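The `region` plus `pick: largest_font` rules above can be illustrated with a short sketch. It is not the profile engine's implementation: `TextSpan`, `Region`, and `pick_largest_font` are hypothetical names, and the region boundaries (middle half as the 25% to 75% band of page height, bottom half as everything below 50%) are assumptions about what `middle_half` and `bottom_half` mean.

```rust
/// A piece of extracted text with its font size and vertical position (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct TextSpan {
    pub text: String,
    pub font_size: f32,
    /// Vertical centre as a fraction of page height: 0.0 is the top, 1.0 the bottom.
    pub y_frac: f32,
}

/// Page regions, with boundaries assumed for this sketch.
#[derive(Debug, Clone, Copy)]
pub enum Region {
    MiddleHalf,
    BottomHalf,
    WholePage,
}

impl Region {
    fn contains(self, y: f32) -> bool {
        match self {
            Region::MiddleHalf => (0.25..=0.75).contains(&y),
            Region::BottomHalf => y >= 0.5,
            Region::WholePage => true,
        }
    }
}

/// Returns the span with the largest font size inside `region`, if any.
pub fn pick_largest_font(spans: &[TextSpan], region: Region) -> Option<&TextSpan> {
    spans
        .iter()
        .filter(|s| region.contains(s.y_frac))
        .max_by(|a, b| a.font_size.total_cmp(&b.font_size))
}
```

Under these assumptions, a large logo caption in the lower part of a title slide would win the `bottom_half` pick over a smaller presenter name, which is consistent with the presenter limitation noted in this README.
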
## Known Limitations

- Multi-slide-per-page PDFs (handout mode): `page_count` no longer equals the slide count
- Slides whose titles are images or icons do not yield correct slide titles
- Presenter extraction often fails when slides include logos or affiliations with names
- Non-English presentations may have reduced extraction accuracy
- Google Slides exports vary in structure depending on export settings
- Beamer (LaTeX) exports have very different structural signals

## Provenance

All fixtures are sourced from synthetic templates created for testing purposes. See PROVENANCE.md for details on each fixture.