| Name | | |
|---|---|---|
| arxiv_paper-expected.json | ||
| arxiv_paper.pdf | ||
| conference_paper-expected.json | ||
| conference_paper.pdf | ||
| ieee_paper-expected.json | ||
| ieee_paper.pdf | ||
| nature_paper-expected.json | ||
| nature_paper.pdf | ||
| plos_one_paper-expected.json | ||
| plos_one_paper.pdf | ||
| PROVENANCE.md | ||
| README.md | ||
Scientific Paper Profile Fixtures
This directory contains test fixtures for the scientific paper document profile.
Fixture Types
- arxiv_paper - arXiv preprint with CC-BY license, typical academic structure with Abstract, Introduction, Methods, Results, Discussion, References
- plos_one_paper - PLOS ONE journal article with DOI, open access formatting, single-column layout
- ieee_paper - IEEE-style 2-column journal article with mathematical equations, numbered references
- nature_paper - Nature-style single-column article with sidebar layout, Received/Accepted dates
- conference_paper - ACM/IEEE conference proceedings with DOI, author affiliations, structured references
Expected Output Format
Each fixture should have a corresponding *-expected.json file with the following structure:
    {
      "metadata": {
        "document_type": "scientific_paper",
        "document_type_confidence": 0.XX,
        "document_type_reasons": [...],
        "profile_name": "scientific_paper",
        "profile_version": "1.0.0",
        "profile_fields": {
          "title": "...",
          "authors": ["..."],
          "abstract": "...",
          "doi": "...",
          "journal": "...",
          "publication_date": "YYYY-MM-DD",
          "references": "..."
        }
      }
    }
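A quick structural check can catch a malformed expected file before it reaches the regression tests. The sketch below is illustrative only: `missing_keys` is a hypothetical helper, not part of the project, and it assumes that every key shown in the template above is required in every fixture, which may not hold (a preprint may have no journal, for example).

```python
import json

# Keys taken from the template above. Treating all of them as mandatory
# is an assumption made for this sketch.
REQUIRED_METADATA = (
    "document_type",
    "document_type_confidence",
    "document_type_reasons",
    "profile_name",
    "profile_version",
    "profile_fields",
)
REQUIRED_FIELDS = (
    "title",
    "authors",
    "abstract",
    "doi",
    "journal",
    "publication_date",
    "references",
)


def missing_keys(doc):
    """Return dotted paths of template keys that are absent from doc."""
    metadata = doc.get("metadata")
    if not isinstance(metadata, dict):
        return ["metadata"]
    missing = [f"metadata.{k}" for k in REQUIRED_METADATA if k not in metadata]
    fields = metadata.get("profile_fields")
    if isinstance(fields, dict):
        missing += [
            f"metadata.profile_fields.{k}"
            for k in REQUIRED_FIELDS
            if k not in fields
        ]
    return missing


# A deliberately incomplete document, parsed from JSON text.
sample = json.loads(
    '{"metadata": {"document_type": "scientific_paper",'
    ' "profile_fields": {"title": "T"}}}'
)
problems = missing_keys(sample)
```

For a real fixture, the same function could be applied to the parsed contents of a file such as arxiv_paper-expected.json; an empty list means every template key is present.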
Profile Fields
The scientific paper profile extracts the following fields:
- title: Paper title (region: top_quarter, pick: largest_font)
- authors: Author list (region: top_quarter, pick: nearest_below)
- abstract: Abstract text (near: "Abstract", region: top_half)
- doi: Digital Object Identifier (regex match)
- journal: Journal or publication name (region: top_eighth)
- publication_date: Publication date (near: "Published", "Received", "Accepted")
- references: References section (region: bottom_half, after "References" heading)
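To make the region and pick rules above more concrete, here is a minimal sketch of how two of them might behave. Everything in it is an assumption for illustration: the `Span` type, the reading of region names as fractions of page height measured from the top, and the DOI pattern (a commonly used one, not necessarily the regex the profile ships with).

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text span; not the extractor's real data model."""
    text: str
    y: float          # distance of the span's top edge from the page top
    font_size: float


# Assumed interpretation of region names as a fraction of page height.
REGION_FRACTION = {"top_eighth": 0.125, "top_quarter": 0.25, "top_half": 0.5}


def pick_largest_font(spans, page_height, region):
    """Return the text of the largest-font span inside the region, or None."""
    limit = page_height * REGION_FRACTION[region]
    candidates = [s for s in spans if s.y <= limit]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.font_size).text


# Widely used approximation of DOI syntax; the profile's own regex may differ.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_doi(text):
    """Return the first DOI-like string in text, trimming trailing punctuation."""
    match = DOI_RE.search(text)
    return match.group(0).rstrip(".,;") if match else None


spans = [
    Span("Journal of Examples", y=40, font_size=9),
    Span("A Study of Something", y=120, font_size=18),
    Span("Results", y=500, font_size=24),
]
title = pick_largest_font(spans, page_height=800, region="top_quarter")
doi = find_doi("Available at https://doi.org/10.1234/abc.def-5.")
```

In this toy page, the "Results" heading has the largest font overall but lies outside the top quarter, so it is not considered for the title.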
Provenance
All fixtures should be sourced from publicly available academic papers with appropriate licenses or created synthetically with clear provenance documentation. See PROVENANCE.md for details on each fixture.
TODO
- Acquire or create PDF files for each fixture type
- Validate extraction accuracy against expected outputs
- Document extraction limitations (e.g., 3-column layouts, unusual author formats)
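For the validation item, one possible approach is a field-by-field comparison of `profile_fields` between the expected file and the extractor's output. The sketch below assumes the actual output has the same JSON shape as the expected file, and that whitespace and case differences should not count as mismatches; `compare_fields` is a hypothetical helper, not an existing project function.

```python
def _normalize(value):
    """Collapse whitespace and case in strings (recursively for lists)."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    if isinstance(value, list):
        return [_normalize(v) for v in value]
    return value


def compare_fields(expected, actual):
    """Return (accuracy, mismatches) over the expected profile_fields.

    mismatches maps each differing field name to (expected, actual).
    """
    want = expected["metadata"]["profile_fields"]
    got = actual.get("metadata", {}).get("profile_fields", {})
    mismatches = {
        name: (value, got.get(name))
        for name, value in want.items()
        if _normalize(value) != _normalize(got.get(name))
    }
    accuracy = 1 - len(mismatches) / len(want) if want else 1.0
    return accuracy, mismatches


expected = {"metadata": {"profile_fields": {
    "title": "A  Study of Something",
    "authors": ["Ann Lee"],
    "doi": "10.1234/abc",
}}}
actual = {"metadata": {"profile_fields": {
    "title": "a study of something",
    "authors": ["Ann Lee"],
    "doi": "10.1234/abd",
}}}
accuracy, mismatches = compare_fields(expected, actual)
```

Exact matching after normalization is a strict choice; long fields such as abstract or references might be better served by a similarity threshold, which this sketch does not attempt.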