pdftract/profiles/builtin/slide_deck
jedarden 25ddcba641 docs(pdftract-4iier): complete per-profile README documentation
Complete the per-profile README documentation for all 9 built-in profiles:

- slide_deck: Add Known Limitations section
- form: Add Match Criteria Summary and Known Limitations
- bank_statement: Add Match Criteria Summary and Known Limitations
- legal_filing: Add Match Criteria Summary and Known Limitations
- book_chapter: Add Match Criteria Summary and Known Limitations

The xtask doc-profile skeleton generator already existed and provides
automated README generation from profile.yaml files.

All READMEs now follow the consistent 6-section structure:
1. Title and description
2. Match Criteria Summary (prose description)
3. Extracted Fields (table with field details)
4. Known Limitations (document-specific edge cases)
5. Sample Input Pointer (fixture references)
6. Configuration Tips (override instructions)

Acceptance criteria:
- All nine README files exist at profiles/builtin/<type>/README.md
- Each follows the consistent 6-section structure
- Extracted Fields tables match the corresponding profile YAML
- Known Limitations is non-empty and document-specific
- Sample Input Pointer links to actual fixtures
- xtask doc-profile skeleton generator exists

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-05-18 00:19:44 -04:00
profile.yaml docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles 2026-05-17 23:19:00 -04:00
README.md docs(pdftract-4iier): complete per-profile README documentation 2026-05-18 00:19:44 -04:00

SLIDE_DECK Profile

Presentation slides with title, presenter, date, and slide titles

Match Criteria Summary

This profile matches presentation slides exported to PDF. Documents typically exhibit:

  • Landscape orientation: Slides are almost always landscape (4:3 or 16:9 aspect ratio)
  • Large centered text: Title slides have large, centered text
  • Multiple pages: At least 3 pages; slide decks often run 10-200 pages
  • Slide numbering: Labels such as "Slide 1", "Slide 2", or a table of contents

This is a degenerate profile with minimal field extraction (title, presenter, date, slide titles) because slide-deck PDFs vary enormously depending on the presentation software and exporter.
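The page-level criteria above (landscape orientation, a 4:3 or 16:9 aspect ratio, and at least 3 pages) can be illustrated with a minimal Python sketch. This is not pdftract's classifier: the function name `looks_like_slide_deck`, the tolerance, and the 90% landscape threshold are assumptions chosen for illustration, and page sizes are assumed to arrive as (width, height) pairs in points.

```python
from typing import List, Tuple

# Aspect ratios named in the match criteria: 4:3 and 16:9.
COMMON_RATIOS = (4 / 3, 16 / 9)


def looks_like_slide_deck(
    page_sizes: List[Tuple[float, float]],
    min_pages: int = 3,            # "at least 3 pages" from the criteria
    ratio_tolerance: float = 0.05,  # assumed tolerance, not from the profile
    min_fraction: float = 0.9,      # assumed: "almost always" landscape
) -> bool:
    """Hypothetical check: are most pages landscape with a slide-like ratio?"""
    if len(page_sizes) < min_pages:
        return False

    slide_like = 0
    for width, height in page_sizes:
        if height <= 0 or width <= height:
            continue  # portrait, square, or malformed page
        ratio = width / height
        if any(abs(ratio - target) <= ratio_tolerance for target in COMMON_RATIOS):
            slide_like += 1

    return slide_like / len(page_sizes) >= min_fraction
```

A check like this says nothing about text size or slide numbering, which would need the page text as well as the page geometry.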

Extracted Fields

| Field | Type | Description | Example Value | Source Hint |
|-------|------|-------------|---------------|-------------|
| date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns, region: first_page_bottom |
| presenter | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: first_page_below_title |
| slide_titles | array | Extracted from page text using pattern matching | [...] | regex patterns, region: top_left_or_centre, per-page |
| title | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: first_page_centre |
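As a rough illustration of per-page extraction for slide_titles, the following Python sketch takes the text already cropped from each page's title region and keeps the first non-empty line. It is a sketch under stated assumptions, not the profile's actual patterns: the helper names and the rule of skipping bare "Slide N" labels are hypothetical.

```python
import re
from typing import List, Optional

# Assumed rule: a line that is only a label like "Slide 3" is not a title.
SLIDE_LABEL = re.compile(r"^\s*slide\s+\d+\s*$", re.IGNORECASE)


def first_title_line(region_text: str) -> Optional[str]:
    """Return the first non-empty line that is not a bare slide label."""
    for line in region_text.splitlines():
        candidate = line.strip()
        if candidate and not SLIDE_LABEL.match(candidate):
            return candidate
    return None


def extract_slide_titles(region_texts: List[str]) -> List[str]:
    """Collect one title per page; pages with no usable text are skipped."""
    titles = []
    for text in region_texts:
        title = first_title_line(text)
        if title is not None:
            titles.append(title)
    return titles
```

Pages whose title region contains no text contribute nothing to the result, so the returned list can be shorter than the page count.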

Known Limitations

This section documents known edge cases and failure modes. Contributions to improve extraction quality are welcome.

  • Exporter variability: Slide-deck PDFs vary enormously depending on the presentation software (PowerPoint, Keynote, Google Slides) and PDF exporter; extraction quality depends heavily on how text was converted to PDF
  • Image-heavy slides: Slides with minimal text (e.g., photo slides, diagrams) will not produce meaningful slide_titles
  • Non-standard layouts: Slides without clear title regions (e.g., all-centered layouts, artistic templates) may not yield correct slide_titles
  • Presenter extraction: Assumes the presenter name appears below the title on the first slide; other layouts (e.g., a title slide with no presenter line) leave this field empty
  • Date parsing: Date extraction from the first-page footer may fail if the presentation date is in a non-standard format
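The date-parsing limitation can be made concrete with a small Python sketch. It assumes the footer text has already been extracted and recognizes only two formats (ISO `YYYY-MM-DD` and an English "Month D, YYYY"); both patterns and the function name `parse_footer_date` are illustrative assumptions, not the patterns shipped in profile.yaml.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumed patterns for illustration only.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LONG_DATE = re.compile(r"\b([A-Z][a-z]+ \d{1,2}, \d{4})\b")


def parse_footer_date(footer_text: str) -> Optional[date]:
    """Return the first recognizable date in the footer text, or None."""
    match = ISO_DATE.search(footer_text)
    if match:
        try:
            year, month, day = (int(part) for part in match.groups())
            return date(year, month, day)
        except ValueError:
            pass  # digits matched but do not form a real calendar date

    match = LONG_DATE.search(footer_text)
    if match:
        try:
            # %B expects English month names under the default C locale.
            return datetime.strptime(match.group(1), "%B %d, %Y").date()
        except ValueError:
            pass  # capitalized word was not a month name

    return None
```

Any footer that uses a format outside the recognized patterns, such as a fiscal-quarter label, returns `None`, which is the failure mode described above.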

Sample Input

Example fixtures demonstrating this profile are available in tests/fixtures/profiles/slide_deck/.

See the classifier corpus for representative documents.

Configuration Tips

To override this profile:

```shell
pdftract profiles export slide_deck > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
```

This README was auto-generated from profile.yaml. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.