jedarden 25ddcba641 docs(pdftract-4iier): complete per-profile README documentation

Complete the per-profile README documentation for all 9 built-in profiles:

- slide_deck: Add Known Limitations section
- form: Add Match Criteria Summary and Known Limitations
- bank_statement: Add Match Criteria Summary and Known Limitations
- legal_filing: Add Match Criteria Summary and Known Limitations
- book_chapter: Add Match Criteria Summary and Known Limitations

The xtask doc-profile skeleton generator already existed and provides
automated README generation from profile.yaml files.

All READMEs now follow the consistent 6-section structure:
1. Title and description
2. Match Criteria Summary (prose description)
3. Extracted Fields (table with field details)
4. Known Limitations (document-specific edge cases)
5. Sample Input Pointer (fixture references)
6. Configuration Tips (override instructions)

Acceptance criteria:
- All nine README files exist at profiles/builtin/<type>/README.md
- Each follows the consistent 6-section structure
- Extracted Fields tables match the corresponding profile YAML
- Known Limitations is non-empty and document-specific
- Sample Input Pointer links to actual fixtures
- xtask doc-profile skeleton generator exists

Co-Authored-By: Claude Code <noreply@anthropic.com>

2026-05-18 00:19:44 -04:00

profile.yaml

docs(pdftract-4iier): add per-profile README documentation for all 9 built-in profiles

2026-05-17 23:19:00 -04:00

README.md

docs(pdftract-4iier): complete per-profile README documentation

2026-05-18 00:19:44 -04:00

README.md

SLIDE_DECK Profile

Presentation slides with title, presenter, date, and slide titles

Match Criteria Summary

This profile matches presentation slides exported to PDF. Documents typically exhibit:

Landscape orientation: Slides are almost always landscape (4:3 or 16:9 aspect ratio)
Large centered text: Title slides set the title in large, centered type
Multiple pages: At least 3 pages; slide decks often run 10-200 pages
Slide numbering: Pages carry labels such as "Slide 1", "Slide 2", or the deck includes a table of contents

This is a degenerate profile with minimal field extraction (title, presenter, date, slide titles) because slide-deck PDFs vary enormously depending on the presentation software and exporter.
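
The criteria above can be sketched as a simple heuristic. This is an illustrative Python sketch only, not pdftract's matcher: the Page type, the function names, and the 0.9 landscape cutoff are hypothetical, chosen to mirror the prose (landscape pages, at least 3 pages, optional "Slide N" labels).

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical page summary: size in points plus extracted text."""
    width: float
    height: float
    text: str


# Matches labels such as "Slide 1" or "slide 12" (assumed label format).
SLIDE_LABEL = re.compile(r"\bslide\s+\d+\b", re.IGNORECASE)


def is_landscape(page: Page) -> bool:
    """A page is landscape when it is wider than it is tall."""
    return page.width > page.height


def looks_like_slide_deck(pages: list[Page], min_pages: int = 3) -> bool:
    """Return True when the page list resembles an exported slide deck."""
    if len(pages) < min_pages:
        return False
    # Require most pages to be landscape; the 0.9 cutoff is an assumption.
    landscape_share = sum(is_landscape(p) for p in pages) / len(pages)
    return landscape_share >= 0.9


def count_slide_labels(pages: list[Page]) -> int:
    """Count pages carrying a "Slide N" label (a supporting signal only)."""
    return sum(1 for p in pages if SLIDE_LABEL.search(p.text))
```

Slide labels are counted separately rather than required, since the prose lists them as a typical trait and not every exporter emits them.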

Extracted Fields

Field	Type	Description	Example Value	Source Hint
date	date	Extracted from page text using pattern matching	2024-01-15	regex patterns, region: first_page_bottom
presenter	string	Extracted from page text using pattern matching	"example value"	regex patterns, region: first_page_below_title
slide_titles	array	Extracted from page text using pattern matching	[...]	regex patterns, region: top_left_or_centre, per-page
title	string	Extracted from page text using pattern matching	"example value"	regex patterns, region: first_page_centre
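
The source hints above mark slide_titles as a per-page field. As a rough illustration of what per-page extraction means, the Python sketch below takes the first non-empty line of each page's text as that page's title. This is an assumption for illustration only: the profile's actual regex patterns and region logic live in profile.yaml and are not reproduced here, and first_line_titles is a hypothetical helper.

```python
def first_line_titles(page_texts: list[str]) -> list[str]:
    """Collect one candidate title per page: its first non-empty line.

    Pages with no text at all contribute nothing to the result.
    """
    titles = []
    for text in page_texts:
        for line in text.splitlines():
            stripped = line.strip()
            if stripped:
                titles.append(stripped)
                break  # only the first non-empty line counts as the title
    return titles
```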

Known Limitations

This section documents known edge cases and failure modes. Contributions to improve extraction quality are welcome.

Exporter variability: Slide-deck PDFs vary enormously depending on the presentation software (PowerPoint, Keynote, Google Slides) and PDF exporter; extraction quality depends heavily on how text was converted to PDF
Image-heavy slides: Slides with minimal text (e.g., photo slides, diagrams) will not produce meaningful slide_titles
Non-standard layouts: Slides without clear title regions (e.g., all-center layouts, artistic templates) may not extract slide_titles correctly
Presenter extraction: Assumes the presenter name appears below the title on the first slide; decks that deviate from this layout (e.g., a title slide with no presenter) will leave this field empty
Date parsing: Date extraction from the first-page footer may fail if the presentation date is in a non-standard format
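
To illustrate the date-parsing limitation, here is a minimal Python sketch that assumes the footer date is matched with an ISO-style pattern (YYYY-MM-DD, as in the example value 2024-01-15). The pattern and the parse_footer_date helper are hypothetical; the profile's real patterns are defined in profile.yaml.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: four-digit year, two-digit month, two-digit day.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def parse_footer_date(footer_text: str) -> Optional[date]:
    """Return the first ISO-style date in the footer, or None if absent."""
    match = ISO_DATE.search(footer_text)
    if match is None:
        return None  # non-standard formats fall through here
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # digits matched but do not form a real calendar date
```

Under this assumed pattern, a footer such as "Acme Corp | 2024-01-15" yields a date, while "15th of Jan, '24" yields None, which is the kind of failure the limitation describes.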

Sample Input

Example fixtures demonstrating this profile are available in tests/fixtures/profiles/slide_deck/.

See the classifier corpus for representative documents.

Configuration Tips

To override this profile:

pdftract profiles export slide_deck > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf

This README was auto-generated from profile.yaml. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.