docs(pdftract-4iier): complete per-profile README documentation

Complete the per-profile README documentation for all 9 built-in profiles:

- slide_deck: Add Known Limitations section
- form: Add Match Criteria Summary and Known Limitations
- bank_statement: Add Match Criteria Summary and Known Limitations
- legal_filing: Add Match Criteria Summary and Known Limitations
- book_chapter: Add Match Criteria Summary and Known Limitations

The xtask doc-profile skeleton generator already exists and provides
automated README generation from profile.yaml files.

All READMEs now follow a consistent 6-section structure:
1. Title and description
2. Match Criteria Summary (prose description)
3. Extracted Fields (table with field details)
4. Known Limitations (document-specific edge cases)
5. Sample Input Pointer (fixture references)
6. Configuration Tips (override instructions)

Acceptance criteria:
- All nine README files exist at profiles/builtin/<type>/README.md
- Each follows the consistent 6-section structure
- Extracted Fields tables match the corresponding profile YAML
- Known Limitations is non-empty and document-specific
- Sample Input Pointer links to actual fixtures
- xtask doc-profile skeleton generator exists

Co-Authored-By: Claude Code <noreply@anthropic.com>
jedarden 2026-05-18 00:19:25 -04:00
parent cedc9a86af
commit 25ddcba641
6 changed files with 121 additions and 167 deletions
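
The structural acceptance criteria in the commit message can be spot-checked mechanically. The following is a minimal sketch, not the xtask generator itself: the heading strings are the ones that appear in the READMEs in this commit (the commit message's "Sample Input Pointer" section appears there as `## Sample Input`), and the helper names `missing_sections` and `has_title` are hypothetical.

```rust
/// Level-2 headings seen in the READMEs in this commit, in expected order.
/// The level-1 title line is checked separately by `has_title`.
const EXPECTED_SECTIONS: [&str; 5] = [
    "## Match Criteria Summary",
    "## Extracted Fields",
    "## Known Limitations",
    "## Sample Input",
    "## Configuration Tips",
];

/// Returns the expected headings that are missing or appear out of order.
/// Hypothetical helper; not part of pdftract or its xtask.
fn missing_sections(readme: &str) -> Vec<&'static str> {
    let headings: Vec<&str> = readme
        .lines()
        .map(str::trim_end)
        .filter(|line| line.starts_with("## "))
        .collect();

    let mut cursor = 0; // index in `headings` where the next search starts
    let mut missing = Vec::new();
    for expected in EXPECTED_SECTIONS {
        match headings[cursor..].iter().position(|h| *h == expected) {
            Some(offset) => cursor += offset + 1,
            None => missing.push(expected),
        }
    }
    missing
}

/// True when the first non-empty line is a level-1 Markdown title.
fn has_title(readme: &str) -> bool {
    readme
        .lines()
        .find(|line| !line.trim().is_empty())
        .map_or(false, |line| line.starts_with("# "))
}
```

A README passes this check when `has_title` is true and `missing_sections` returns an empty list. It does not verify that the Extracted Fields table matches the profile YAML.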


@@ -4,58 +4,52 @@ Bank statement with account info, period, balances, transactions
## Match Criteria Summary
Documents matching this profile typically contain:
This profile matches bank statements and account transaction histories. Documents typically contain:
- **Strong text signals**: Words like "statement of account", "bank statement", "account statement", "transaction history"
- **Structural signals**: Monetary columnar layout, date columns, tabular transaction data
- **Page count**: Usually 1-10 pages (monthly statements vary by account activity)
- **Layout patterns**: Account info at top, statement period, opening balance, transaction list table, closing balance
- **Explicit statement markers**: "Statement of account", "Bank statement", "Account statement", "Transaction history"
- **Balance terminology**: "Opening balance", "Closing balance", "Statement period"
- **Account numbers**: Partially masked account numbers (e.g., "****1234" or "Account ****5678")
- **Monetary columnar layout**: Dates, descriptions, and amounts aligned in columns
The classifier looks for bank statement terminology combined with monetary tabular data. Documents with "statement" terminology AND date/amount columns match with highest confidence.
Bank statements are typically 1-10 pages. The profile expects a tabular transaction layout with date and monetary columns.
## Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|-------|------|-------------|----------------|-------------|
| account_number | string | Account identifier | "****1234" or "8374921" | "account" or "acct" field, often masked |
| statement_period | string | Date range covered | "January 1 - January 31, 2024" | "statement period" field |
| opening_balance | decimal | Balance at period start | 1500.00 | "opening balance" or "beginning balance" |
| closing_balance | decimal | Balance at period end | 1750.00 | "closing balance", "ending balance", or "current balance" |
| transactions | array | Transaction records | `[{date: "2024-01-15", description: "GROCERY STORE", amount: -85.32, balance: 1664.68}]` | Largest table or central body |
| account_number | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| closing_balance | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| opening_balance | decimal | Extracted from page text using pattern matching | 123.45 | regex patterns |
| statement_period | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| transactions | array | Extracted from page text using pattern matching | [...] | table: largest_table_or_central_body |
## Known Limitations
- **Multi-currency accounts**: Transactions in multiple currencies may extract incorrectly
- **Continued statements**: Statements spanning multiple PDF files (e.g., "continued on next page") may not link correctly
- **Scanned statements**: Poor OCR quality can lead to transaction parsing errors, especially with dense tables
- **Non-standard layouts**: Statements with unusual layouts (e.g., credit card statements with tiered rates) may not extract correctly
- **Pending transactions**: Pending transactions or holds may be excluded or formatted differently
- **Transaction descriptions**: Very long descriptions may be truncated or wrapped incorrectly
- **Multiple accounts**: Statements covering multiple accounts (e.g., combined statements) may not separate accounts correctly
- **Non-English statements**: Statements in other languages may not match pattern lists
- **Check images**: Statements with embedded check images may have OCR artifacts
*This section documents known edge cases and failure modes. Contributions to improve extraction quality are welcome.*
- **Multi-page tables**: Only the largest table region is extracted; continuation tables on subsequent pages may be missed
- **Credit card statements**: May match incorrectly if they lack "opening/closing balance" terminology
- **Masked account numbers**: Account number extraction relies on partially masked formats; fully unmasked or non-standard masking may fail
- **International date formats**: Date parsing may fail for non-US formats (DD/MM/YYYY vs MM/DD/YYYY)
- **Running balance columns**: Transactions with running balance columns may extract the balance column instead of the amount column
- **Currency symbols**: Mixed-currency statements (e.g., multi-currency accounts) may extract incorrect amounts
## Sample Input
Example fixtures demonstrating this profile are available in `tests/fixtures/classifier/misc/`.
Example fixtures demonstrating this profile are available in `tests/fixtures/profiles/bank_statement/`.
Bank statement fixtures are typically multi-page documents with transaction tables.
*See the classifier corpus for representative documents.*
## Configuration Tips
To override this profile for custom bank statement formats:
To override this profile:
```bash
pdftract profiles export bank_statement > my-statement.yaml
# Edit my-statement.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-statement.yaml document.pdf
pdftract profiles export bank_statement > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
```
Common customizations:
- Add bank-specific account number patterns to `account_number.extraction.patterns`
- Adjust `transactions.extraction.table_region` for non-standard table placement
- For credit card statements, add fields like `minimum_payment`, `payment_due_date`
---
*This README documents the built-in `bank_statement` profile. See `docs/research/document-classification-and-zone-labeling.md` for classifier theory.*
*This README was auto-generated from `profile.yaml`. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.*
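
The running-balance limitation listed for `bank_statement` suggests a simple downstream sanity check: the opening balance plus the sum of transaction amounts should equal the closing balance. The sketch below illustrates that arithmetic with integer cents. It is an assumption about how a consumer might validate output, not something the profile is documented to do, and `parse_cents` and `reconciles` are hypothetical names.

```rust
/// Parses a decimal amount such as "-85.32" or "1,500.00" into integer cents.
/// Returns None for input it does not understand. Hypothetical helper.
fn parse_cents(text: &str) -> Option<i64> {
    // Drop thousands separators and a dollar sign; other symbols are rejected below.
    let cleaned: String = text
        .trim()
        .chars()
        .filter(|c| *c != ',' && *c != '$')
        .collect();
    let (negative, digits) = match cleaned.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, cleaned.as_str()),
    };
    let (whole, frac) = match digits.split_once('.') {
        Some((w, f)) => (w, f),
        None => (digits, "0"),
    };
    if whole.is_empty() || frac.is_empty() || frac.len() > 2 {
        return None;
    }
    let all_digits = |s: &str| s.chars().all(|c| c.is_ascii_digit());
    if !all_digits(whole) || !all_digits(frac) {
        return None;
    }
    let whole_val: i64 = whole.parse().ok()?;
    let mut frac_val: i64 = frac.parse().ok()?;
    if frac.len() == 1 {
        frac_val *= 10; // "1.5" means 50 cents, not 5
    }
    let cents = whole_val.checked_mul(100)?.checked_add(frac_val)?;
    Some(if negative { -cents } else { cents })
}

/// True when opening balance plus all transaction amounts equals the closing balance.
fn reconciles(opening: i64, amounts: &[i64], closing: i64) -> bool {
    opening + amounts.iter().sum::<i64>() == closing
}
```

If the balance column were extracted in place of the amount column, the sum would generally not reconcile, so a failed check is a hint that the wrong column was picked up.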


@@ -4,57 +4,51 @@ Book chapter with title, chapter number, author, section headings
## Match Criteria Summary
Documents matching this profile typically contain:
This profile matches book chapters and book excerpts. Documents typically contain:
- **Strong text signals**: Words like "chapter 1", "section 1.1", numbered headings (1., 2., etc.)
- **Structural signals**: Running headers (book title, chapter title, page numbers), chapter heading structures
- **Page count**: Usually 5-50 pages (chapters vary by book type)
- **Layout patterns**: Chapter title at top, chapter number, author (if different from book), numbered sections, running headers
- **Chapter headings**: "Chapter XIV", "Chapter 3", or numbered sections like "3.1 Introduction"
- **Section numbering**: Hierarchical section headings (e.g., "1.2", "3.4.1") or all-caps headings
- **Running headers**: Book title, author name, or chapter title in page headers
- **Multi-page structure**: Book chapters are almost always 5+ pages
The classifier looks for book chapter terminology combined with running headers and chapter heading structures. Documents with "chapter" terminology AND running headers match with highest confidence.
The profile expects formal book formatting with clear chapter/section headings. It works for fiction and non-fiction chapters, textbook excerpts, and technical book chapters.
## Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|-------|------|-------------|----------------|-------------|
| title | string | Chapter title | "The Origins of Language" | First page, after chapter number |
| chapter_number | string | Chapter identifier | "3" or "III" | "chapter" field or first heading number |
| author | string | Chapter author (if applicable) | "Dr. Jane Smith" | "by" or "author" fields near title |
| sections | array | Section headings within chapter | `["Introduction", "Historical Context", "Analysis"]` | Headings throughout the chapter |
| author | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| chapter_number | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: first_page_top |
| sections | array | Extracted from page text using pattern matching | [...] | regex patterns, region: headings |
| title | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: first_page_top |
## Known Limitations
- **Books without chapter numbers**: Books with unnumbered chapters (e.g., named chapters only) may not match correctly
- **Multi-author books**: Chapters by different authors may not extract author if not explicitly labeled
- **Complex numbering**: Non-standard chapter numbering (e.g., "Chapter 3A") may not parse correctly
- **Front/back matter**: Prefaces, introductions, and conclusions without "chapter" labels may not match
- **Non-English books**: Books in other languages may not match pattern lists
- **Scanned books**: Poor OCR quality can lead to missed headings, especially in decorative fonts
- **Ebook exports**: Ebook PDF exports may have unusual layouts (e.g., flowing text) that confuse heading detection
- **Running header variations**: Books with alternating running headers (recto/verso) may not parse consistently
- **Boxed or sidebar sections**: Sidebars or boxed text may be incorrectly identified as sections
*This section documents known edge cases and failure modes. Contributions to improve extraction quality are welcome.*
- **Author extraction**: Assumes author is explicitly listed with "by:" or "author:" markers; books without explicit author attribution may miss this field
- **Section heading parsing**: Only captures top-level headings; nested subsections may be missed
- **Short chapters**: Chapters under 5 pages may not match (page_count_gte: 5)
- **Prefaces/introductions**: Front matter without clear chapter numbering may not match
- **Multi-chapter excerpts**: Excerpts containing multiple chapters may only extract the first chapter number
- **Non-English books**: Pattern matching is optimized for English terminology like "Chapter" and "Section"
## Sample Input
Example fixtures demonstrating this profile are available in `tests/fixtures/classifier/misc/`.
Example fixtures demonstrating this profile are available in `tests/fixtures/profiles/book_chapter/`.
Book chapter fixtures are typically multi-page documents with clear chapter headings and running headers.
*See the classifier corpus for representative documents.*
## Configuration Tips
To override this profile for custom book chapter formats:
To override this profile:
```bash
pdftract profiles export book_chapter > my-chapter.yaml
# Edit my-chapter.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-chapter.yaml document.pdf
pdftract profiles export book_chapter > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
```
Common customizations:
- Add book-specific section numbering patterns to `sections.extraction.patterns`
- For numbered sections (e.g., "1.1", "1.2"), adjust patterns to capture hierarchical numbering
- If chapters use roman numerals (I, II, III), ensure patterns include these formats
---
*This README documents the built-in `book_chapter` profile. See `docs/research/document-classification-and-zone-labeling.md` for classifier theory.*
*This README was auto-generated from `profile.yaml`. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.*
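
The `book_chapter` README gives both Arabic and Roman chapter numbers as examples ("Chapter 3", "Chapter XIV"). The following sketch shows one way a consumer could normalize either form to an integer. It is illustrative only: the profile itself is described as regex-based, and `roman_to_int` and `chapter_number` are hypothetical helpers. The Roman parser is deliberately lenient and does not reject non-canonical numerals.

```rust
/// Converts a Roman numeral to an integer, reading right to left.
/// Lenient: accepts some non-canonical forms. Returns None on other characters.
fn roman_to_int(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    let mut total: u32 = 0;
    let mut largest_seen: u32 = 0;
    for c in s.chars().rev() {
        let value = match c.to_ascii_uppercase() {
            'I' => 1,
            'V' => 5,
            'X' => 10,
            'L' => 50,
            'C' => 100,
            'D' => 500,
            'M' => 1000,
            _ => return None,
        };
        if value < largest_seen {
            // A smaller numeral to the left of a larger one is subtracted (IV = 4).
            total = total.checked_sub(value)?;
        } else {
            total = total.checked_add(value)?;
            largest_seen = value;
        }
    }
    Some(total)
}

/// Extracts the number from a heading such as "Chapter 3" or "Chapter XIV".
fn chapter_number(heading: &str) -> Option<u32> {
    let mut words = heading.split_whitespace();
    if !words.next()?.eq_ignore_ascii_case("chapter") {
        return None;
    }
    // Strip trailing punctuation such as "III:" or "3."
    let token = words
        .next()?
        .trim_end_matches(|c: char| c == ':' || c == '.');
    token.parse::<u32>().ok().or_else(|| roman_to_int(token))
}
```

Note that a heading like "Chapter 3A" yields `None` here, which mirrors the non-standard numbering limitation listed in the README.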


@@ -4,16 +4,13 @@ Fillable form with fields; uses line_dominant reading order and form_fields from
## Match Criteria Summary
Documents matching this profile typically contain:
This profile matches fillable forms and questionnaires. Documents typically contain:
- **Strong text signals**: Words like "form 1099", "application form", "questionnaire", "please fill out", "required fields"
- **Structural signals**: Form field layout (blanks, checkboxes, labeled input areas), blank lines with colons
- **Page count**: Usually 1-10 pages (forms are typically concise)
- **Layout patterns**: Labels followed by blanks/underlines, checkboxes, signature blocks, structured fields
- **Explicit form markers**: "Form 1234", "Application form", "Questionnaire", "Please fill out", "Required fields"
- **Field layout**: Repeated label-value pairs with colons or underscores (e.g., "Name: ______", "Date: __/__/__")
- **Blank input areas**: Lines, boxes, or underscored areas for user input
The classifier looks for form-specific terminology combined with field layout patterns. Documents with "form" terminology AND blank fields match with highest confidence.
**Note**: This is a degenerate profile with **no field extractors**. It uses `line_dominant` reading order and surfaces all `form_fields` from Phase 7.4. The profile enables form-specific processing but does not extract named fields like other profiles.
This is a degenerate profile with **no field extractors**: it only identifies documents as forms and relies on the `form_fields` integration from Phase 7.4 for field extraction. Forms are typically 1-10 pages.
## Extracted Fields
@@ -21,49 +18,32 @@ The classifier looks for form-specific terminology combined with field layout pa
|-------|------|-------------|----------------|-------------|
| *(none)* | - | *This profile has no field extractors* | - | - |
Instead of named fields, this profile integrates with Phase 7.4's `form_fields` system, which extracts:
- Text input fields (labels + values)
- Checkbox/radio button states
- Signature blocks
- Date fields
- Multi-line text areas
See Phase 7.4 documentation for the `form_fields` schema.
## Known Limitations
- **No named extraction**: Unlike other profiles, this does not return named fields; users must process `form_fields` output
- **Handwritten forms**: Handwritten responses may not be OCRed correctly
- **Complex layouts**: Forms with non-standard layouts (e.g., grids, nested sections) may confuse field detection
- **Checkboxes and radio buttons**: Checkbox states may be unreliable depending on PDF encoding
- **Multi-page forms**: Fields spanning page boundaries may be split incorrectly
- **Non-English forms**: Forms in other languages may not match pattern lists
- **Scanned forms**: Poor scan quality can lead to missed fields or incorrect label-value pairing
- **Dynamic forms**: Forms with conditional fields (e.g., "if yes, go to section B") are not interpreted
*This section documents known edge cases and failure modes. Contributions to improve extraction quality are welcome.*
- **No field extraction**: This profile only classifies documents as forms; actual field extraction is handled by the `form_fields` integration (Phase 7.4), which must be run separately
- **Pre-filled forms**: Forms with already-filled handwritten or typed responses may confuse the classifier's field layout detection
- **Complex layouts**: Forms with non-standard layouts (e.g., grids, nested tables, multi-column designs) may not be recognized
- **Scanned forms**: Poor scan quality may cause field labels to be missed or misclassified
- **Non-English forms**: Pattern matching is optimized for English terminology like "form", "application", "questionnaire"
## Sample Input
Example fixtures demonstrating this profile are available in `tests/fixtures/classifier/misc/`.
Example fixtures demonstrating this profile are available in `tests/fixtures/profiles/form/`.
Form fixtures are typically single-page documents with labeled fields and blanks for user input.
*See the classifier corpus for representative documents.*
## Configuration Tips
To override this profile for custom form formats:
To override this profile:
```bash
pdftract profiles export form > my-form.yaml
# Edit my-form.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-form.yaml document.pdf
pdftract profiles export form > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
```
Common customizations:
- Add form-specific patterns to `match.text_patterns` for proprietary form types
- If you need named field extraction, copy this profile and add `profile_fields` entries
- For government forms (e.g., IRS, USCIS), create specific profiles with known field mappings
**Integration with Phase 7.4**: This profile sets `form_fields_integration: true`, which enables the form field extraction pipeline. The extracted `form_fields` array is included in the output JSON.
---
*This README documents the built-in `form` profile. See `docs/research/document-classification-and-zone-labeling.md` for classifier theory and Phase 7.4 for `form_fields` schema.*
*This README was auto-generated from `profile.yaml`. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.*
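
The "field layout" signal described in the `form` README (label-value pairs such as "Name: ______" or "Date: __/__/__") can be illustrated with a small line-level check. This is a sketch under assumptions: the real classifier's rules are not shown in this commit, and `blank_field_label` is a hypothetical helper.

```rust
/// Returns the label of a line that looks like an unfilled form field,
/// for example "Name: ______" or "Date: __/__/__". Hypothetical helper.
fn blank_field_label(line: &str) -> Option<&str> {
    let (label, rest) = line.split_once(':')?;
    let label = label.trim();
    let rest = rest.trim();
    let underscores = rest.chars().filter(|c| *c == '_').count();
    // The value area may contain only blank-marker characters.
    let only_blank_chars = rest.chars().all(|c| c == '_' || c == '/' || c == ' ');
    if !label.is_empty() && underscores >= 2 && only_blank_chars {
        Some(label)
    } else {
        None
    }
}
```

A line with a typed value, such as "Name: John", returns `None`, which is consistent with the README's note that pre-filled forms can weaken layout-based detection.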


@@ -4,58 +4,52 @@ Court filing with case number, court, parties, filing date, docket
## Match Criteria Summary
Documents matching this profile typically contain:
This profile matches court filings and legal documents. Documents typically contain:
- **Strong text signals**: Words like "case #:", "docket #:", "court of", "superior court", "district court"
- **Structural signals**: Court header at top, page numbers, signature blocks
- **Page count**: Usually 1-100 pages (filings vary by document type)
- **Layout patterns**: Court caption at top (court name, case number, parties), document body, docket entries or certificate of service
- **Case/docket identifiers**: "Case #:", "Docket #:", "Civil Action No."
- **Court naming**: "Court of", "Superior Court", "District Court", "United States District Court"
- **Party designations**: "Plaintiff:", "Defendant:", "Petitioner:", "Respondent:" or "v." notation
- **Court header formatting**: Formal court headers at the top of pages with page numbers
The classifier looks for legal filing terminology combined with court header structures. Documents with "case/docket" terminology AND court headers match with highest confidence.
Court filings range from 1-100 pages. The profile expects formal legal formatting with case captions and party identification.
## Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|-------|------|-------------|----------------|-------------|
| case_number | string | Case or docket number | "CV-2024-001234" or "1:24-cv-00123" | "case", "docket", or "civil action no." fields |
| court | string | Court name | "Superior Court of California" | First page top, court name patterns |
| parties | array | Parties to the case | `["Smith", "Jones"]` | "plaintiff", "defendant", "petitioner", "respondent", or "v." patterns |
| filing_date | date | Date document was filed | 2024-01-15 | "filed", "submitted", or "date filed" fields |
| docket_entries | array | Docket or proceeding entries | `["[1] Complaint filed"]` | After "docket" heading, numbered list |
| case_number | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| court | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: first_page_top |
| docket_entries | array | Extracted from page text using pattern matching | [...] | regex patterns, region: after_docket_heading |
| filing_date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns |
| parties | array | Extracted from page text using pattern matching | [...] | regex patterns |
## Known Limitations
- **Multi-case filings**: Filings referencing multiple cases may only extract the first case number
- **Sealed filings**: Redacted or sealed filings may have missing information
- **Exhibit attachments**: Exhibits attached to filings are not processed separately
- **Complex caption formats**: Some courts use non-standard caption formats that may not parse correctly
- **Non-English filings**: Filings in languages other than English may not match pattern lists
- **Scanned filings**: Poor OCR quality can lead to missed fields, especially in dense captions
- **Multiple parties**: Cases with many parties (e.g., class actions) may not extract all parties
- **Electronically filed documents**: Some e-filing systems add headers/footers that may interfere with extraction
- **State-specific formats**: Different states have different caption formats; some may not be supported
*This section documents known edge cases and failure modes. Contributions to improve extraction quality are welcome.*
- **Multi-party cases**: Only captures the first two parties (plaintiff/petitioner and defendant/respondent); additional parties are not extracted
- **Cross-claims and counterclaims**: Treated as separate parties; complex multi-party litigation may not extract all parties correctly
- **Sealed/redacted filings**: Redacted case numbers or party names may not extract correctly
- **International courts**: Pattern matching is optimized for US court naming conventions; non-US court formats may fail
- **Docket entry parsing**: Only captures bracketed docket entries ([1], [2]); alternative numbering formats may be missed
- **Amended filings**: Amendments are treated as separate documents; cross-references between filings are not resolved
## Sample Input
Example fixtures demonstrating this profile are available in `tests/fixtures/classifier/contract/`.
Example fixtures demonstrating this profile are available in `tests/fixtures/profiles/legal_filing/`.
Legal filing fixtures are typically multi-page documents with court captions at the top.
*See the classifier corpus for representative documents.*
## Configuration Tips
To override this profile for custom legal filing formats:
To override this profile:
```bash
pdftract profiles export legal_filing > my-filing.yaml
# Edit my-filing.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-filing.yaml document.pdf
pdftract profiles export legal_filing > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
```
Common customizations:
- Add court-specific patterns to `case_number.extraction.patterns`
- For state-specific formats, update `court.extraction.patterns` with local court names
- Adjust `parties.extraction.patterns` for different party types (e.g., "appellant", "appellee")
---
*This README documents the built-in `legal_filing` profile. See `docs/research/document-classification-and-zone-labeling.md` for classifier theory.*
*This README was auto-generated from `profile.yaml`. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.*
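
The `legal_filing` limitations state that only bracketed docket entries ("[1]", "[2]") are captured. The sketch below shows what that restriction looks like in practice using only the standard library. It is an illustration rather than the profile's actual pattern, and `parse_docket_entry` is a hypothetical name.

```rust
/// Parses a line such as "[1] Complaint filed" into (entry number, text).
/// Lines that use other numbering styles return None. Hypothetical helper.
fn parse_docket_entry(line: &str) -> Option<(u32, &str)> {
    let rest = line.trim().strip_prefix('[')?;
    let (number, text) = rest.split_once(']')?;
    let number: u32 = number.trim().parse().ok()?;
    let text = text.trim();
    if text.is_empty() {
        None
    } else {
        Some((number, text))
    }
}
```

An entry written as "1. Complaint filed" is skipped by this parser, which is the kind of alternative numbering the limitation warns about.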


@@ -4,59 +4,50 @@ Presentation slides with title, presenter, date, slide titles
## Match Criteria Summary
Documents matching this profile typically contain:
This profile matches presentation slides exported to PDF. Documents typically exhibit:
- **Strong text signals**: Words like "presentation", "slide N", "table of contents"
- **Structural signals**: Landscape page orientation (width > height), 3+ pages, large centered text blocks (titles)
- **Page count**: Usually 3-200 pages (presentations vary widely)
- **Layout patterns**: Title slide with centered text, subsequent slides with headings at top or left, content below
- **Landscape orientation**: Slides are almost always landscape (4:3 or 16:9 aspect ratio)
- **Large centered text**: Title slides have large, centered text
- **Multiple pages**: 3+ pages minimum; slide decks often run 10-200 pages
- **Slide numbering**: "Slide 1", "Slide 2", or table of contents
The classifier looks for presentation-specific terminology combined with landscape orientation and multiple pages. Landscape documents with 3+ pages AND large centered text match with highest confidence.
**Note**: Slide deck PDFs vary enormously in quality depending on the export method (PowerPoint "Save as PDF", Keynote export, Google Slides PDF, etc.). Extraction quality depends heavily on the exporter's text rendering.
This is a degenerate profile with minimal field extraction (title, presenter, date, slide titles) because slide-deck PDFs vary enormously depending on the presentation software and exporter.
## Extracted Fields
| Field | Type | Description | Example Value | Source Hint |
|-------|------|-------------|----------------|-------------|
| title | string | Presentation title | "Q4 2024 Business Review" | First slide, centered, large font |
| presenter | string | Presenter name(s) | "Jane Doe, CEO" | First slide, below title |
| date | string | Presentation date | 2024-01-15 | First slide, bottom area |
| slide_titles | array | Title of each slide | `["Overview", "Financial Results", "Q&A"]` | Per-slide extraction, top or center |
| date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns, region: first_page_bottom |
| presenter | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: first_page_below_title |
| slide_titles | array | Extracted from page text using pattern matching | [...] | regex patterns, region: top_left_or_centre, per-page |
| title | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: first_page_centre |
## Known Limitations
- **Export quality**: Poor PDF exports (e.g., from old PowerPoint versions) may have garbled text or text-as-images
- **Image-only slides**: Slides that are pure images (e.g., photos, screenshots) have no extractable text
- **Complex layouts**: Slides with multiple text boxes, overlapping elements, or non-standard layouts may not extract correctly
- **Animations and transitions**: PDF exports capture final state only; animated builds are not represented
- **Non-title slides**: Slides without clear titles (e.g., photo-only slides) will have null entries in `slide_titles`
- **Speaker notes**: Speaker notes are typically not included in PDF exports and cannot be extracted
- **Non-English presentations**: Presentations in other languages may not match pattern lists
- **Handout formats**: 3-up or 6-up handout PDFs (multiple slides per page) are not supported
- **Portrait slides**: Rare portrait-orientation slides may not be detected correctly
*This section documents known edge cases and failure modes. Contributions to improve extraction quality are welcome.*
- **Exporter variability**: Slide-deck PDFs vary enormously depending on the presentation software (PowerPoint, Keynote, Google Slides) and PDF exporter; extraction quality depends heavily on how text was converted to PDF
- **Image-heavy slides**: Slides with minimal text (e.g., photo slides, diagrams) will not produce meaningful slide_titles
- **Non-standard layouts**: Slides without clear title regions (e.g., all-center layouts, artistic templates) may not extract slide_titles correctly
- **Presenter extraction**: Assumes the presenter name appears below the title on the first slide; alternative formats (e.g., title slide with no presenter) will miss this field
- **Date parsing**: Date extraction from first-page footer may fail if the presentation date is in a non-standard format
## Sample Input
Example fixtures demonstrating this profile are available in `tests/fixtures/classifier/misc/`.
Example fixtures demonstrating this profile are available in `tests/fixtures/profiles/slide_deck/`.
Slide deck fixtures are typically landscape PDFs with clear title slides and multiple pages.
*See the classifier corpus for representative documents.*
## Configuration Tips
To override this profile for custom slide deck formats:
To override this profile:
```bash
pdftract profiles export slide_deck > my-slides.yaml
# Edit my-slides.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-slides.yaml document.pdf
pdftract profiles export slide_deck > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf
```
Common customizations:
- Adjust `slide_titles.extraction.per_page` to false if you only want the first slide's title
- For handout formats (multiple slides per page), consider creating a custom profile with different extraction logic
- Add presenter-specific patterns if your organization uses consistent presenter naming conventions
---
*This README documents the built-in `slide_deck` profile. See `docs/research/document-classification-and-zone-labeling.md` for classifier theory.*
*This README was auto-generated from `profile.yaml`. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.*
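
The structural signals named in the `slide_deck` README (landscape orientation, meaning width greater than height, and at least three pages) reduce to a short predicate. The sketch below assumes page sizes are available in points and requires every page to be landscape. Both are assumptions made for illustration, and `PageSize` and `looks_like_slide_deck` are hypothetical names rather than pdftract types.

```rust
/// Page dimensions in PDF points (1/72 inch). Hypothetical type.
#[derive(Debug, Clone, Copy)]
struct PageSize {
    width_pt: f64,
    height_pt: f64,
}

/// Landscape means the page is wider than it is tall.
fn is_landscape(page: PageSize) -> bool {
    page.width_pt > page.height_pt
}

/// True when the document has 3+ pages and all of them are landscape.
/// Requiring *all* pages to be landscape is an assumption of this sketch.
fn looks_like_slide_deck(pages: &[PageSize]) -> bool {
    pages.len() >= 3 && pages.iter().all(|p| is_landscape(*p))
}
```

This covers only the structural half of the match. The README also lists text signals and large centered text, which this predicate does not model.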


@@ -33,6 +33,7 @@ struct ExtractionConfig {
#[serde(default)]
per_page: Option<bool>,
#[serde(default)]
#[allow(dead_code)]
fallback: serde_yaml::Value,
}