
SCIENTIFIC_PAPER Profile

Academic paper with title, authors, abstract, DOI, references

Match Criteria Summary

A document matches this profile when it displays the characteristic structure of an academic or scientific paper. The classifier identifies section headings like "abstract", "introduction", "keywords", and "references". Structurally, scientific papers are recognized by their two-column layout (common in journal publications) and the presence of a bibliography or references section. DOI identifiers are strong matching signals when present. Conference papers and journal articles typically run 4 to 30 pages. The combination of author affiliations, abstract text, and structured sections distinguishes this profile from other document types.
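The signals described above can be illustrated with a minimal Python sketch. It is not the pdftract classifier: the function name `match_signals`, the heading list, the simplified DOI pattern, and the returned keys are all assumptions made for illustration, and the two-column flag is passed in rather than detected.

```python
import re

# Hypothetical heading list and DOI pattern; neither is taken from profile.yaml.
SECTION_HEADINGS = ("abstract", "introduction", "keywords", "references")
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")


def match_signals(page_texts, two_column=False):
    """Collect the matching signals present in a list of per-page text strings."""
    full_text = "\n".join(page_texts)
    found = []
    for heading in SECTION_HEADINGS:
        # A heading alone on its line, optionally numbered ("1. Introduction").
        pattern = rf"^[ \t]*(?:\d+\.?[ \t]+)?{heading}[ \t]*:?[ \t]*$"
        if re.search(pattern, full_text, re.IGNORECASE | re.MULTILINE):
            found.append(heading)
    return {
        "headings": found,
        # Only the first page is searched for a DOI in this sketch.
        "has_doi": bool(page_texts) and bool(DOI_PATTERN.search(page_texts[0])),
        "two_column": two_column,
        "typical_page_count": 4 <= len(page_texts) <= 30,
    }


pages = [
    "A Title\nAbstract\nSome text. doi:10.1000/xyz123",
    "1. Introduction\nBody text",
    "More body text",
    "References\n[1] A. Author. A paper.",
]
signals = match_signals(pages, two_column=True)
```

How the real classifier weighs these signals against each other is defined in profile.yaml and is not reproduced here.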

Extracted Fields

| Field | Type | Description | Example Value | Source Hint |
| title | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: first_page_top |
| authors | array | Extracted from page text using pattern matching | [...] | regex patterns, region: first_page_top_below_title |
| abstract | string | Extracted from page text using pattern matching | "example value" | regex patterns, region: after_abstract_heading |
| doi | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| journal | string | Extracted from page text using pattern matching | "example value" | regex patterns |
| publication_date | date | Extracted from page text using pattern matching | 2024-01-15 | regex patterns |
| references | array | Extracted from page text using pattern matching | [...] | region: after_references_heading |
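As a rough illustration of what regex-based extraction with a region hint can look like, the Python sketch below pulls a DOI and an abstract out of first-page text. The patterns and the function names `extract_doi` and `extract_abstract` are assumptions for this example; the profile's actual patterns live in profile.yaml and may differ.

```python
import re

# A commonly used DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)")
ABSTRACT_HEADING_RE = re.compile(r"^[ \t]*abstract[ \t]*:?[ \t]*$", re.IGNORECASE | re.MULTILINE)
NEXT_HEADING_RE = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?(?:keywords|introduction)\b", re.IGNORECASE | re.MULTILINE
)


def extract_doi(first_page_text):
    """Return the first DOI-like string on the page, or None."""
    match = DOI_RE.search(first_page_text)
    if match is None:
        return None
    # Trailing sentence punctuation is trimmed; this can clip a DOI that truly ends in ")".
    return match.group(1).rstrip(".,;)")


def extract_abstract(text):
    """Return the text between an 'Abstract' heading and the next known heading."""
    heading = ABSTRACT_HEADING_RE.search(text)
    if heading is None:
        return None
    rest = text[heading.end():]
    stop = NEXT_HEADING_RE.search(rest)
    body = rest[: stop.start()] if stop else rest
    return " ".join(body.split())  # collapse line breaks and repeated spaces


page = (
    "A Title\n"
    "Abstract\n"
    "We study X.\n"
    "Results hold.\n"
    "Keywords: a, b\n"
    "1. Introduction\n"
    "See https://doi.org/10.1038/s41586-020-2649-2.\n"
)
doi = extract_doi(page)
abstract = extract_abstract(page)
```

Stopping the abstract at the next recognized heading is one simple way to bound the region; it fails when that heading is missing or uses an unexpected label.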

Known Limitations

  • DOIs in footnotes or page headers are not extracted; only first-page DOIs are picked up
  • Papers with non-standard author formats (e.g., very long author lists, "et al." handling) may truncate author lists
  • Abstract extraction may include section heading text if abstract boundaries are ambiguous
  • Single-column papers (e.g., some arXiv preprints) do not trigger the two-column layout signal and may fail to match
  • References extraction captures numbered citations but may not handle unstructured reference formats
  • Non-English papers may not match due to English-only section heading patterns
  • Papers with complex figure/table layouts interrupting text flow may have extraction errors
  • Conference proceedings vs. journal distinctions are not made; both match this profile
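To make the references limitation concrete, here is a minimal Python sketch of splitting a references section on numbered markers such as "[1]" or "1.". The marker pattern and the function name `split_numbered_references` are assumptions, not the profile's implementation. An unnumbered (for example author-year) list produces no entries under this approach, which is the failure mode noted above.

```python
import re

# Start of a numbered entry at the beginning of a line: "[12] " or "12. ".
# A wrapped line that happens to begin with "2020. " would be split incorrectly.
ENTRY_START = re.compile(r"^[ \t]*(?:\[\d+\]|\d+\.)[ \t]+", re.MULTILINE)


def split_numbered_references(section_text):
    """Split the text after a references heading into one string per numbered entry."""
    starts = list(ENTRY_START.finditer(section_text))
    entries = []
    for i, marker in enumerate(starts):
        # An entry runs from the end of its marker to the start of the next marker.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(section_text)
        body = section_text[marker.end():end]
        entries.append(" ".join(body.split()))  # rejoin wrapped lines
    return entries


numbered = (
    "[1] A. Author. Title one.\n"
    "    Journal, 2020.\n"
    "[2] B. Author. Title two. 2021.\n"
)
unnumbered = "Author, A. (2020). Title one.\nAuthor, B. (2021). Title two.\n"

numbered_entries = split_numbered_references(numbered)
unnumbered_entries = split_numbered_references(unnumbered)
```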

Sample Input

Example fixtures demonstrating this profile are available in tests/fixtures/profiles/scientific_paper/.

See the classifier corpus for representative documents.

Configuration Tips

To override this profile:

pdftract profiles export scientific_paper > my-profile.yaml
# Edit my-profile.yaml to customize match criteria, fields, or extraction patterns
pdftract extract --profile my-profile.yaml document.pdf

For papers from specific venues (e.g., ACM, IEEE), consider adding venue-specific patterns to the journal field extraction. For preprints or conference papers, you may want to adjust the match.structural signals to not require two-column layout.


This README was auto-generated from profile.yaml. Update the Match Criteria Summary and Known Limitations sections with profile-specific guidance.