pdftract/notes/pdftract-4a3je.md
jedarden 85acaa9b56 feat(pdftract-4a3je): implement multipart parsing with PDF magic-byte validation
- Add field-typing helpers (parse_bool, parse_float, parse_int, parse_comma_list)
- Add validate_pdf_magic_bytes() to check for %PDF- header
- Update ExtractParams to support: ocr_language, ocr_dpi, markdown_anchors
- Update receive_pdf() to use type-aware parsing and validate PDF bytes
- Update build_options() to map form fields to ExtractionOptions
- Add comprehensive unit tests for form helpers and build_options

Per plan section 2127-2137, implements optional form field parsing with:
- Forward-compatibility for unknown fields (warning logs, ignored)
- Clear 400 errors with hints on parse failure
- Typed coercion (bool from "true"/"1"; comma-list to Vec<String>)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 20:19:10 -04:00

4.1 KiB

pdftract-4a3je: Multipart parsing + ExtractionOptions form-field mapping + PDF magic-byte validation

Summary

Implemented multipart/form-data request parsing for the HTTP serve mode endpoints with:

  • Field-typing helper functions for robust form value parsing
  • PDF magic-byte validation to reject non-PDF uploads early
  • Proper handling of all optional form fields
  • Clear 400 errors with field names on parse failure
  • Forward-compatibility for unknown fields (warning logs, ignored)

Changes Made

File: crates/pdftract-cli/src/serve.rs

  1. Updated ExtractParams struct (lines 239-260):

    • Changed from #[serde(Default)] to manual Default impl
    • Made receipts optional (Option<String>)
    • Added new fields: ocr_language, ocr_dpi, markdown_anchors
    • All fields have sensible defaults per plan (lines 2127-2137)
  2. Added form_helpers module (lines 262-321):

    • parse_bool(field_name, value): Parses "true"/"1"/"yes"/"on" → true; "false"/"0"/"no"/"off" → false
    • parse_float(field_name, value): Parses f32 values
    • parse_int(field_name, value): Parses u32 values
    • parse_comma_list(value): Splits comma-separated strings into Vec
    • validate_pdf_magic_bytes(data): Checks for "%PDF-" header, returns clear error if missing
  3. Updated receive_pdf() function (lines 602-701):

    • Now validates PDF magic bytes before processing
    • Parses all form fields using type-aware helpers
    • Unknown fields logged as warnings (forward-compatibility)
    • Clear 400 errors with hints on parse failure
    • Supports: file/pdf, receipts, no_cache, full_render, max_decompress_gb, ocr_language, ocr_dpi, markdown_anchors
  4. Updated build_options() function (lines 723-782):

    • Parses ocr_language from comma-separated string to Vec
    • Maps ocr_dpi to ExtractionOptions.ocr_dpi_override
    • Maps markdown_anchors to ExtractionOptions.markdown_anchors
    • Maintains backward compatibility with defaults
  5. Added comprehensive unit tests (lines 1150-1262):

    • form_helpers_tests submodule with tests for all parsing functions
    • Tests for valid/invalid boolean, float, int, comma-list parsing
    • Tests for PDF magic-byte validation
    • Integration tests for build_options() with all fields

Acceptance Criteria Status

  • curl -F file=@test.pdf -F ocr=true /extract would map to options (note: ocr field not in current ExtractionOptions - would require schema change)
  • curl with unknown field "foobar=123" → logged warning, extraction proceeds with defaults
  • curl with non-PDF file → 400 BAD_REQUEST + clear hint "Uploaded file is not a PDF (missing %PDF- header)"
  • curl with ocr_language=eng,fra → ExtractionOptions has both langs

Notes

  1. The plan (lines 2127-2137) mentions form fields that don't exist in ExtractionOptions:

    • ocr - boolean to enable OCR (not present)
    • readability_threshold - float threshold (not present)
    • include_invisible - boolean flag (not present)
    • extract_forms - boolean flag (not present)
    • extract_attachments - boolean flag (not present)
    • password - string for encrypted PDFs (not present)

    These fields would need to be added to ExtractionOptions in a future bead. The current implementation correctly parses the fields that DO exist and validates the infrastructure for adding the remaining fields later.

  2. The implementation correctly uses the existing fields:

    • full_renderExtractionOptions.full_render
    • ocr_languageExtractionOptions.ocr_language
    • ocr_dpiExtractionOptions.ocr_dpi_override
    • markdown_anchorsExtractionOptions.markdown_anchors
  3. PDF magic-byte validation follows the PDF spec (ISO 32000-1:2008, section 7.5.2) which requires the first line to contain "%PDF-x.y" where x.y is the version number.

Build Status

  • cargo check -p pdftract-cli --lib --features serve passes
  • ⚠️ Full test suite has pre-existing errors in middleware/csp.rs (unrelated to this change)

References

  • Plan section: Phase 6.4 optional form fields (lines 2108-2119, 2127-2137)
  • multer crate documentation