Add DiagnosticsCollector type for thread-safe diagnostic aggregation, add hint field to DiagnosticJson, add missing error codes (IMG_SOURCE_MIXED, PROFILE_INVALID, REPAIR_RESCUED_FROM_BACKWARDS_XREF), and create comprehensive diagnostics documentation.

Changes:
- DiagnosticsCollector: Arc<Mutex<Vec<Diagnostic>>> wrapper with emit() helpers for emitting diagnostics from multiple threads
- DiagnosticJson: add hint: Option<String> field for suggested actions
- DiagCode: add ImgSourceMixed, ProfileInvalid, RepairRescuedFromBackwardsXref
- docs/integrations/diagnostics-codes.md: comprehensive code catalog

Closes: pdftract-2u6q2
288 lines
13 KiB
Markdown
# pdftract Diagnostic Codes

This document catalogs all diagnostic codes emitted by pdftract during PDF extraction. Each diagnostic has a stable SCREAMING_SNAKE_CASE identifier, a severity level, and a suggested user action.

## Diagnostic Format

All diagnostics follow this structure (`null | X` marks a nullable field):

```json
{
  "code": "DIAGNOSTIC_CODE",
  "message": "Human-readable description",
  "severity": "info|warning|error|fatal",
  "page_index": null | 0-based page number,
  "location": null | {"object_number": N, "generation_number": G},
  "hint": null | "Suggested action"
}
```

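As a minimal sketch of how a consumer might model this structure, the following Python snippet maps one diagnostic object onto a typed record. The `Diagnostic` class, its `from_dict` helper, and the sample values are illustrative assumptions, not part of pdftract; only the field names and severity strings come from the structure above.

```python
from dataclasses import dataclass
from typing import Optional

# Severity strings as listed in the structure above.
SEVERITIES = ("info", "warning", "error", "fatal")


@dataclass(frozen=True)
class Diagnostic:
    """Hypothetical consumer-side record for one diagnostic object."""
    code: str
    message: str
    severity: str
    page_index: Optional[int] = None   # 0-based, or None
    location: Optional[dict] = None    # {"object_number": N, "generation_number": G}
    hint: Optional[str] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "Diagnostic":
        severity = raw["severity"]
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        return cls(
            code=raw["code"],
            message=raw["message"],
            severity=severity,
            page_index=raw.get("page_index"),
            location=raw.get("location"),
            hint=raw.get("hint"),
        )


# Sample values are made up for illustration.
example = Diagnostic.from_dict({
    "code": "STREAM_TRUNCATED",
    "message": "Stream data truncated",
    "severity": "warning",
    "page_index": 3,
    "location": {"object_number": 12, "generation_number": 0},
    "hint": None,
})
```

Treating the three nullable fields as optional keeps the record usable even if a producer omits them instead of sending `null`; whether pdftract ever omits them is not stated here.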
## Code Categories
### STRUCT_* — PDF Structure Errors

Errors related to PDF syntax, object parsing, and document structure.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `STRUCT_INVALID_NAME` | Warning | Invalid name character or malformed name object | 1.1 |
| `STRUCT_INVALID_HEX` | Warning | Invalid hex character in hex string or name escape | 1.1 |
| `STRUCT_INVALID_OCTAL` | Warning | Invalid octal escape sequence in literal string | 1.1 |
| `STRUCT_INVALID_STREAM_HEADER` | Warning | Invalid stream header (stream keyword not followed by proper newline) | 1.1 |
| `STRUCT_UNEXPECTED_BYTE` | Warning | Unexpected byte (e.g., stray `>` not part of `>>`) | 1.1 |
| `STRUCT_UNEXPECTED_EOF` | Warning | Unexpected end of file while parsing a token | 1.1 |
| `STRUCT_UNTERMINATED_STRING` | Warning | Unterminated literal string (missing closing paren) | 1.1 |
| `STRUCT_MISSING_KEY` | Warning | Missing required dictionary key | 1.4 |
| `STRUCT_CIRCULAR_REF` | Warning | Circular reference detected (A → B → A) | 1.2 |
| `STRUCT_XOBJECT_CYCLE` | Warning | Form XObject cycle detected | 3.3 |
| `STRUCT_DEPTH_EXCEEDED` | Warning | Dictionary nesting depth exceeds limit | 1.2 |
| `STRUCT_INVALID_DICT_VALUE` | Warning | Invalid dictionary value (missing value after key) | 1.2 |
| `STRUCT_INVALID_DICT_KEY` | Warning | Invalid dictionary key (not a name object) | 1.2 |
| `STRUCT_INVALID_INDIRECT_HEADER` | Warning | Invalid indirect object header (`N G obj`) | 1.2 |
| `STRUCT_INTEGER_OVERFLOW` | Warning | Integer overflow during parsing | 1.2 |
| `STRUCT_REAL_INVALID` | Warning | Invalid real number literal | 1.1 |
| `STRUCT_INVALID_NUMBER` | Warning | Invalid numeric literal | 1.1 |
| `STRUCT_INVALID_ASCII85` | Warning | Invalid ASCII85 character or malformed stream | 1.5 |
| `STRUCT_INVALID_OBJSTM` | Warning | Invalid object stream format | 1.2 |
| `STRUCT_INVALID_GEOMETRY` | Warning | Invalid geometry value (NaN or Inf in MediaBox/CropBox/Rotate) | 1.7 |
| `STRUCT_INVALID_TYPE` | Warning | Invalid object type (expected type not found) | 5.2.1 |
| `STRUCT_INVALID_UTF16` | Warning | Invalid UTF-16BE encoding in string | 1.4 |
| `STRUCT_UNRESOLVED_DESTINATION` | Warning | Unresolved named destination | 1.4 |
| `STRUCT_NON_GOTO_OUTLINE` | Warning | Non-GoTo action in outline | 1.4 |
| `STRUCT_INVALID_PDFDOC_ENCODING` | Warning | Invalid PDFDocEncoding in string | 1.4 |
| `STRUCT_HYBRID_CONFLICT` | Warning | Hybrid xref conflict: traditional and stream disagree | 1.3 |
| `STRUCT_INCOMPLETE_COVERAGE` | Info | StructTree coverage below 80% with /Suspects true | 7.1.4 |
| `STRUCT_INVALID_PREV_OFFSET` | Warning | Invalid /Prev offset in xref chain | 1.3 |
| `STRUCT_INVALID_BDC_OPERAND` | Info | Invalid BDC operand | 3.4 |

### XREF_* — Cross-Reference Table Errors

Errors related to the xref table and trailer.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `XREF_INVALID_HEADER` | Warning | Invalid xref keyword or header | 1.3 |
| `XREF_INVALID_ENTRY` | Warning | Malformed xref entry (not 20 bytes, bad format) | 1.3 |
| `XREF_INVALID_SUBSECTION_HEADER` | Warning | Invalid subsection header (not "start count") | 1.3 |
| `XREF_OBJECT_ZERO_NOT_FREE` | Warning | Object 0 is not free (violates PDF spec) | 1.3 |
| `XREF_TRAILER_NOT_FOUND` | Warning | Trailer dictionary not found or malformed | 1.3 |
| `XREF_TRUNCATED` | Warning | Truncated xref table (unexpected EOF) | 1.3 |
| `XREF_REPAIRED` | Info | Xref was reconstructed via forward scan (EC-07) | 1.3 |
| `XREF_LINEARIZED_NO_FORWARD_SCAN` | Warning | Forward scan disabled for linearized files | 1.3 |
| `XREF_REMOTE_NO_FORWARD_SCAN` | Warning | Forward scan disabled for HTTP sources | 1.3 |
| `XREF_INVALID_STREAM_FORMAT` | Warning | Invalid xref stream format | 1.3 |
| `XREF_INVALID_STREAM_ENTRY` | Warning | Invalid xref stream entry | 1.3 |

### STREAM_* — Stream Decoder Errors

Errors related to stream decompression and filters.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `STREAM_DECODE_ERROR` | Warning | Stream decompression failed (corrupt data) | 1.5 |
| `STREAM_BOMB` | Error | Decompression bomb limit exceeded | 1.5 |
| `STREAM_UNKNOWN_FILTER` | Warning | Unknown filter name | 1.5 |
| `STREAM_INVALID_PARAMS` | Warning | Invalid filter parameters | 1.5 |
| `STREAM_INVALID_JPEG` | Warning | JPEG data has invalid or missing markers | 1.5 |
| `STREAM_INVALID_CCITT` | Warning | CCITT fax data has invalid or missing parameters | 1.5 |
| `STREAM_TRUNCATED` | Warning | Stream data truncated | 1.5 / 5.2.1 |

### ENCRYPTION_* — Encryption Errors

Errors related to PDF encryption and passwords.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `ENCRYPTION_UNSUPPORTED` | Fatal | Unsupported encryption or no password supplied | 1.4 |
| `ENCRYPTION_WRONG_PASSWORD` | Fatal | Password incorrect | 1.4 |

### PAGE_* — Page-Level Errors

Errors related to page structure and properties.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `PAGE_OUT_OF_RANGE` | Error | Page number out of range | 1.8 |
| `PAGE_INVALID_COUNT` | Warning | Invalid /Count in /Pages tree | 1.4 |
| `PAGE_INVALID_ROTATE` | Warning | Invalid /Rotate value (not multiple of 90) | 1.4 |

### FONT_* — Font Pipeline Errors

Errors related to font parsing and glyph mapping.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `FONT_GLYPH_UNMAPPED` | Warning | Glyph could not be mapped to Unicode | 2.2 |
| `FONT_NOT_FOUND` | Warning | Font not found or couldn't be parsed | 2.1 |
| `FONT_INVALID_CMAP` | Warning | Invalid CMap format | 2.2 |
| `FONT_PARSE_FAILED` | Warning | Font program parsing failed | 2.1 |
| `FONT_UNSUPPORTED` | Warning | Font type not supported for embedded loading | 2.1 |
| `FONT_CIDTOGIDMAP_TRUNCATED` | Warning | CIDToGIDMap stream has odd byte count | 2.1 |
| `ENCODING_DIFFERENCE_OUT_OF_RANGE` | Warning | Character code in /Differences exceeds valid range | 2.2 |
| `FONT_TYPE3_WIDTHS_LENGTH_MISMATCH` | Warning | Type3 font /Widths array length mismatch | 2.4 |

### CJK_* — CJK Encoding Errors

Errors related to CJK character encoding.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `CJK_DECODE_MALFORMED` | Warning | Malformed byte sequence in CJK encoding | 2.3 |

### OCR_* — OCR Pipeline Errors

Errors related to OCR processing.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `OCR_JBIG2_UNSUPPORTED` | Warning | JBIG2 decoder not available | 1.5 / 5.2 |
| `OCR_JPX_UNSUPPORTED` | Warning | JPEG2000 (JPX) decoder not available | 1.5 / 5.2 |
| `OCR_CCITT_UNSUPPORTED` | Warning | CCITT fax decoder not available | 1.5 / 5.2 |
| `OCR_TESSERACT_FAILED` | Warning | Tesseract OCR failed | 5.4 |
| `OCR_BROKENVECTOR_UNAVAILABLE` | Warning | OCR unavailable on broken-vector page | 4.7 |
| `OCR_LANGUAGE_UNAVAILABLE` | Warning | Requested OCR language pack not available | 5.4 |

### IMG_* — Image Processing Errors

Errors related to image extraction and processing.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `IMG_SOFTMASK_UNSUPPORTED` | Warning | Image soft mask not supported in direct compositing | 5.2.1 |
| `IMG_UNSUPPORTED_FORMAT` | Warning | Image format not supported | 5.2.1 |
| `IMG_DESKEW_OUT_OF_RANGE` | Warning | Deskew angle out of detectable range | 5.3.1 |
| `IMG_SOURCE_MIXED` | Warning | Image sources mixed in unexpected way | 5.3.2 |

### REMOTE_* — Remote Source Errors

Errors related to HTTP fetching and remote sources.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `REMOTE_FETCH_INTERRUPTED` | Error | HTTP fetch interrupted or failed | 1.8 |
| `REMOTE_NO_RANGE_SUPPORT` | Warning | Server does not support Range requests | 1.8 |
| `REMOTE_TLS_FAILED` | Fatal | TLS handshake failed | 1.8 |
| `REMOTE_DNS_FAILED` | Fatal | DNS resolution failed | 1.8 |
| `REMOTE_URL_PRIVATE_NETWORK` | Error | URL targets private network (SSRF protection) | 1.8 |

### GSTATE_* — Graphics State Errors

Errors related to graphics state operators.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `GSTATE_STACK_OVERFLOW` | Warning | Graphics state stack overflow | 3.1 |
| `GSTATE_STACK_UNDERFLOW` | Warning | Graphics state stack underflow | 3.1 |
| `GSTATE_BT_ET_MISMATCH` | Warning | Mismatched BT/ET pair | 3.1 |
| `CM_ARG_COUNT` | Warning | Invalid argument count for cm operator | 3.1 |
| `CM_DEGENERATE` | Warning | Degenerate matrix (det == 0 or NaN) | 3.1 |
| `HORIZ_SCALING_ZERO` | Warning | Horizontal scaling set to zero (Tz 0) | 3.1 |
| `TEXT_RENDERING_MODE_CLAMPED` | Warning | Text rendering mode clamped to valid range | 3.1 |
| `TSTAR_ZERO_LEADING` | Warning | T* operator when leading == 0 | 3.1 |
| `FONT_RESOURCE_NOT_FOUND` | Warning | Font resource not found | 3.1 |
| `FONT_SIZE_ZERO_OR_NEGATIVE` | Warning | Font size zero or negative | 3.1 |
| `BT_NESTED` | Warning | BT operator nested inside another BT block | 3.1 |
| `ET_WITHOUT_BT` | Warning | ET operator without matching BT | 3.1 |
| `TEXT_SHOW_OUTSIDE_BT` | Warning | Text-show operator outside BT/ET block | 3.1 |

### LAYOUT_* — Layout and Reading Order Errors

Errors related to layout analysis and reading order.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `TAGGED_PDF_STRUCT_TREE_DEFERRED` | Info | Tagged PDF StructTree deferred to Phase 7 | 4.5 |
| `LAYOUT_READING_ORDER_AMBIGUOUS` | Warning | Reading order may be incorrect | 4.5 |
| `LAYOUT_LOW_READABILITY` | Warning | Low readability score | 4.7 |

### MCP_* — MCP Server Errors

Errors related to MCP server operations.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `MCP_TOOL_INVALID_PARAMS` | Error | MCP tool call has invalid parameters | 6.7 |
| `MCP_PATH_TRAVERSAL` | Error | MCP path traversal attempt | 6.7 |

### CACHE_* — Cache Errors

Errors related to caching operations.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `CACHE_ENTRY_CORRUPT` | Warning | Cache entry is corrupted | 6.9 |
| `CACHE_WRITE_FAILED` | Warning | Cache write failed | 6.9 |

### MARKED_CONTENT_* — Marked Content Errors

Errors related to marked content operators.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `EMC_WITHOUT_BMC` | Info | EMC operator without matching BMC/BDC | 3.4 |
| `MARKED_CONTENT_DEPTH_EXCEEDED` | Info | Marked-content stack depth exceeded | 3.4 |
| `UNKNOWN_MARKED_CONTENT_PROPS` | Info | Unknown marked-content property name | 3.4 |
| `MCID_REDEFINED` | Info | MCID redefined in same scope | 3.4 |

### PROFILE_* — Profile Errors

Errors related to profile configuration.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `PROFILE_SECRETS_FORBIDDEN` | Error | Profile YAML contains forbidden secret keys | 7.10 |
| `PROFILE_INVALID` | Error | Profile YAML is invalid or malformed | 5.6.2 |

### REPAIR_* — Repair Recovery

Errors related to document repair operations.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `REPAIR_RESCUED_FROM_BACKWARDS_XREF` | Info | Xref repaired from backwards scan | 1.3 |

### SECURITY_* — Security Diagnostics

Security-related diagnostics.

| Code | Severity | Description | Phase |
|------|----------|-------------|-------|
| `JAVASCRIPT_PRESENT` | Info | JavaScript present in PDF (never executed) | 1.2 |

## Adding New Diagnostic Codes

When adding a new diagnostic code:

1. Choose a category prefix (STRUCT, STREAM, XREF, etc.)
2. Add the variant to the `DiagCode` enum in `crates/pdftract-core/src/diagnostics.rs`
3. Add the name mapping in `DiagCode::name()`
4. Add the category mapping in `DiagCode::category()`
5. Add the severity mapping in `DiagCode::severity()`
6. Add a catalog entry to `DIAGNOSTIC_CATALOG`
7. Add an entry to this document

**Code naming convention:** `CATEGORY_SPECIFIC_ISSUE` (SCREAMING_SNAKE_CASE)

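The naming convention can be checked mechanically before a new code is registered. The sketch below is one possible check in Python; the regular expression and both helper names are assumptions made for illustration and do not exist in pdftract. The leading token is only a heuristic for the category: some cataloged codes (for example `CM_ARG_COUNT`, listed under GSTATE) do not start with their section's prefix, so `DiagCode::category()` remains the authority.

```python
import re

# Assumed reading of SCREAMING_SNAKE_CASE: uppercase letters and digits,
# at least two underscore-separated segments, first character a letter.
CODE_PATTERN = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+")


def is_valid_code_name(name: str) -> bool:
    """Return True if `name` looks like CATEGORY_SPECIFIC_ISSUE."""
    return CODE_PATTERN.fullmatch(name) is not None


def leading_token(name: str) -> str:
    """Return the text before the first underscore (a category hint only)."""
    return name.split("_", 1)[0]
```

A check like this could run in a unit test over every name returned by `DiagCode::name()`, which would catch a lowercase or single-word code before it reaches the catalog.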
**Severity levels:**
- `Info`: does not affect output validity
- `Warning`: output is usable but degraded
- `Error`: output for this region/page is invalid; other regions are OK
- `Fatal`: extraction aborted, no usable output

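A consumer can turn these severity levels into simple decisions. The following Python sketch assumes the lowercase severity strings from the diagnostic format and an ordering of info < warning < error < fatal, which is inferred from the descriptions above and not stated explicitly. All function names are hypothetical.

```python
# Assumed ordering, inferred from the severity descriptions.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "fatal": 3}


def worst_severity(diagnostics):
    """Return the highest severity present, or None for an empty list."""
    if not diagnostics:
        return None
    return max((d["severity"] for d in diagnostics), key=SEVERITY_RANK.__getitem__)


def extraction_aborted(diagnostics):
    """A fatal diagnostic means there is no usable output."""
    return any(d["severity"] == "fatal" for d in diagnostics)


def invalid_pages(diagnostics):
    """Sorted 0-based pages that carry at least one error-level diagnostic."""
    return sorted({
        d["page_index"]
        for d in diagnostics
        if d["severity"] == "error" and d.get("page_index") is not None
    })


# Made-up diagnostics for illustration.
sample = [
    {"code": "FONT_GLYPH_UNMAPPED", "severity": "warning", "page_index": 0},
    {"code": "STREAM_BOMB", "severity": "error", "page_index": 2},
    {"code": "JAVASCRIPT_PRESENT", "severity": "info", "page_index": None},
]
```

With `sample`, only page 2 would be treated as invalid, while page 0 remains usable but degraded.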
## Programmatic Usage

Diagnostics can be consumed programmatically:

```python
import json

# `pdftract_output` holds the JSON text emitted by pdftract.
result = json.loads(pdftract_output)
for error in result.get('errors', []):
    code = error['code']
    severity = error['severity']
    page = error.get('page_index')

    if code == 'OCR_BROKENVECTOR_UNAVAILABLE':
        # Install Tesseract for OCR recovery
        print(f"Page {page}: Install Tesseract for OCR recovery")
```
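Beyond reacting to a single code, it is often useful to summarize a whole run. The sketch below counts diagnostics per code and collects any `hint` text per page. It assumes the same top-level `errors` array as the loop above; the helper names, the sample document, and the sample hint wording are invented for illustration.

```python
import json
from collections import Counter, defaultdict


def count_by_code(pdftract_output: str) -> Counter:
    """Count how often each diagnostic code appears in the output."""
    result = json.loads(pdftract_output)
    return Counter(d["code"] for d in result.get("errors", []))


def hints_by_page(pdftract_output: str) -> dict:
    """Map page_index (or None for document-level) to its non-empty hints."""
    result = json.loads(pdftract_output)
    pages = defaultdict(list)
    for d in result.get("errors", []):
        if d.get("hint"):
            pages[d.get("page_index")].append(d["hint"])
    return dict(pages)


# Hypothetical output; the hint text is not taken from pdftract.
sample_output = json.dumps({"errors": [
    {"code": "FONT_GLYPH_UNMAPPED", "severity": "warning", "page_index": 0, "hint": None},
    {"code": "FONT_GLYPH_UNMAPPED", "severity": "warning", "page_index": 1, "hint": None},
    {"code": "OCR_BROKENVECTOR_UNAVAILABLE", "severity": "warning", "page_index": 1,
     "hint": "Install Tesseract for OCR recovery"},
]})
```

Grouping by `page_index` keeps document-level diagnostics (where the field is `null`) under the `None` key instead of dropping them.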