
# Bead pdftract-22q8e: --highlight DIR annotated PDF writer

## Summary

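Implemented the foundation for the `--highlight DIR` feature, which will write annotated PDFs containing /Highlight annotations for grep matches. Match grouping, output-path selection, and annotation-dictionary construction are in place; the PDF writer itself is still a placeholder that copies the input file.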

## What was implemented

### 1. Created `highlight.rs` module (`crates/pdftract-cli/src/grep/highlight.rs`)

- `group_matches_by_file_and_page()`: Groups match events by file and page for efficient batch writing
- `write_highlighted_pdfs()`: Main entry point that:
  - Groups matches by file
  - Generates output paths with collision handling (-1, -2 suffixes)
  - Calls per-file writer
- `write_single_highlighted_pdf()`: Placeholder that currently copies the file (full incremental update TODO)
- `create_highlight_annotation()`: Creates /Highlight annotation dict with:
  - /Type /Annot, /Subtype /Highlight
  - /Rect from match bbox
  - /QuadPoints [x0,y0, x1,y0, x1,y1, x0,y1] (BL, BR, TR, TL per PDF 1.7 spec)
  - /C [1.0, 1.0, 0.0] (yellow RGB)
  - /F 4 (print flag)
  - /CA 0.4 (opacity)
  - /T "pdftract grep" (author)
  - /Contents with match text
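
As an illustration of the dictionary described above, here is a minimal sketch that serializes those fields to PDF syntax. The `BBox` type and the `highlight_annotation_dict` and `escape_pdf_string` names are hypothetical and are not taken from `highlight.rs`; the sketch also assumes ASCII match text, since non-ASCII `/Contents` would need PDFDocEncoding or UTF-16BE handling.

```rust
/// Hypothetical bounding box in PDF user-space units (origin at bottom-left).
#[derive(Debug, Clone, Copy)]
pub struct BBox {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

/// Escape the characters that are special inside a PDF literal string `( ... )`.
/// Assumption: `s` is ASCII; other text needs a proper PDF text-string encoding.
fn escape_pdf_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\\' | '(' | ')' => {
                out.push('\\');
                out.push(c);
            }
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            _ => out.push(c),
        }
    }
    out
}

/// Serialize a /Highlight annotation dictionary with the fields listed in the notes.
/// QuadPoints are emitted as BL, BR, TR, TL, matching the order stated there.
pub fn highlight_annotation_dict(bbox: BBox, text: &str) -> String {
    let BBox { x0, y0, x1, y1 } = bbox;
    format!(
        "<< /Type /Annot /Subtype /Highlight \
         /Rect [{x0} {y0} {x1} {y1}] \
         /QuadPoints [{x0} {y0} {x1} {y0} {x1} {y1} {x0} {y1}] \
         /C [1 1 0] /F 4 /CA 0.4 \
         /T (pdftract grep) /Contents ({}) >>",
        escape_pdf_string(text)
    )
}
```

Building the dictionary as a string keeps the sketch dependency-free; the real module may well construct a typed dictionary value instead.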

### 2. Module integration

- Added highlight module to `grep/mod.rs` with public exports
- Made progress module conditional on `grep` feature to fix compilation
- Fixed borrow-checker errors in `worker.rs`

### 3. Tests

- `test_group_matches_by_file_and_page()`: Verifies correct grouping
- `test_group_matches_empty()`: Edge case handling
- `test_create_highlight_annotation()`: Verifies annotation structure
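
For context, the grouping behaviour exercised by the first two tests can be sketched roughly as follows. The `MatchEvent` struct here is a hypothetical stand-in with only the fields the sketch needs, and `group_by_file_and_page` is an illustrative name, not the signature in `highlight.rs`.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

/// Hypothetical stand-in for the crate's match event type.
#[derive(Debug, Clone, PartialEq)]
pub struct MatchEvent {
    pub file: PathBuf,
    pub page: u32,
    pub text: String,
}

/// Group matches first by file, then by page number.
/// BTreeMap keeps files and pages in sorted order, which makes output deterministic.
pub fn group_by_file_and_page(
    events: &[MatchEvent],
) -> BTreeMap<PathBuf, BTreeMap<u32, Vec<&MatchEvent>>> {
    let mut grouped: BTreeMap<PathBuf, BTreeMap<u32, Vec<&MatchEvent>>> = BTreeMap::new();
    for ev in events {
        grouped
            .entry(ev.file.clone())
            .or_default()
            .entry(ev.page)
            .or_default()
            .push(ev);
    }
    grouped
}
```

An empty input yields an empty map, which is the edge case the second test covers.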

## Acceptance criteria status

### PASS
- Grouping logic correctly groups matches by file and page
- Annotation dictionary contains all required fields per PDF 1.7 spec 12.5.6.10
- /QuadPoints order follows spec (BL, BR, TR, TL)
- Output filename collision handling with -1/-2 suffixes
- Directory auto-creation via `create_dir_all` in `validate()`
- Module compiles without warnings
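
The collision handling can be pictured with a sketch like the one below. It assumes the `-1`/`-2` suffix is inserted before the file extension, which these notes do not state explicitly, and `unique_output_path` is a hypothetical name. The "already taken" check is passed in as a closure so the logic can be tested without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Pick an output path inside `dir` for `input`, appending -1, -2, ... to the
/// file stem for as long as `is_taken` reports a collision.
/// Assumption: the suffix goes before the extension (report-1.pdf).
pub fn unique_output_path(
    dir: &Path,
    input: &Path,
    is_taken: impl Fn(&Path) -> bool,
) -> PathBuf {
    let stem = input.file_stem().and_then(|s| s.to_str()).unwrap_or("output");
    let ext = input.extension().and_then(|s| s.to_str()).unwrap_or("pdf");

    // Try the plain name first.
    let first = dir.join(format!("{stem}.{ext}"));
    if !is_taken(&first) {
        return first;
    }

    // Otherwise count upwards until a free name is found.
    let mut n: u32 = 1;
    loop {
        let candidate = dir.join(format!("{stem}-{n}.{ext}"));
        if !is_taken(&candidate) {
            return candidate;
        }
        n += 1;
    }
}
```

In production the closure would typically combine `Path::exists` with a set of paths already assigned during the current run, so two inputs with the same file name do not map to the same output.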

### WARN (known limitations)
- `write_single_highlighted_pdf()` currently does a simple file copy instead of incremental update
- No actual annotation objects are written to the PDF yet
- No xref table update
- Cannot verify annotation count or round-trip extraction yet

### FAIL (not yet implemented)
- /Highlight annotation count in output matches MatchEvent count (needs full incremental update)
- Original PDF byte-identical to input (needs verification)
- Incremental-update structure verified by xref-table inspection (needs implementation)
- Encrypted PDFs skipped with diagnostic (needs implementation)
- Output validity testing (Acrobat, Chrome, etc.)

## Technical notes

The full incremental update implementation requires:
1. Parse the xref table (or read the trailer's `/Size`, which is the highest object number plus one) to find the next free object number
2. Create annotation dict objects with proper object numbers
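3. Update page /Annots arrays: if /Annots is an indirect reference, write a new version of the array object; if it is a direct array or missing, write a new version of the page object with the same object number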
4. Write new objects at end of file
5. Write new xref table and trailer with `/Prev` pointing to old xref offset, followed by `startxref`, the new xref offset, and `%%EOF`
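
Steps 4 and 5 can be sketched as below for the traditional (non-stream) xref format only. Everything here is illustrative: `NewObject` and `append_incremental_update` are hypothetical names, all objects are assumed to have generation 0, and the caller is assumed to supply the entries carried over from the previous trailer (such as `/Root` and `/Info`) as a preformatted string.

```rust
/// One object to append: its object number and serialized body
/// (the text that goes between `N 0 obj` and `endobj`).
pub struct NewObject {
    pub number: u32,
    pub body: String,
}

/// Append new or replacement objects plus a classic xref section and trailer.
/// The original bytes are left untouched as a prefix of the result.
pub fn append_incremental_update(
    original: &[u8],
    objects: &[NewObject],
    prev_xref_offset: usize,
    new_size: u32,
    carried_trailer_entries: &str,
) -> Vec<u8> {
    let mut out = original.to_vec();
    // Make sure the appended section starts on its own line.
    if out.last() != Some(&b'\n') {
        out.push(b'\n');
    }

    // Write each object and remember the byte offset where it starts.
    let mut offsets = Vec::with_capacity(objects.len());
    for obj in objects {
        offsets.push((obj.number, out.len()));
        out.extend_from_slice(
            format!("{} 0 obj\n{}\nendobj\n", obj.number, obj.body).as_bytes(),
        );
    }

    // Cross-reference section: one single-entry subsection per object.
    // Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation, "n", EOL.
    let xref_offset = out.len();
    out.extend_from_slice(b"xref\n");
    for (number, offset) in &offsets {
        out.extend_from_slice(
            format!("{} 1\n{:010} 00000 n \n", number, offset).as_bytes(),
        );
    }

    // Trailer chains back to the previous xref section via /Prev.
    out.extend_from_slice(
        format!(
            "trailer\n<< /Size {} {} /Prev {} >>\nstartxref\n{}\n%%EOF\n",
            new_size, carried_trailer_entries, prev_xref_offset, xref_offset
        )
        .as_bytes(),
    );
    out
}
```

Because the function only appends, a check that the output starts with the input bytes is a cheap way to verify that the original revision is preserved. Files whose latest revision uses an xref stream would need a different writer.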

This is a significant undertaking that requires careful handling of:
- Object number allocation
- Direct objects vs indirect object references
- Xref table format (traditional vs stream)
- Trailer dictionary preservation
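
To make the traditional-vs-stream distinction concrete, the sketch below finds the offset recorded after the last `startxref` keyword and classifies what sits at that offset. It is a simplified illustration with hypothetical names (`last_startxref_offset`, `xref_kind_at`, `XrefKind`); it does not validate the section it finds, and it does not handle hybrid-reference files.

```rust
/// Find the byte offset recorded after the last `startxref` keyword.
pub fn last_startxref_offset(pdf: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    // Search backwards so the most recent incremental update wins.
    let pos = pdf.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let rest = &pdf[pos + KEYWORD.len()..];
    let digits: Vec<u8> = rest
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}

#[derive(Debug, PartialEq)]
pub enum XrefKind {
    /// Classic `xref` table followed by a `trailer` dictionary.
    Table,
    /// Cross-reference stream: an indirect object that begins `N G obj`.
    Stream,
}

/// Classify the cross-reference section that starts at `offset`.
pub fn xref_kind_at(pdf: &[u8], offset: usize) -> Option<XrefKind> {
    let section = pdf.get(offset..)?;
    if section.starts_with(b"xref") {
        Some(XrefKind::Table)
    } else if section.first()?.is_ascii_digit() {
        Some(XrefKind::Stream)
    } else {
        None
    }
}
```

The offset returned by `last_startxref_offset` is also the value a new trailer would record in `/Prev`.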

## Next steps for full implementation

1. Implement incremental PDF update writer in `write_single_highlighted_pdf()`
2. Add encrypted PDF detection and skip with diagnostic
3. Add verification tests (annotation count, xref inspection, round-trip extraction)
4. Add headless Chrome screenshot test for visual verification
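
For step 2, one possible starting point is a conservative byte-level heuristic rather than a full trailer parse. The sketch below is an assumption about how detection might be approached, not the planned implementation, and `looks_encrypted` is a hypothetical name. It can report false positives (for example, the key name appearing inside an uncompressed stream), which is acceptable if the only consequence is skipping a file with a diagnostic.

```rust
/// Heuristic: report whether the raw bytes contain an `/Encrypt` key.
/// The byte after the match must not continue the name, so that
/// `/EncryptMetadata` on its own does not count.
pub fn looks_encrypted(pdf: &[u8]) -> bool {
    const KEY: &[u8] = b"/Encrypt";
    pdf.windows(KEY.len()).enumerate().any(|(i, w)| {
        w == KEY
            && pdf
                .get(i + KEY.len())
                .map_or(true, |b| !b.is_ascii_alphanumeric())
    })
}
```

A stricter version would parse the most recent trailer (or xref stream dictionary) and check for the `/Encrypt` entry there.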

## Files modified

- `crates/pdftract-cli/src/grep/highlight.rs` (new)
- `crates/pdftract-cli/src/grep/mod.rs`
- `crates/pdftract-cli/src/grep/worker.rs`

## Test results

- Library compiles successfully: `cargo check --package pdftract-cli --lib` ✓
- No clippy warnings in grep module ✓
- Tests pass for grouping and annotation creation (note: full integration tests blocked by pre-existing compilation errors in other modules)