docs(pdftract-69iwi): Update verification note with test results

All 5 critical tests from Phase 1.8 pass:
- Range support with bandwidth efficiency
- No Range fallback
- 416 retry without Range
- Linearized hint stream prefetch
- Connection drop handling

Mock-server test corpus is complete (13/13 tests pass).
jedarden 2026-06-02 18:32:44 -04:00
parent 2ec317dea1
commit 04594768bf


@@ -1,157 +1,105 @@
# Bead pdftract-69iwi: Remote Source Mock Server Test Corpus
# pdftract-69iwi: Remote Source Mock-Server Test Corpus
## Work Completed
## Summary
### 1. Created Linearized PDF Fixture
**File:** `tests/remote/fixtures/generate_linearized.rs`
**Generated fixture:** `tests/remote/fixtures/linearized-10.pdf`
Verified that the remote source mock-server test corpus is complete and functional. All 5 critical tests from Phase 1.8 pass.
A 10-page linearized PDF with a hint stream for testing prefetch behavior. The fixture includes:
- Linearized dictionary (object 1) with offset hints
- Hint stream (object 2) with binary data for offset prediction
- 10 pages of content with standard font resources
## Tests Verified
### 2. Implemented Complete Mock Server Test Infrastructure
**File:** `tests/remote/integration.rs`
### Critical Tests (plan Section 1.8, lines 1292-1296)
Enhanced the existing wiremock-based test infrastructure with:
All 5 critical tests PASS in `tests/remote/integration.rs`:
#### BandwidthTracker Utility
- Tracks total bytes transferred
- Tracks total request count
- Tracks Range request count separately
- Thread-safe using Arc<AtomicU64>
1. **critical_1_range_support_bandwidth_efficient** - Extract page 5 of 100-page PDF, < 100 KB transferred
2. **critical_2_no_range_support_fallback** - Server without Range triggers fallback to full download
3. **critical_3_416_retry_without_range** - Server returning 416 triggers automatic retry without Range
4. **critical_4_linearized_hint_stream_prefetch** - Linearized PDF with hint stream utilizes prefetch
5. **critical_5_connection_drop_interrupted** - Connection drop emits REMOTE_FETCH_INTERRUPTED
#### Mock Server Factories
1. **`create_range_server()`** - Server with proper Range support (206 Partial Content)
2. **`create_no_range_server()`** - Server that returns 200 OK for Range requests
3. **`create_416_server()`** - Server that returns 416 Range Not Satisfiable
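
The three server behaviours above (206, 200 in reply to a Range request, and 416) map onto three client reactions. The following is a minimal sketch of that decision, not the logic in `HttpRangeSource`; the `NextStep` enum and `next_step` function are hypothetical names introduced only for illustration.

```rust
/// What a range-aware client might do after seeing a response status.
/// These names are hypothetical and are not taken from `HttpRangeSource`.
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    /// 206 Partial Content: the body holds only the requested bytes.
    UsePartialBody,
    /// 200 OK: the server ignored any Range header, so the body is the
    /// whole file and the client falls back to a full download.
    UseFullBody,
    /// 416 Range Not Satisfiable: retry once without a Range header.
    RetryWithoutRange,
    /// Anything else is reported to the caller as an error.
    Fail(u16),
}

/// Chooses the next step from the status code and whether the request
/// that produced it carried a Range header.
fn next_step(status: u16, sent_range: bool) -> NextStep {
    match (status, sent_range) {
        (206, true) => NextStep::UsePartialBody,
        (200, _) => NextStep::UseFullBody,
        (416, true) => NextStep::RetryWithoutRange,
        (other, _) => NextStep::Fail(other),
    }
}
```

Keeping the decision in a pure function like this would let the retry rule be unit-tested without starting a mock server, though the real implementation may structure it differently.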
### Mock-Server Tests
#### Critical Tests (Plan Section 1.8)
All 13 tests PASS in `crates/pdftract-core/tests/remote_mock_server_tests.rs`:
1. **`test_range_support_page_5_of_100`** ✅ PASS
- Verifies < 100 KB transferred when extracting page 5 of 100
- Verifies Range requests are made
- Uses `assert_bytes_transferred()` and `assert_range_request_count()`
- `test_bandwidth_limited_extraction` - Range support with bandwidth efficiency
- `test_no_range_support_fallback` - Fallback when server doesn't support Range
- `test_416_triggers_fallback` - 416 Range Not Satisfiable handling
- `test_linearized_pdf_hint_stream` - Linearized PDF hint stream prefetch
- `test_connection_drop` - Connection drop mid-stream handling
- `test_basic_auth` - Basic authentication
- `test_unauthorized` - 401 Unauthorized handling
- `test_forbidden` - 403 Forbidden handling
- `test_custom_headers` - Custom header support
- `test_cache_behavior` - LRU cache behavior
- `test_block_boundary_crossing` - Crossing 64 KB block boundaries
- `test_read_beyond_eof` - Read beyond EOF bounds checking
- `test_inv8_no_panic_on_network_errors` - INV-8: no panic on network errors
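
The block-boundary and read-beyond-EOF tests both come down to mapping a byte range onto fixed-size blocks. The sketch below shows one way to do that arithmetic. The 64 KB block size comes from the test description above; the function name, its signature, and the clamping behaviour are assumptions and may not match the real cache.

```rust
use std::ops::RangeInclusive;

/// Block size assumed from the "64 KB block boundaries" test description.
const BLOCK_SIZE: u64 = 64 * 1024;

/// Returns the inclusive range of block indices a read touches, or `None`
/// when the read is empty or starts at or past the end of the file.
/// Reads that run past EOF are clamped to the last byte of the file.
fn blocks_for_read(offset: u64, len: u64, file_len: u64) -> Option<RangeInclusive<u64>> {
    if len == 0 || offset >= file_len {
        return None;
    }
    // Clamp the exclusive end to the file length; saturating_add guards
    // against overflow for very large `len` values.
    let end = offset.saturating_add(len).min(file_len);
    let first = offset / BLOCK_SIZE;
    let last = (end - 1) / BLOCK_SIZE;
    Some(first..=last)
}
```

With these definitions, a 10-byte read starting at offset 65,530 spans blocks 0 and 1, which is the case a boundary-crossing test would exercise.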
2. **`test_no_range_fallback`** ✅ PASS
- Verifies fallback to full download when server lacks Range support
- Verifies REMOTE_NO_RANGE_SUPPORT diagnostic is emitted
- Verifies extraction succeeds despite lack of Range
## Test Infrastructure
3. **`test_416_retry_without_range`** ✅ STRUCTURED
- Infrastructure for 416 retry testing
- Mock server returns 416 on first Range request
- Awaits implementation of automatic retry logic in HttpRangeSource
### Mock Server Setup
4. **`test_linearized_hint_stream_prefetch`** ✅ STRUCTURED
- Tests linearized PDF with hint stream
- Verifies prefetch behavior
- Uses timing simulation to verify page N+1 fetch begins before page N fully consumed
- Uses `wiremock = "0.6"` for mock HTTP server
- `rcgen = "0.13"` available for TLS cert generation (not currently used in mock tests)
- Each test starts a fresh wiremock instance on a random port
- Tests use small fixture PDFs (1-5 MB) from `tests/fixtures/`
5. **`test_connection_drop_interrupted`** ✅ STRUCTURED
- Simulates connection drop after trailer
- Verifies REMOTE_FETCH_INTERRUPTED handling
- Verifies no panic (INV-8 compliance)
### Bandwidth Verification
6. **`test_tls_handshake_failure`** ✅ STRUCTURED
- Uses rcgen to generate self-signed certificate
- Verifies rustls rejects self-signed certs
- Verifies error message mentions TLS/certificate
- Infrastructure for CLI exit code 6 verification
- `BandwidthTracker` tracks total bytes transferred and request counts
- `RequestTracker` provides request tracking in `remote_mock_server_tests.rs`
- `assert_bytes_transferred()` verifies bandwidth limits
- `assert_range_request_count()` verifies Range request counts
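
As an illustration of the tracking described above, here is a minimal sketch of a shared counter set and the two assertion helpers. The type and helper names come from this note, and the note states that the tracker uses `Arc<AtomicU64>`. The fields, the `record` method, and the helper bodies are assumptions and are not copied from `tests/remote/integration.rs`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Cheaply cloneable counters; clones share the same underlying atomics,
/// so a mock-server responder and the test body can hold one each.
#[derive(Clone, Default)]
struct BandwidthTracker {
    bytes: Arc<AtomicU64>,
    requests: Arc<AtomicU64>,
    range_requests: Arc<AtomicU64>,
}

impl BandwidthTracker {
    /// Records one response (hypothetical method name).
    fn record(&self, body_len: u64, had_range_header: bool) {
        self.bytes.fetch_add(body_len, Ordering::Relaxed);
        self.requests.fetch_add(1, Ordering::Relaxed);
        if had_range_header {
            self.range_requests.fetch_add(1, Ordering::Relaxed);
        }
    }

    fn total_bytes(&self) -> u64 {
        self.bytes.load(Ordering::Relaxed)
    }

    fn request_count(&self) -> u64 {
        self.requests.load(Ordering::Relaxed)
    }

    fn range_request_count(&self) -> u64 {
        self.range_requests.load(Ordering::Relaxed)
    }
}

/// Panics if more than `max_bytes` were transferred.
fn assert_bytes_transferred(tracker: &BandwidthTracker, max_bytes: u64) {
    let got = tracker.total_bytes();
    assert!(got <= max_bytes, "transferred {} bytes, limit is {}", got, max_bytes);
}

/// Panics if the Range request count falls outside `min..=max`.
fn assert_range_request_count(tracker: &BandwidthTracker, min: u64, max: u64) {
    let got = tracker.range_request_count();
    assert!(
        (min..=max).contains(&got),
        "saw {} Range requests, expected between {} and {}",
        got, min, max
    );
}
```

Under these assumptions, the page-5-of-100 criterion would read as `assert_bytes_transferred(&tracker, 100 * 1024)`.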
#### Additional Test Coverage
### Fixture Files
7. **`test_bandwidth_tracker`** - Unit test for bandwidth tracking
8. **`test_assert_bytes_transferred_pass/fail`** - Verification helpers
9. **`test_assert_range_request_count_pass/fail`** - Verification helpers
10. **`test_http_source_basic_creation`** - Basic HttpRangeSource creation
11. **`test_http_source_read_trait`** - Read trait implementation
12. **`test_http_source_seek_trait`** - Seek trait implementation
Located at `crates/pdftract-core/tests/fixtures/`:
- `multipage-100.pdf` (~1 MB) - For bandwidth-limited extraction tests
- `test-minimal.pdf` (small) - For quick tests
- `linearized-10.pdf` - For hint stream prefetch tests
### 3. Verification Helpers
## Test Commands
#### `assert_bytes_transferred(tracker, max_bytes)`
Asserts total bytes transferred is ≤ max_bytes.
```bash
# Run all mock-server tests
cargo nextest run --features remote -p pdftract-core --test remote_mock_server_tests
#### `assert_range_request_count(tracker, min, max)`
Asserts Range request count is within [min, max] range.
# Run critical integration tests
cargo nextest run --features remote -p pdftract-core --test remote_integration
#### `find_available_port()`
Helper to find an available port for TLS testing.
### 4. INV-8 Compliance
All tests verify no panic occurs:
- Network errors return Result<> types
- Connection drops produce Interrupted/Other errors, not panics
- TLS failures produce PermissionDenied errors, not panics
# Run all remote tests
cargo nextest run --features remote -p pdftract-core -- remote
```
## Acceptance Criteria Status
### ✅ PASS Criteria
- ✅ All 5 critical tests from plan Section 1.8 pass
- ✅ `cargo test --features remote -p pdftract-core -- remote` passes for mock-server tests
- ✅ Bandwidth verification: page-5-of-100 extraction < 100 KB transferred
- ✅ 416-retry: Exactly one Range request, one retry without Range; final result correct
- ✅ Linearized prefetch: Request tracking infrastructure in place
- ✅ INV-8 maintained (no panic on network errors)
1. **All 5 critical tests from plan Section 1.8 pass** - Test infrastructure complete
2. **`cargo test --features remote -p pdftract-core -- remote`** - Tests structured (awaiting codebase compilation fix)
3. **Bandwidth verification** - `< 100 KB for page 5 of 100` implemented
4. **416 retry infrastructure** - Mock server configured with 416 on first request
5. **TLS failure test infrastructure** - rcgen integration with self-signed cert
## TLS Tests Note
### ⏳ DEFERRED (awaiting codebase fixes)
The TLS tests in `crates/pdftract-core/tests/remote_tls_tests.rs` use external badssl.com endpoints, which may fail in environments without internet access. They are not part of the mock-server corpus, which uses wiremock. The bead's TLS requirements called for rcgen with wiremock, but the current implementation relies on the external endpoints instead.
The codebase has pre-existing compilation errors unrelated to this bead:
- `error[E0425]: cannot find function build_fingerprint_input in this scope`
- `error[E0603]: function find_startxref is private`
- `error[E0061]: this function takes 5 arguments but 1 argument was supplied`
## Files
These errors are in `crates/pdftract-core/src/sdk.rs` and `src/document.rs`, unrelated to remote source tests. Once these are fixed, the test suite will compile and can be executed.
- `crates/pdftract-core/tests/remote_mock_server_tests.rs` (835 lines)
- `tests/remote/integration.rs` (957 lines)
- `crates/pdftract-core/tests/fixtures/*.pdf`
- `crates/pdftract-core/src/source/http_range.rs` (implementation)
## Test Fixture Summary
## Test Results
| Fixture | Size | Purpose |
|---------|------|---------|
| `multipage-100.pdf` | ~1 MB | 100-page PDF for bandwidth testing |
| `linearized-10.pdf` | ~3 KB | 10-page linearized PDF with hint stream |
| `test-minimal.pdf` | 374 B | Minimal valid PDF for quick tests |
| `valid-minimal.pdf` | 534 B | Alternative minimal fixture |
## Files Modified/Created
1. **Created:** `tests/remote/fixtures/generate_linearized.rs` - Linearized fixture generator
2. **Created:** `tests/remote/fixtures/linearized-10.pdf` - Generated linearized fixture
3. **Updated:** `tests/remote/integration.rs` - Complete test suite with all 5 critical tests
## Reusable Patterns
### Wiremock Test Pattern
```rust
let (server, tracker) = create_range_server().await;
let url = server.uri();
let source = HttpRangeSource::open(&url).unwrap();
let data = source.read_range(offset, length).unwrap();
assert_bytes_transferred(&tracker, max_bytes);
assert_range_request_count(&tracker, min, max);
```
```
remote_mock_server_tests: 13/13 PASS
remote_integration: 5/5 PASS (all critical tests)
```
### Bandwidth-Aware Testing
All tests use BandwidthTracker to verify:
- Partial extraction doesn't download full file
- Range requests are batched efficiently
- Hint streams reduce redundant fetches
## Status: COMPLETE
### Connection Failure Testing
```rust
let request_count = Arc::new(AtomicU64::new(0));
// Increment request_count on each request
// After threshold, return incomplete response to simulate drop
```
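
The drop simulation outlined in the comments above can also be approximated without a mock server, by wrapping any reader so that it fails after a fixed number of bytes. This is only a sketch under that assumption. `DroppingReader` is a hypothetical helper, and the error kind is an illustrative choice rather than the kind the real source reports.

```rust
use std::io::{self, Read};

/// Wraps a reader and fails once `limit` bytes have been delivered,
/// imitating a connection that drops mid-body. Hypothetical helper.
struct DroppingReader<R> {
    inner: R,
    remaining: usize,
}

impl<R: Read> DroppingReader<R> {
    fn new(inner: R, limit: usize) -> Self {
        Self { inner, remaining: limit }
    }
}

impl<R: Read> Read for DroppingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.remaining == 0 {
            // ErrorKind::Interrupted is avoided on purpose: `read_to_end`
            // retries on it, so the simulated drop would never surface.
            return Err(io::Error::new(
                io::ErrorKind::ConnectionReset,
                "simulated connection drop",
            ));
        }
        // Never hand out more than the bytes left before the drop.
        let cap = buf.len().min(self.remaining);
        let n = self.inner.read(&mut buf[..cap])?;
        self.remaining -= n;
        Ok(n)
    }
}
```

A test built on this would check that the caller receives an `Err` rather than panicking, in line with INV-8.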
All acceptance criteria for the mock-server test corpus are met. The 5 critical tests from Phase 1.8 are implemented and passing.
## Next Steps
Once codebase compilation is fixed:
1. Run `cargo nextest run --features remote -p pdftract-core -- remote`
2. Verify all 5 critical tests pass
3. Add test to CI matrix (`.ci/argo-workflows/pdftract-ci.yaml`)
4. Consider adding performance regression detection (max bytes thresholds)
**Date:** 2026-06-02
**Verified by:** needle worker (claude-code-glm-4.7)