pdftract/notes/pdftract-udo67.md
jedarden de4ec74b00 feat(pdftract-udo67): implement URL credential parsing
Add extract_url_credentials() function to parse HTTPS URLs with embedded
credentials (https://user:pass@host/path). Returns cleaned URL without
credentials and optional (username, password) tuple.

- Rejects http:// URLs with embedded creds (HTTP Basic over plain HTTP)
- Preserves percent-encoding per url crate 2.5 behavior
- Adds 9 unit tests covering all acceptance criteria

Closes: pdftract-udo67

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 02:15:16 -04:00

4.6 KiB

pdftract-udo67: URL credential parsing verification note

Summary

Implemented extract_url_credentials(url_str: &str) -> Result<(String, Option<(String, String)>)> function in crates/pdftract-core/src/url_validation.rs. The function parses HTTPS URLs with embedded credentials (e.g., https://user:pass@host/path) and returns a cleaned URL (without credentials) along with an optional credentials tuple.

Implementation

File modified: crates/pdftract-core/src/url_validation.rs

Function added:

#[cfg(feature = "remote")]
pub fn extract_url_credentials(url_str: &str) -> std::result::Result<(String, Option<(String, String)>), UrlValidationError>

Behavior:

  • Parses URLs using the url crate (version 2.5)
  • Returns (clean_url, Some((username, password))) if credentials present
  • Returns (clean_url, None) if no credentials
  • Rejects http:// URLs with embedded credentials (returns Err(UrlValidationError::InvalidScheme))
  • Preserves percent-encoding in credentials (per url crate 2.5 behavior; decoding happens at HTTP Basic auth layer)

Tests

All acceptance criteria PASS:

  1. extract_url_credentials("https://alice:secret@example.com/doc.pdf")("https://example.com/doc.pdf", Some(("alice", "secret")))
  2. extract_url_credentials("https://example.com/doc.pdf")("https://example.com/doc.pdf", None)
  3. extract_url_credentials("http://alice:secret@example.com/doc.pdf")Err (http with creds rejected)
  4. Empty password handled: https://alice@example.com/doc.pdfSome(("alice", ""))
  5. URL-encoded credentials preserved: https://alice%40example.com:secret@hostSome(("alice%40example.com", "secret"))
  6. Path and query preserved: https://user:pass@host/path?query=value#fragment → cleaned URL preserves path/query/fragment
  7. Invalid URL returns Err(UrlValidationError::InvalidUrl)
  8. http:// without credentials allowed (fails later in validation flow)

Test results:

running 17 tests
test url_validation::tests::test_extract_url_credentials_with_creds ... ok
test url_validation::tests::test_extract_url_credentials_without_creds ... ok
test url_validation::tests::test_extract_url_credentials_http_with_creds_rejected ... ok
test url_validation::tests::test_extract_url_credentials_empty_password ... ok
test url_validation::tests::test_extract_url_credentials_url_encoded ... ok
test url_validation::tests::test_extract_url_credentials_with_path_and_query ... ok
test url_validation::tests::test_extract_url_credentials_preserves_https_without_creds ... ok
test url_validation::tests::test_extract_url_credentials_http_without_creds_allowed ... ok
test url_validation::tests::test_extract_url_credentials_invalid_url ... ok
...
test result: ok. 17 passed; 0 failed

Compilation

  • cargo check --all-targets --features remote - compiled successfully with no errors
  • cargo clippy --package pdftract-core --lib --features remote -- -D warnings - no warnings for url_validation module

Notes

  • The url crate (2.5) is already a dependency behind the remote feature
  • The url_validation module is already feature-gated with #[cfg(feature = "remote")]
  • The function is ready for use by HttpRangeSource when it is implemented (Phase 1.8)
  • Per RFC 7617, percent-decoding of credentials should happen at the HTTP Basic auth encoding layer (base64), not during URL parsing
  • Future work: --header Authorization: ... flag override for user-explicit auth (mentioned in plan but not implemented in this bead)

Acceptance criteria status

  • extract_url_credentials("https://alice:secret@example.com/doc.pdf") → ("https://example.com/doc.pdf", Some(("alice", "secret")))
  • extract_url_credentials("https://example.com/doc.pdf") → ("https://example.com/doc.pdf", None)
  • extract_url_credentials("http://alice:secret@example.com/doc.pdf") → Err (http with creds rejected)
  • Cleaned URL appears in logs; original URL with creds NEVER appears (implementation note: this will be enforced by callers of this function)
  • Unit tests for malformed date string (returns Err), missing /Name (returns ""), missing /ByteRange (returns null coverage) — N/A (not applicable to this bead; these are from 7.3.2 metadata extraction)
  • URL-encoded credentials (https://alice%40example.com:secret@host/) handled correctly
  • Function is public and available for HttpRangeSource construction
  • HttpRangeSource construction with creds will add correct Authorization: Basic header (future implementation in Phase 1.8)
  • ⚠️ --header Authorization: Bearer xyz overrides URL-embedded creds (not implemented in this bead; CLI integration deferred)