feat(pdftract-3s2i): implement Phase 5.5.2 validation filter

Implement per-word validation filter for assisted-OCR BrokenVector path.

Changes:
- Add SpanSource::OcrAssisted variant to hybrid.rs
- Add Span::ocr_assisted() helper method
- Implement validate_ocr_with_position_hints() in ocr.rs
  - 5pt distance threshold for position validation
  - 0.4 confidence cap for rejected words
  - Linear scan for nearest-neighbor lookup
- Add unit tests for validation filter
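
The filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in ocr.rs: the `OcrWord` and `PositionHint` types, the constant names, and `validate_words` are hypothetical, and only the 5pt threshold, the 0.4 cap, and the linear nearest-neighbor scan come from this message. It also assumes rejected words are kept with a capped confidence rather than dropped.

```rust
/// Hypothetical OCR word: text, centre position in points, and confidence.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrWord {
    pub text: String,
    pub x: f32,
    pub y: f32,
    pub confidence: f32,
}

/// Hypothetical position hint recovered from the broken vector layer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PositionHint {
    pub x: f32,
    pub y: f32,
}

/// Values quoted in the commit message; the constant names are assumptions.
pub const MAX_HINT_DISTANCE_PT: f32 = 5.0;
pub const REJECTED_CONFIDENCE_CAP: f32 = 0.4;

/// Linear scan: Euclidean distance from `word` to its nearest hint,
/// or `None` when there are no hints at all.
fn nearest_hint_distance(word: &OcrWord, hints: &[PositionHint]) -> Option<f32> {
    hints
        .iter()
        .map(|h| ((word.x - h.x).powi(2) + (word.y - h.y).powi(2)).sqrt())
        .min_by(|a, b| a.total_cmp(b))
}

/// Keeps every word, but caps the confidence of words that have no hint
/// within `MAX_HINT_DISTANCE_PT`.
pub fn validate_words(words: &[OcrWord], hints: &[PositionHint]) -> Vec<OcrWord> {
    words
        .iter()
        .map(|word| {
            let accepted = matches!(
                nearest_hint_distance(word, hints),
                Some(d) if d <= MAX_HINT_DISTANCE_PT
            );
            let mut out = word.clone();
            if !accepted {
                out.confidence = out.confidence.min(REJECTED_CONFIDENCE_CAP);
            }
            out
        })
        .collect()
}
```

A linear scan costs O(words x hints) per page, which is usually acceptable for page-sized inputs; a spatial index would only pay off for much larger hint sets.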

Closes: pdftract-3s2i

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jedarden 2026-05-24 04:57:17 -04:00
parent 450e2f2df5
commit e6bf3dd290
129 changed files with 9284 additions and 4076 deletions

1
.marathon/.gitignore vendored Normal file

@ -0,0 +1 @@
logs/

104
.marathon/instruction.md Normal file

@ -0,0 +1,104 @@
# pdftract — Marathon Coding Instruction
You are an autonomous Rust developer implementing **pdftract**, a PDF text-extraction
tool (Rust core + PyO3 bindings + CLI with a `--serve` mode). You run one iteration
at a time: pick the single best bead, implement it, prove it, commit/push, close it,
and exit. The loop restarts you for the next bead.
## Authoritative sources (read before coding)
- **Plan — the source of truth:** `/home/coding/pdftract/docs/plan/plan.md`
(~3,825 lines, schema_version 1.0). Every bead description references plan line
ranges. Read the referenced section before you write code. If the code contradicts
the plan, the code is wrong.
- **Repo conventions:** `/home/coding/pdftract/CLAUDE.md` — this workspace uses
**`bf`** (bead-forge), not stock `br`. It overrides the parent `~/CLAUDE.md`'s
beads-recovery patterns.
- **Environment:** `/home/coding/CLAUDE.md` — Argo CI on iad-ci, kubectl-proxy,
ArgoCD, ADB. Still applies.
## Working directory
`/home/coding/pdftract`
## Each iteration
### 1. Sync and find work
```bash
cd /home/coding/pdftract
git pull --ff-only || git pull --rebase # if the branch diverged, rebase local work
bf ready --limit 5 # unblocked beads, ranked by impact-weighted score
```
The `float` column is critical-path slack: `float=0` = on the critical path (no slack),
larger = more slack. **Prefer low-float, high-priority beads.** Dependency direction is
canonical: epics/coordinators depend on their leaf tasks and close LAST — work leaves first.
If a bead was attempted before (check `git log` for its ID), continue from the prior
work rather than starting over.
### 2. Claim
```bash
bf claim <bead-id> --model claude-code-glm-4.7 --harness needle --harness-version marathon
```
### 3. Implement
1. `bf show <bead-id>` — read the full description + acceptance criteria.
2. Read the referenced section of `plan.md`.
3. Read the existing source under `crates/` / `src/` before modifying it.
4. Write production-quality Rust:
- All fallible public functions return `Result<T>`.
- **No `unwrap()` / `expect()` in non-test code.**
- Exhaustive `match` arms on enums — no catch-all `_` on outcome types.
- Add unit tests in `#[cfg(test)]` modules.
5. Gates — all must pass before you commit:
```bash
cargo check --all-targets
cargo clippy --all-targets -- -D warnings
cargo fmt
cargo nextest run # (or `cargo test` if nextest unavailable)
```
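
As a rough illustration of the Rust rules in step 3, the sketch below shows a fallible public function that returns `Result<T>` without `unwrap()` or `expect()`, plus an exhaustive `match` over an outcome enum. All names here (`ParseError`, `Outcome`, `parse_page_number`, `describe`) and the local `Result` alias are hypothetical; the project may use a different error type.

```rust
use std::fmt;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    Empty,
    NotANumber(String),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::Empty => write!(f, "input is empty"),
            ParseError::NotANumber(s) => write!(f, "not a number: {}", s),
        }
    }
}

impl std::error::Error for ParseError {}

/// Local alias so public signatures read as `Result<T>` (an assumption).
pub type Result<T> = std::result::Result<T, ParseError>;

/// Fallible public function: errors are returned, never unwrapped.
pub fn parse_page_number(input: &str) -> Result<u32> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return Err(ParseError::Empty);
    }
    trimmed
        .parse::<u32>()
        .map_err(|_| ParseError::NotANumber(trimmed.to_string()))
}

/// Hypothetical outcome type.
pub enum Outcome {
    Extracted(u32),
    Skipped,
    Failed(ParseError),
}

/// Every variant is named explicitly; there is no catch-all `_` arm, so
/// adding a variant later becomes a compile error here.
pub fn describe(outcome: &Outcome) -> String {
    match outcome {
        Outcome::Extracted(page) => format!("extracted page {}", page),
        Outcome::Skipped => "skipped".to_string(),
        Outcome::Failed(err) => format!("failed: {}", err),
    }
}
```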
### 4. Commit, push, close
```bash
git add <specific paths you changed>
git commit -m "<type>(<scope>): <short summary>" # body: key decisions + Closes: <bead-id>
git push
```
**Closing a bead — `bf close` is BROKEN** (returns `Error: Query returned no rows`).
Use `bf batch` instead, with a substantive reason citing the commits, the verification
note path, and the test fixtures exercised:
```bash
bf batch --json '[{"op":"close","id":"pdftract-XXX","reason":"<commits + tests + acceptance notes>"}]'
# Expected: [op 0] ok
```
### 5. End the iteration
**One bead per iteration.** Then exit — the loop restarts you.
## Hard rules
- **The plan is the source of truth.** Disagreement between your intuition and the plan
means the intuition is wrong for *this project*. Genuine gaps → open a
`plan-gap: <title>` bead and continue.
- **NEVER `git stash -u`, `git stash --include-untracked`, or `git clean`.** A
pre-commit provenance hook over `tests/fixtures` blocks ALL commits if a fixture
goes missing; these commands sweep untracked fixtures. Keep fixtures tracked.
- **Never force-push. Never `--no-verify`. Never skip hooks.**
- **Never edit `.beads/` files directly** (issues.jsonl, beads.db). Use `bf` only.
- **No GitHub Actions, no K8s Jobs/CronJobs, no direct `kubectl apply`.** CI is Argo
Workflows on iad-ci; K8s YAML goes to `jedarden/declarative-config` via PR.
- **Always compile.** Never leave the repo broken. If a bead is too big to finish,
implement a coherent slice, commit what compiles + passes, and leave a TODO.
## Done
The genesis bead `pdftract-qkc77` closes when all 13 epic beads close. Each epic closes
only after its sub-phase coordinators and leaf tasks close.

91
.marathon/start.sh Executable file

@ -0,0 +1,91 @@
#!/usr/bin/env bash
# pdftract Marathon Launcher — claude-code @ GLM-4.7 via ZAI proxy
#
# Runs the central marathon-coding skill in a dedicated tmux session against this
# repo. Each iteration reads .marathon/instruction.md and invokes headless
# claude-code routed through the ZAI proxy, mirroring the live NEEDLE
# claude-code-glm-4.7 agent.
#
# Usage:
# ./.marathon/start.sh # session "pdftract-marathon"
# ./.marathon/start.sh <session-name> # custom session name
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_DIR="$(dirname "$SCRIPT_DIR")"
MARATHON_SKILL="/home/coding/claude-config/skills/marathon-coding"
INSTRUCTION_FILE="$SCRIPT_DIR/instruction.md"
LOG_DIR="$SCRIPT_DIR/logs"
SESSION_NAME="${1:-pdftract-marathon}"
# ZAI proxy — CURRENT endpoint is the apexalgo-iad Traefik vpn-entrypoint, NOT the
# decommissioned ardenone-hub proxy that older repos' start.sh scripts point at.
# This mirrors the env of the live `claude-code-glm-4.7` NEEDLE agent.
ZAI_BASE_URL="https://traefik-apexalgo-iad.tail1b1987.ts.net:8444"
command -v tmux >/dev/null 2>&1 || { echo "Error: tmux not installed" >&2; exit 1; }
[ -x "$MARATHON_SKILL/launcher.sh" ] || { echo "Error: marathon launcher missing: $MARATHON_SKILL/launcher.sh" >&2; exit 1; }
[ -f "$INSTRUCTION_FILE" ] || { echo "Error: instruction file missing: $INSTRUCTION_FILE" >&2; exit 1; }
if tmux has-session -t "$SESSION_NAME" 2>/dev/null; then
echo "Session '$SESSION_NAME' already exists."
echo " Attach: tmux attach -t $SESSION_NAME"
echo " Kill: tmux kill-session -t $SESSION_NAME"
exit 1
fi
# Guard against running concurrently with a NEEDLE worker on the same worktree.
if pgrep -f "needle run --workspace $REPO_DIR" >/dev/null 2>&1; then
echo "Error: a NEEDLE worker is running against $REPO_DIR." >&2
echo " Marathon + NEEDLE share one git worktree → contention." >&2
echo " Stop it first: needle stop -i <identifier>" >&2
exit 1
fi
# Preflight: any HTTP response = proxy is up; only a connection failure aborts.
if ! curl -sk --max-time 8 -o /dev/null "$ZAI_BASE_URL"; then
echo "Error: ZAI proxy at $ZAI_BASE_URL is unreachable." >&2
echo " Check Tailscale + the proxy on apexalgo-iad." >&2
exit 1
fi
mkdir -p "$LOG_DIR"
LOOP_CMD="cd '$REPO_DIR' && \
unset CLAUDECODE && \
export NODE_TLS_REJECT_UNAUTHORIZED=0 && \
export ANTHROPIC_BASE_URL='$ZAI_BASE_URL' && \
export ANTHROPIC_AUTH_TOKEN='proxy-handles-auth' && \
export ANTHROPIC_MODEL='glm-4.7' && \
export ANTHROPIC_DEFAULT_OPUS_MODEL='glm-4.7' && \
export ANTHROPIC_DEFAULT_SONNET_MODEL='glm-4.7' && \
export ANTHROPIC_DEFAULT_HAIKU_MODEL='glm-4.7' && \
export CLAUDE_CODE_SUBAGENT_MODEL='glm-4.7' && \
export API_TIMEOUT_MS='900000' && \
export DISABLE_AUTOUPDATER=1 && \
export DISABLE_TELEMETRY=1 && \
'$MARATHON_SKILL/launcher.sh' \
--prompt '$INSTRUCTION_FILE' \
--model glm-4.7 \
--delay 10 \
--log-dir '$LOG_DIR'"
echo "╔══════════════════════════════════════════════════════════════╗"
echo "║ pdftract Marathon — claude-code @ GLM-4.7 ║"
echo "╚══════════════════════════════════════════════════════════════╝"
echo " Repo: $REPO_DIR"
echo " Instruction: $INSTRUCTION_FILE"
echo " Session: $SESSION_NAME"
echo " Model: glm-4.7 (all tiers)"
echo " Proxy: $ZAI_BASE_URL"
echo " Logs: $LOG_DIR"
echo ""
tmux new-session -d -s "$SESSION_NAME" -c "$REPO_DIR" "$LOOP_CMD"
echo "Marathon running in tmux session: $SESSION_NAME"
echo " Attach: tmux attach -t $SESSION_NAME"
echo " Detach: Ctrl+B, D (while attached)"
echo " Stop: tmux kill-session -t $SESSION_NAME"
echo " Logs: ls $LOG_DIR/"

1
Cargo.lock generated

@ -2353,6 +2353,7 @@ dependencies = [
"secrecy",
"serde",
"serde_json",
"serde_yaml",
"sha2",
"smallvec",
"tempfile",


@ -29,7 +29,8 @@ fn main() {
("MARKDOWN", cfg!(feature = "markdown")),
];
let enabled: Vec<&str> = features.iter()
let enabled: Vec<&str> = features
.iter()
.filter(|(_, enabled)| *enabled)
.map(|(name, _)| *name)
.collect();


@ -62,7 +62,11 @@ impl AgeHistogram {
/// Total entries in histogram.
pub fn total(&self) -> u64 {
self.less_than_1h + self.less_than_1d + self.less_than_7d + self.less_than_30d + self.greater_than_30d
self.less_than_1h
+ self.less_than_1d
+ self.less_than_7d
+ self.less_than_30d
+ self.greater_than_30d
}
/// Get percentage for a bucket.
@ -114,32 +118,31 @@ pub fn compute_stats(cache_dir: &Path) -> Result<CacheStats> {
let mut oldest_mtime = None;
let mut newest_mtime = None;
for prefix1_entry in fs::read_dir(cache_dir)?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name().to_string_lossy().chars().all(|c| c.is_ascii_hexdigit())
})
{
for prefix1_entry in fs::read_dir(cache_dir)?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix1_dir = prefix1_entry.path();
for prefix2_entry in prefix1_dir.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
for prefix2_entry in prefix1_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix2_dir = prefix2_entry.path();
for fp_entry in prefix2_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
}) {
for fp_entry in prefix2_dir
.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| e.path().is_dir())
{
let fp_dir = fp_entry.path();
for entry in fp_dir.read_dir()?.filter_map(|e| e.ok()) {
@ -155,10 +158,14 @@ pub fn compute_stats(cache_dir: &Path) -> Result<CacheStats> {
if let Ok(modified) = metadata.modified() {
if let Ok(duration) = modified.duration_since(UNIX_EPOCH) {
let mtime_secs = duration.as_secs();
if oldest_mtime.is_none() || Some(mtime_secs) < oldest_mtime {
if oldest_mtime.is_none()
|| Some(mtime_secs) < oldest_mtime
{
oldest_mtime = Some(mtime_secs);
}
if newest_mtime.is_none() || Some(mtime_secs) > newest_mtime {
if newest_mtime.is_none()
|| Some(mtime_secs) > newest_mtime
{
newest_mtime = Some(mtime_secs);
}
@ -211,15 +218,15 @@ pub fn display_stats(stats: &CacheStats) {
};
println!("Entries: {}", stats.entry_count);
println!("Total size: {:.1} MiB compressed / {:.1} GiB uncompressed ({:.1}x ratio)",
println!(
"Total size: {:.1} MiB compressed / {:.1} GiB uncompressed ({:.1}x ratio)",
compressed_mb,
uncompressed_mb / 1024.0,
ratio
);
println!("Hit ratio (since last clear): {:.1}% ({} hits / {} total)",
hit_ratio,
stats.hits,
stats.total_accesses
println!(
"Hit ratio (since last clear): {:.1}% ({} hits / {} total)",
hit_ratio, stats.hits, stats.total_accesses
);
if let Some(oldest) = stats.oldest_entry_age_seconds {
@ -245,7 +252,8 @@ pub fn display_stats(stats: &CacheStats) {
}
let h = &stats.age_histogram;
println!("Age histogram: <1h: {:.1}%, <1d: {:.1}%, <7d: {:.1}%, <30d: {:.1}%, >30d: {:.1}%",
println!(
"Age histogram: <1h: {:.1}%, <1d: {:.1}%, <7d: {:.1}%, <30d: {:.1}%, >30d: {:.1}%",
h.percentage(h.less_than_1h),
h.percentage(h.less_than_1d),
h.percentage(h.less_than_7d),
@ -314,32 +322,31 @@ pub fn clear_cache(cache_dir: &Path, yes: bool) -> Result<()> {
// Delete all entry files (preserve index.json and sentinel)
let mut deleted = 0;
for prefix1_entry in fs::read_dir(cache_dir)?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name().to_string_lossy().chars().all(|c| c.is_ascii_hexdigit())
})
{
for prefix1_entry in fs::read_dir(cache_dir)?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix1_dir = prefix1_entry.path();
for prefix2_entry in prefix1_dir.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
for prefix2_entry in prefix1_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix2_dir = prefix2_entry.path();
for fp_entry in prefix2_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
}) {
for fp_entry in prefix2_dir
.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| e.path().is_dir())
{
let fp_dir = fp_entry.path();
// Delete all files in the fingerprint directory
@ -383,8 +390,10 @@ pub fn clear_cache(cache_dir: &Path, yes: bool) -> Result<()> {
pub fn purge_cache_older_than(cache_dir: &Path, duration_str: &str) -> Result<()> {
use humantime::parse_duration;
let duration = parse_duration(duration_str)
.context(format!("Invalid duration '{}'. Use formats like '30d', '7d', '1h'", duration_str))?;
let duration = parse_duration(duration_str).context(format!(
"Invalid duration '{}'. Use formats like '30d', '7d', '1h'",
duration_str
))?;
let cutoff_secs = SystemTime::now()
.duration_since(UNIX_EPOCH)
@ -394,32 +403,31 @@ pub fn purge_cache_older_than(cache_dir: &Path, duration_str: &str) -> Result<()
let mut deleted = 0;
for prefix1_entry in fs::read_dir(cache_dir)?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name().to_string_lossy().chars().all(|c| c.is_ascii_hexdigit())
})
{
for prefix1_entry in fs::read_dir(cache_dir)?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix1_dir = prefix1_entry.path();
for prefix2_entry in prefix1_dir.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
for prefix2_entry in prefix1_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix2_dir = prefix2_entry.path();
for fp_entry in prefix2_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
}) {
for fp_entry in prefix2_dir
.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| e.path().is_dir())
{
let fp_dir = fp_entry.path();
for entry in fp_dir.read_dir()?.filter_map(|e| e.ok()) {
@ -474,8 +482,10 @@ pub fn purge_cache_older_than(cache_dir: &Path, duration_str: &str) -> Result<()
pub fn purge_cache_version(_cache_dir: &Path, version_constraint: &str) -> Result<()> {
use semver::VersionReq;
let _req = VersionReq::parse(version_constraint)
.context(format!("Invalid version constraint '{}'", version_constraint))?;
let _req = VersionReq::parse(version_constraint).context(format!(
"Invalid version constraint '{}'",
version_constraint
))?;
// For now, this is a no-op since we don't track extraction versions per entry
// This would require extending the cache entry metadata
@ -488,32 +498,31 @@ pub fn purge_cache_version(_cache_dir: &Path, version_constraint: &str) -> Resul
fn count_entries(cache_dir: &Path) -> Result<u64> {
let mut count = 0;
for prefix1_entry in fs::read_dir(cache_dir)?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name().to_string_lossy().chars().all(|c| c.is_ascii_hexdigit())
})
{
for prefix1_entry in fs::read_dir(cache_dir)?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix1_dir = prefix1_entry.path();
for prefix2_entry in prefix1_dir.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
for prefix2_entry in prefix1_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix2_dir = prefix2_entry.path();
for fp_entry in prefix2_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
}) {
for fp_entry in prefix2_dir
.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| e.path().is_dir())
{
let fp_dir = fp_entry.path();
for entry in fp_dir.read_dir()?.filter_map(|e| e.ok()) {
@ -659,8 +668,16 @@ mod tests {
let fp_dir = cache_dir.join("e7").join("a1").join(fp);
fs::create_dir_all(&fp_dir).unwrap();
fs::write(fp_dir.join(format!("{}-1000.json.zst", opts)), b"x".repeat(1000)).unwrap();
fs::write(fp_dir.join(format!("{}-2000.json.zst", opts)), b"x".repeat(2000)).unwrap();
fs::write(
fp_dir.join(format!("{}-1000.json.zst", opts)),
b"x".repeat(1000),
)
.unwrap();
fs::write(
fp_dir.join(format!("{}-2000.json.zst", opts)),
b"x".repeat(2000),
)
.unwrap();
let count = count_entries(cache_dir).unwrap();
assert_eq!(count, 2);
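
The hunks above repeat the same walk over `<cache_dir>/<2 hex digits>/<2 hex digits>/<fingerprint>/` in several functions. One way that walk could be factored out is sketched below; the helper names (`is_hex_prefix`, `hex_prefix_dirs`, `fingerprint_dirs`) are hypothetical and do not exist in this commit, and the sketch returns `io::Result` rather than the crate's own `Result` type.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True when `name` is exactly two ASCII hex digits, e.g. "e7".
fn is_hex_prefix(name: &str) -> bool {
    name.len() == 2 && name.chars().all(|c| c.is_ascii_hexdigit())
}

/// Subdirectories of `dir` whose names are two-hex-digit prefixes.
/// Unreadable entries are skipped, matching the `filter_map(|e| e.ok())`
/// behaviour in the diff.
fn hex_prefix_dirs(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut dirs = Vec::new();
    for entry in fs::read_dir(dir)?.filter_map(|e| e.ok()) {
        let path = entry.path();
        if path.is_dir() && is_hex_prefix(&entry.file_name().to_string_lossy()) {
            dirs.push(path);
        }
    }
    Ok(dirs)
}

/// All fingerprint directories found two prefix levels below `cache_dir`.
pub fn fingerprint_dirs(cache_dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut out = Vec::new();
    for prefix1 in hex_prefix_dirs(cache_dir)? {
        for prefix2 in hex_prefix_dirs(&prefix1)? {
            for entry in fs::read_dir(&prefix2)?.filter_map(|e| e.ok()) {
                let path = entry.path();
                if path.is_dir() {
                    out.push(path);
                }
            }
        }
    }
    Ok(out)
}
```

With a helper like this, each caller would only need a single loop over `fingerprint_dirs(cache_dir)?`, and the prefix rule would live in one place.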


@ -135,12 +135,18 @@ impl CodeGenerator {
return Ok(contract);
}
Err(e) => {
eprintln!("Warning: Failed to parse SDK contract from {:?}: {}", contract_path, e);
eprintln!(
"Warning: Failed to parse SDK contract from {:?}: {}",
contract_path, e
);
eprintln!("Falling back to hardcoded contract");
}
}
} else {
eprintln!("Warning: SDK contract file not found at {:?}, using hardcoded contract", contract_path);
eprintln!(
"Warning: SDK contract file not found at {:?}, using hardcoded contract",
contract_path
);
}
// Hardcoded fallback contract
@ -155,7 +161,9 @@ impl CodeGenerator {
let mut errors = Vec::new();
// Parse method signatures from the Method surface section
let _method_sig_re = Regex::new(r"\*\*([a-z_]+)\*\*\s*\n\s*- Signature: [`']?([a-zA-Z0-9_<>():?,\s]+)[`']?").unwrap();
let _method_sig_re =
Regex::new(r"\*\*([a-z_]+)\*\*\s*\n\s*- Signature: [`']?([a-zA-Z0-9_<>():?,\s]+)[`']?")
.unwrap();
let _method_table_re = Regex::new(r"\| [`']?([a-z_]+)[`']?\|").unwrap();
// Parse method table for CLI mappings
@ -170,18 +178,129 @@ impl CodeGenerator {
// Method definitions with their details
let method_patterns = [
("extract", "Extract", "extract", "extract", "Document", "ExtractOptions", "Extract structured data from a PDF", false, false, 0),
("extract_text", "ExtractText", "extract_text", "extract", "string", "ExtractOptions", "Extract plain text from a PDF", true, false, 0),
("extract_markdown", "ExtractMarkdown", "extract_markdown", "extract", "string", "ExtractOptions", "Extract Markdown-formatted text from a PDF", true, false, 0),
("extract_stream", "ExtractStream", "extract_stream", "extract", "Page", "ExtractOptions", "Extract pages from a PDF as a stream", false, false, 0),
("search", "Search", "search", "grep", "Match", "SearchOptions", "Search for text in a PDF", false, false, 0),
("get_metadata", "GetMetadata", "get_metadata", "extract", "Metadata", "BaseOptions", "Get metadata from a PDF", false, false, 0),
("hash", "Hash", "hash", "hash", "Fingerprint", "BaseOptions", "Compute hash fingerprint of a PDF", false, false, 0),
("classify", "Classify", "classify", "classify", "Classification", "", "Classify a PDF document", false, false, 0),
("verify_receipt", "VerifyReceipt", "verify_receipt", "verify-receipt", "bool", "", "Verify a receipt", false, true, 2),
(
"extract",
"Extract",
"extract",
"extract",
"Document",
"ExtractOptions",
"Extract structured data from a PDF",
false,
false,
0,
),
(
"extract_text",
"ExtractText",
"extract_text",
"extract",
"string",
"ExtractOptions",
"Extract plain text from a PDF",
true,
false,
0,
),
(
"extract_markdown",
"ExtractMarkdown",
"extract_markdown",
"extract",
"string",
"ExtractOptions",
"Extract Markdown-formatted text from a PDF",
true,
false,
0,
),
(
"extract_stream",
"ExtractStream",
"extract_stream",
"extract",
"Page",
"ExtractOptions",
"Extract pages from a PDF as a stream",
false,
false,
0,
),
(
"search",
"Search",
"search",
"grep",
"Match",
"SearchOptions",
"Search for text in a PDF",
false,
false,
0,
),
(
"get_metadata",
"GetMetadata",
"get_metadata",
"extract",
"Metadata",
"BaseOptions",
"Get metadata from a PDF",
false,
false,
0,
),
(
"hash",
"Hash",
"hash",
"hash",
"Fingerprint",
"BaseOptions",
"Compute hash fingerprint of a PDF",
false,
false,
0,
),
(
"classify",
"Classify",
"classify",
"classify",
"Classification",
"",
"Classify a PDF document",
false,
false,
0,
),
(
"verify_receipt",
"VerifyReceipt",
"verify_receipt",
"verify-receipt",
"bool",
"",
"Verify a receipt",
false,
true,
2,
),
];
for (name, camel_name, snake_name, cli_flag, return_type, options_type, description, returns_string, uses_string_params, string_param_count) in method_patterns {
for (
name,
camel_name,
snake_name,
cli_flag,
return_type,
options_type,
description,
returns_string,
uses_string_params,
string_param_count,
) in method_patterns
{
methods.push(Method {
name: name.to_string(),
camel_name: camel_name.to_string(),
@ -199,20 +318,28 @@ impl CodeGenerator {
// Parse error mapping table from the Error mapping section
let error_mapping_start = content.find("## Error mapping").unwrap_or(0);
let error_mapping_end = content.find("### Per-language base exception types").unwrap_or(content.len());
let error_mapping_end = content
.find("### Per-language base exception types")
.unwrap_or(content.len());
let error_mapping_section = content[error_mapping_start..error_mapping_end].to_string();
// The error table has the format: | Exit code | Meaning | Native exception |
// We need to find the table header and then parse the rows
let error_re = Regex::new(r"\|\s*(\d+)\s*\|\s*([^|]+?)\s*\|\s*`?([a-zA-Z]+)`?\s*\|").unwrap();
let error_re =
Regex::new(r"\|\s*(\d+)\s*\|\s*([^|]+?)\s*\|\s*`?([a-zA-Z]+)`?\s*\|").unwrap();
for cap in error_re.captures_iter(&error_mapping_section) {
if let (Some(exit_code_str), Some(meaning), Some(exception_name)) = (
cap.get(1), cap.get(2), cap.get(3)
) {
if let (Some(exit_code_str), Some(meaning), Some(exception_name)) =
(cap.get(1), cap.get(2), cap.get(3))
{
if let Ok(exit_code) = exit_code_str.as_str().parse::<i32>() {
let name = exception_name.as_str().trim().to_string();
// Skip the generic "any other non-zero" entry and malformed matches
if !name.contains("any other") && name.chars().next().map_or(false, |c| c.is_ascii_alphabetic()) {
if !name.contains("any other")
&& name
.chars()
.next()
.map_or(false, |c| c.is_ascii_alphabetic())
{
errors.push(Error {
exit_code,
exception_name: name,
@ -367,7 +494,8 @@ impl CodeGenerator {
Error {
exit_code: 3,
exception_name: "EncryptionError".to_string(),
description: "The PDF is encrypted and password is missing or wrong".to_string(),
description: "The PDF is encrypted and password is missing or wrong"
.to_string(),
},
Error {
exit_code: 4,
@ -418,11 +546,18 @@ impl CodeGenerator {
let template_dir = PathBuf::from("templates/sdk-skeleton").join(lang.template_dir());
if !template_dir.exists() {
anyhow::bail!("Template directory for {:?} does not exist: {:?}", lang, template_dir);
anyhow::bail!(
"Template directory for {:?} does not exist: {:?}",
lang,
template_dir
);
}
// Walk the template directory and render each file
for entry in WalkDir::new(&template_dir).into_iter().filter_map(|e| e.ok()) {
for entry in WalkDir::new(&template_dir)
.into_iter()
.filter_map(|e| e.ok())
{
let path = entry.path();
if path.is_dir() {
continue;
@ -451,7 +586,8 @@ impl CodeGenerator {
// Register template if it contains Tera syntax
if template_content.contains("{{") || template_content.contains("{%") {
self.tera.add_raw_template(&template_name, &template_content)?;
self.tera
.add_raw_template(&template_name, &template_content)?;
}
// Build context
@ -488,7 +624,10 @@ impl CodeGenerator {
/// Files that should be excluded from validation comparison.
fn should_exclude_from_validation(path: &Path) -> bool {
let file_name = path.file_name().and_then(|n| n.to_str());
matches!(file_name, Some("GENERATED") | Some(".codegen-version") | Some(".gitignore"))
matches!(
file_name,
Some("GENERATED") | Some(".codegen-version") | Some(".gitignore")
)
}
/// Validates an existing SDK against the current generator output.
@ -502,7 +641,10 @@ impl CodeGenerator {
let mut differences = Vec::new();
// Compare generated files with existing SDK
for entry in WalkDir::new(temp_dir.path()).into_iter().filter_map(|e| e.ok()) {
for entry in WalkDir::new(temp_dir.path())
.into_iter()
.filter_map(|e| e.ok())
{
let path = entry.path();
if path.is_dir() {
continue;


@ -1,5 +1,5 @@
use std::path::Path;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::path::Path;
/// Check: cache directory (cache feature)
///
@ -13,9 +13,9 @@ impl CacheDirCheck {
#[cfg(unix)]
fn check_free_space(path: &Path) -> Result<u64, String> {
use libc::{c_char, statvfs};
use std::ffi::CString;
use std::os::unix::ffi::OsStrExt;
use libc::{statvfs, c_char};
let path_cstr = CString::new(path.as_os_str().as_bytes())
.map_err(|_| "Failed to convert path to CString".to_string())?;
@ -54,8 +54,7 @@ impl CacheDirCheck {
// Try to create a temporary file
let test_file = path.join(".pdftract-doctor-test");
std::fs::write(&test_file, b"test")
.map_err(|e| format!("Not writable: {}", e))?;
std::fs::write(&test_file, b"test").map_err(|e| format!("Not writable: {}", e))?;
// Clean up
let _ = std::fs::remove_file(&test_file);
@ -77,7 +76,8 @@ impl CacheDirCheck {
let value: serde_json::Value = serde_json::from_str(&content)
.map_err(|e| format!("Failed to parse index.json: {}", e))?;
let schema_version = value.get("schema_version")
let schema_version = value
.get("schema_version")
.and_then(|v| v.as_u64())
.unwrap_or(0);
@ -86,7 +86,10 @@ impl CacheDirCheck {
if schema_version == current_version as u64 {
Ok(format!("Layout version {} (current)", schema_version))
} else {
Ok(format!("Layout version {} (migration available to {})", schema_version, current_version))
Ok(format!(
"Layout version {} (migration available to {})",
schema_version, current_version
))
}
}
}
@ -111,7 +114,10 @@ impl Check for CacheDirCheck {
return CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Cache directory does not exist: {} (will be created on first use)", cache_dir.display()),
detail: format!(
"Cache directory does not exist: {} (will be created on first use)",
cache_dir.display()
),
};
}
@ -131,7 +137,10 @@ impl Check for CacheDirCheck {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("{} (low disk space: {} MiB free, 1 GiB recommended)", layout, free_mb),
detail: format!(
"{} (low disk space: {} MiB free, 1 GiB recommended)",
layout, free_mb
),
}
} else {
CheckResult {
@ -141,13 +150,15 @@ impl Check for CacheDirCheck {
}
}
}
(Err(e), _, _) | (_, Err(e), _) | (_, _, Err(e)) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("Cache directory check failed at {}: {}", cache_dir.display(), e),
}
}
(Err(e), _, _) | (_, Err(e), _) | (_, _, Err(e)) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!(
"Cache directory check failed at {}: {}",
cache_dir.display(),
e
),
},
}
}
}
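
The final `match` in this file combines three independent `Result`s and reports the first error through an or-pattern. A self-contained sketch of that pattern is shown below; the `Status` enum, the `combine` function, its parameter names, and the 1024 MiB threshold are assumptions made for illustration, not the definitions used by `CacheDirCheck`.

```rust
/// Hypothetical stand-in for the check status.
#[derive(Debug, PartialEq)]
pub enum Status {
    Ok,
    Warn,
    Fail,
}

/// Combines three sub-check results into one status and detail string.
/// The threshold below is an assumed value for this sketch.
pub fn combine(
    free_mib: Result<u64, String>,
    writable: Result<(), String>,
    layout: Result<String, String>,
) -> (Status, String) {
    match (free_mib, writable, layout) {
        (Ok(free), Ok(()), Ok(layout)) => {
            if free < 1024 {
                (
                    Status::Warn,
                    format!("{} (low disk space: {} MiB free)", layout, free),
                )
            } else {
                (Status::Ok, layout)
            }
        }
        // Each alternative binds `e` with the same type, so one arm covers
        // every failing combination. Alternatives are tried left to right,
        // so the leftmost error in the tuple is the one reported.
        (Err(e), _, _) | (_, Err(e), _) | (_, _, Err(e)) => {
            (Status::Fail, format!("check failed: {}", e))
        }
    }
}
```

Because the two arms together cover every combination of `Ok` and `Err`, the compiler accepts the match without a catch-all arm.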


@ -1,5 +1,5 @@
use std::process::Command;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::process::Command;
/// Check: leptonica installation (transitive Tesseract dependency)
///
@ -15,17 +15,13 @@ impl Check for LeptonicaCheck {
fn run(&self, _ctx: &DoctorCtx) -> CheckResult {
// First check if pkg-config exists
let pkg_check = Command::new("pkg-config")
.arg("--version")
.output();
let pkg_check = Command::new("pkg-config").arg("--version").output();
let pkg_available = pkg_check.is_ok();
if !pkg_available {
// Fallback: try ldconfig -p | grep lept
let ldconfig = Command::new("ldconfig")
.arg("-p")
.output();
let ldconfig = Command::new("ldconfig").arg("-p").output();
if let Ok(output) = ldconfig {
let stdout = String::from_utf8_lossy(&output.stdout);
@ -68,14 +64,20 @@ impl Check for LeptonicaCheck {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("leptonica {} found (< 1.79: may have compatibility issues)", version),
detail: format!(
"leptonica {} found (< 1.79: may have compatibility issues)",
version
),
}
}
} else {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("leptonica {} found but version could not be parsed", version_str),
detail: format!(
"leptonica {} found but version could not be parsed",
version_str
),
}
}
}
@ -87,13 +89,11 @@ impl Check for LeptonicaCheck {
detail: format!("leptonica not found: {}", stderr.trim()),
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("pkg-config check failed: {}", e),
}
}
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("pkg-config check failed: {}", e),
},
}
}
}


@ -1,5 +1,5 @@
use std::process::Command;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::process::Command;
/// Check: libopenjp2 installation (JPEG2000 decoding)
///
@ -14,17 +14,13 @@ impl Check for Libopenjp2Check {
fn run(&self, _ctx: &DoctorCtx) -> CheckResult {
// First check if pkg-config exists
let pkg_check = Command::new("pkg-config")
.arg("--version")
.output();
let pkg_check = Command::new("pkg-config").arg("--version").output();
let pkg_available = pkg_check.is_ok();
if !pkg_available {
// Fallback: try ldconfig -p | grep openjp2
let ldconfig = Command::new("ldconfig")
.arg("-p")
.output();
let ldconfig = Command::new("ldconfig").arg("-p").output();
if let Ok(output) = ldconfig {
let stdout = String::from_utf8_lossy(&output.stdout);
@ -32,7 +28,8 @@ impl Check for Libopenjp2Check {
return CheckResult {
name: self.name(),
status: CheckStatus::Ok,
detail: "libopenjp2 found via ldconfig (pkg-config unavailable)".to_string(),
detail: "libopenjp2 found via ldconfig (pkg-config unavailable)"
.to_string(),
};
}
}
@ -69,20 +66,16 @@ impl Check for Libopenjp2Check {
detail,
}
}
Ok(_) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: "libopenjp2 not found (pkg-config --exists libopenjp2 failed)".to_string(),
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("pkg-config check failed: {}", e),
}
}
Ok(_) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: "libopenjp2 not found (pkg-config --exists libopenjp2 failed)".to_string(),
},
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("pkg-config check failed: {}", e),
},
}
}
}


@ -1,5 +1,5 @@
use std::process::Command;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::process::Command;
/// Check: libtiff installation (CCITT fax decoding)
///
@ -14,17 +14,13 @@ impl Check for LibtiffCheck {
fn run(&self, _ctx: &DoctorCtx) -> CheckResult {
// First check if pkg-config exists
let pkg_check = Command::new("pkg-config")
.arg("--version")
.output();
let pkg_check = Command::new("pkg-config").arg("--version").output();
let pkg_available = pkg_check.is_ok();
if !pkg_available {
// Fallback: try ldconfig -p | grep tiff
let ldconfig = Command::new("ldconfig")
.arg("-p")
.output();
let ldconfig = Command::new("ldconfig").arg("-p").output();
if let Ok(output) = ldconfig {
let stdout = String::from_utf8_lossy(&output.stdout);
@ -69,20 +65,16 @@ impl Check for LibtiffCheck {
detail,
}
}
Ok(_) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: "libtiff not found (pkg-config --exists libtiff-4 failed)".to_string(),
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("pkg-config check failed: {}", e),
}
}
Ok(_) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: "libtiff not found (pkg-config --exists libtiff-4 failed)".to_string(),
},
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("pkg-config check failed: {}", e),
},
}
}
}

@ -1,5 +1,5 @@
use std::env;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::env;
/// Check: system locale
///
@ -40,14 +40,19 @@ impl Check for LocaleCheck {
Some(locale) if locale.is_empty() => CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: "Locale is empty (LANG/LC_ALL set to empty string, may cause encoding issues)".to_string(),
detail:
"Locale is empty (LANG/LC_ALL set to empty string, may cause encoding issues)"
.to_string(),
},
Some(locale) => {
if locale == "C" || locale == "POSIX" {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Locale is '{}' (non-UTF-8, may cause encoding issues)", locale),
detail: format!(
"Locale is '{}' (non-UTF-8, may cause encoding issues)",
locale
),
}
} else if Self::is_utf8_locale(&locale) {
CheckResult {
@ -59,7 +64,10 @@ impl Check for LocaleCheck {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Locale '{}' (non-UTF-8, may cause encoding issues)", locale),
detail: format!(
"Locale '{}' (non-UTF-8, may cause encoding issues)",
locale
),
}
}
}

@ -47,7 +47,9 @@ impl MemoryCheck {
for line in meminfo.lines() {
let parts: Vec<&str> = line.split_whitespace().collect();
if parts.len() < 2 { continue; }
if parts.len() < 2 {
continue;
}
if let Ok(kb) = parts[1].parse::<u64>() {
match parts[0] {
@ -148,13 +150,11 @@ impl Check for MemoryCheck {
}
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Could not determine available memory: {}", e),
}
}
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Could not determine available memory: {}", e),
},
}
}
}
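Note on the meminfo hunk above: the loop splits each line on whitespace, skips lines with fewer than two fields, and parses the second field as a kB count. A minimal standalone sketch of that pattern follows. The function name `parse_mem_available_kb` and the choice of the `MemAvailable:` key are assumptions for illustration; the commit's actual `match parts[0]` arms are not visible in this hunk.

```rust
/// Extract the `MemAvailable` value (in kB) from /proc/meminfo-style text.
/// Hypothetical helper: the name and the single key handled here are
/// illustrative and are not taken from the commit.
fn parse_mem_available_kb(meminfo: &str) -> Option<u64> {
    for line in meminfo.lines() {
        // A typical line looks like "MemAvailable:    8123456 kB".
        let parts: Vec<&str> = line.split_whitespace().collect();
        if parts.len() < 2 {
            // Not enough fields to hold a key and a value.
            continue;
        }
        // Lines whose second field is not a number are skipped silently.
        if let Ok(kb) = parts[1].parse::<u64>() {
            if parts[0] == "MemAvailable:" {
                return Some(kb);
            }
        }
    }
    None
}
```

Returning `Option` keeps the "field not present" case separate from a real zero reading, which a caller could map to a warning rather than a failure.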

@ -1,27 +1,27 @@
// Individual check modules
mod binary;
mod cache_dir;
#[cfg(feature = "ocr")]
mod leptonica;
#[cfg(feature = "ocr")]
mod libopenjp2;
#[cfg(feature = "ocr")]
mod libtiff;
mod locale;
mod memory;
#[cfg(feature = "remote")]
mod network;
#[cfg(feature = "full-render")]
mod pdfium;
#[cfg(feature = "profiles")]
mod profile_path;
mod temp_dir;
#[cfg(feature = "ocr")]
mod tesseract;
#[cfg(feature = "ocr")]
mod tesseract_langs;
#[cfg(feature = "ocr")]
mod leptonica;
#[cfg(feature = "ocr")]
mod libtiff;
#[cfg(feature = "ocr")]
mod libopenjp2;
#[cfg(feature = "full-render")]
mod pdfium;
#[cfg(feature = "remote")]
mod network;
mod cache_dir;
#[cfg(feature = "profiles")]
mod profile_path;
#[cfg(unix)]
mod ulimit;
mod memory;
mod locale;
mod temp_dir;
use super::Check;

@ -1,5 +1,5 @@
use std::time::Duration;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::time::Duration;
/// Check: network reachability (remote source feature)
///
@ -43,20 +43,31 @@ impl Check for NetworkCheck {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Network reachable but slow: {} in {:.2}s", status, elapsed.as_secs_f64()),
detail: format!(
"Network reachable but slow: {} in {:.2}s",
status,
elapsed.as_secs_f64()
),
}
} else {
CheckResult {
name: self.name(),
status: CheckStatus::Ok,
detail: format!("Network reachable: {} in {:.2}s", status, elapsed.as_secs_f64()),
detail: format!(
"Network reachable: {} in {:.2}s",
status,
elapsed.as_secs_f64()
),
}
}
} else if status >= 300 && status < 400 {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Network returned redirect: {} (may indicate proxy or redirect loop)", status),
detail: format!(
"Network returned redirect: {} (may indicate proxy or redirect loop)",
status
),
}
} else {
CheckResult {
@ -66,13 +77,11 @@ impl Check for NetworkCheck {
}
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: e,
}
}
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: e,
},
}
}
}

@ -73,17 +73,18 @@ impl Check for PdfiumCheck {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("pdfium {} found (< 6555: may have compatibility issues), {}", version, source),
detail: format!(
"pdfium {} found (< 6555: may have compatibility issues), {}",
version, source
),
}
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("pdfium not found: {}", e),
}
}
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("pdfium not found: {}", e),
},
}
}
}

@ -1,6 +1,6 @@
use std::path::{Path, PathBuf};
use std::env;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::env;
use std::path::{Path, PathBuf};
/// Check: temp directory writable and free space
///
@ -25,8 +25,7 @@ impl TempDirCheck {
// Try to create a temporary file
let test_file = path.join(".pdftract-doctor-test");
std::fs::write(&test_file, b"test")
.map_err(|e| format!("Not writable: {}", e))?;
std::fs::write(&test_file, b"test").map_err(|e| format!("Not writable: {}", e))?;
// Clean up
let _ = std::fs::remove_file(&test_file);
@ -36,9 +35,9 @@ impl TempDirCheck {
#[cfg(unix)]
fn check_free_space(path: &Path) -> Result<u64, String> {
use libc::{c_char, statvfs};
use std::ffi::CString;
use std::os::unix::ffi::OsStrExt;
use libc::{statvfs, c_char};
let path_cstr = CString::new(path.as_os_str().as_bytes())
.map_err(|_| "Failed to convert path to CString".to_string())?;
@ -114,20 +113,24 @@ impl Check for TempDirCheck {
}
}
}
(Err(e), _) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("Temp directory check failed at {}: {}", temp_dir.display(), e),
}
}
(_, Err(e)) => {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Could not check free space at {}: {}", temp_dir.display(), e),
}
}
(Err(e), _) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!(
"Temp directory check failed at {}: {}",
temp_dir.display(),
e
),
},
(_, Err(e)) => CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!(
"Could not check free space at {}: {}",
temp_dir.display(),
e
),
},
}
}
}

@ -1,5 +1,5 @@
use std::process::Command;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::process::Command;
/// Check: tesseract installation and version
///
@ -14,9 +14,7 @@ impl Check for TesseractCheck {
}
fn run(&self, _ctx: &DoctorCtx) -> CheckResult {
let output = Command::new("tesseract")
.arg("--version")
.output();
let output = Command::new("tesseract").arg("--version").output();
match output {
Ok(output) => {
@ -61,16 +59,17 @@ impl Check for TesseractCheck {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("tesseract binary found but version could not be parsed: {}", version_output.trim()),
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("tesseract not found: {}", e),
detail: format!(
"tesseract binary found but version could not be parsed: {}",
version_output.trim()
),
}
}
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("tesseract not found: {}", e),
},
}
}
}

@ -1,5 +1,5 @@
use std::process::Command;
use super::super::{Check, CheckResult, CheckStatus, DoctorCtx};
use std::process::Command;
/// Check: tesseract language availability
///
@ -14,9 +14,7 @@ impl Check for TesseractLangsCheck {
}
fn run(&self, ctx: &DoctorCtx) -> CheckResult {
let output = Command::new("tesseract")
.arg("--list-langs")
.output();
let output = Command::new("tesseract").arg("--list-langs").output();
match output {
Ok(output) => {
@ -24,7 +22,10 @@ impl Check for TesseractLangsCheck {
return CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("tesseract --list-langs failed: {}", String::from_utf8_lossy(&output.stderr)),
detail: format!(
"tesseract --list-langs failed: {}",
String::from_utf8_lossy(&output.stderr)
),
};
}
@ -52,7 +53,10 @@ impl Check for TesseractLangsCheck {
return CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("Required language 'eng' not found. Installed: {:?}", installed_langs),
detail: format!(
"Required language 'eng' not found. Installed: {:?}",
installed_langs
),
};
}
@ -60,7 +64,10 @@ impl Check for TesseractLangsCheck {
return CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Requested languages not found: {:?}. Installed: {:?}", missing_required, installed_langs),
detail: format!(
"Requested languages not found: {:?}. Installed: {:?}",
missing_required, installed_langs
),
};
}
@ -70,13 +77,11 @@ impl Check for TesseractLangsCheck {
detail: format!("All required languages present: {:?}", installed_langs),
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("tesseract --list-langs failed: {}", e),
}
}
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Fail,
detail: format!("tesseract --list-langs failed: {}", e),
},
}
}
}

@ -12,7 +12,7 @@ pub struct UlimitCheck;
impl UlimitCheck {
#[cfg(unix)]
fn get_rlimit_nofile() -> Result<u64, String> {
use libc::{rlimit, RLIMIT_NOFILE, getrlimit};
use libc::{getrlimit, rlimit, RLIMIT_NOFILE};
unsafe {
let mut limits = rlimit {
@ -49,7 +49,10 @@ impl Check for UlimitCheck {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("File descriptor limit: {} (recommended: >= 1024)", limit),
detail: format!(
"File descriptor limit: {} (recommended: >= 1024)",
limit
),
}
} else {
CheckResult {
@ -59,13 +62,11 @@ impl Check for UlimitCheck {
}
}
}
Err(e) => {
CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Could not read ulimit: {}", e),
}
}
Err(e) => CheckResult {
name: self.name(),
status: CheckStatus::Warn,
detail: format!("Could not read ulimit: {}", e),
},
}
}

@ -1,8 +1,8 @@
//! Doctor subcommand - environment health checks
use anyhow::Result;
use std::path::PathBuf;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::path::PathBuf;
// Private checks module
mod checks;
@ -179,9 +179,12 @@ pub fn run(opts: DoctorOptions) -> Result<()> {
if opts.json {
output::output_json(&results);
} else {
output::output_text(&results, &output::TextOptions {
no_color: opts.no_color,
})?;
output::output_text(
&results,
&output::TextOptions {
no_color: opts.no_color,
},
)?;
}
// Determine exit code per plan section 6.10 line 2520-2521:

@ -1,7 +1,7 @@
//! Human-readable table output for doctor subcommand
use anyhow::Result;
use crate::doctor::{CheckResult, CheckStatus};
use anyhow::Result;
use std::io::{IsTerminal, Write};
/// Options for text output

@ -1,9 +1,9 @@
//! Output formatting for doctor subcommand
mod features;
mod human;
mod json;
mod features;
pub use features::output_features;
pub use human::{output_text, TextOptions};
pub use json::output_json;
pub use features::output_features;

@ -75,10 +75,10 @@ pub fn render_spans(spans: &[SpanJson]) -> Vec<String> {
/// - `Some(c) where c >= 0.8`: green (#22c55e) - high confidence
fn confidence_to_color(confidence: Option<f64>) -> &'static str {
match confidence {
None => "#94a3b8", // gray - direct extraction
None => "#94a3b8", // gray - direct extraction
Some(c) if c < 0.5 => "#ef4444", // red - low confidence
Some(c) if c < 0.8 => "#eab308", // yellow - medium confidence
Some(_) => "#22c55e", // green - high confidence
Some(_) => "#22c55e", // green - high confidence
}
}
@ -111,16 +111,14 @@ mod tests {
#[test]
fn test_render_spans_single() {
let spans = vec![
SpanJson {
text: "Hello".to_string(),
bbox: [100.0, 200.0, 200.0, 220.0],
font: "Helvetica".to_string(),
size: 12.0,
confidence: None,
receipt: None,
}
];
let spans = vec![SpanJson {
text: "Hello".to_string(),
bbox: [100.0, 200.0, 200.0, 220.0],
font: "Helvetica".to_string(),
size: 12.0,
confidence: None,
receipt: None,
}];
let output = render_spans(&spans);
assert_eq!(output.len(), 1);
@ -149,50 +147,48 @@ mod tests {
#[test]
fn test_render_spans_confidence_colors() {
let test_cases = [
(None, "#94a3b8"), // gray - no confidence
(Some(0.3), "#ef4444"), // red - low
(Some(0.5), "#eab308"), // yellow - medium (boundary)
(Some(0.6), "#eab308"), // yellow - medium
(Some(0.79), "#eab308"), // yellow - medium (boundary)
(Some(0.8), "#22c55e"), // green - high (boundary)
(Some(0.95), "#22c55e"), // green - high
(Some(1.0), "#22c55e"), // green - perfect
(None, "#94a3b8"), // gray - no confidence
(Some(0.3), "#ef4444"), // red - low
(Some(0.5), "#eab308"), // yellow - medium (boundary)
(Some(0.6), "#eab308"), // yellow - medium
(Some(0.79), "#eab308"), // yellow - medium (boundary)
(Some(0.8), "#22c55e"), // green - high (boundary)
(Some(0.95), "#22c55e"), // green - high
(Some(1.0), "#22c55e"), // green - perfect
];
for (confidence, expected_color) in test_cases {
let spans = vec![
SpanJson {
text: "Test".to_string(),
bbox: [0.0, 0.0, 10.0, 10.0],
font: "Arial".to_string(),
size: 10.0,
confidence,
receipt: None,
}
];
let spans = vec![SpanJson {
text: "Test".to_string(),
bbox: [0.0, 0.0, 10.0, 10.0],
font: "Arial".to_string(),
size: 10.0,
confidence,
receipt: None,
}];
let output = render_spans(&spans);
assert_eq!(output.len(), 1);
assert!(
output[0].contains(&format!("stroke=\"{}\"", expected_color)),
"Confidence {:?} should produce color {}, got: {}",
confidence, expected_color, output[0]
confidence,
expected_color,
output[0]
);
}
}
#[test]
fn test_render_spans_data_attributes() {
let spans = vec![
SpanJson {
text: "Test & <quote>".to_string(),
bbox: [50.0, 100.0, 150.0, 120.0],
font: "Times \"Roman\"".to_string(),
size: 14.0,
confidence: Some(0.85),
receipt: None,
}
];
let spans = vec![SpanJson {
text: "Test & <quote>".to_string(),
bbox: [50.0, 100.0, 150.0, 120.0],
font: "Times \"Roman\"".to_string(),
size: 14.0,
confidence: Some(0.85),
receipt: None,
}];
let output = render_spans(&spans);
let rect = &output[0];
@ -283,16 +279,14 @@ mod tests {
#[test]
fn test_render_spans_css_class() {
let spans = vec![
SpanJson {
text: "Test".to_string(),
bbox: [0.0, 0.0, 100.0, 20.0],
font: "Arial".to_string(),
size: 12.0,
confidence: None,
receipt: None,
}
];
let spans = vec![SpanJson {
text: "Test".to_string(),
bbox: [0.0, 0.0, 100.0, 20.0],
font: "Arial".to_string(),
size: 12.0,
confidence: None,
receipt: None,
}];
let output = render_spans(&spans);
assert!(output[0].contains(r#"class="span-rect""#));
@ -325,16 +319,14 @@ mod tests {
#[test]
fn test_render_spans_float_bbox() {
let spans = vec![
SpanJson {
text: "Float".to_string(),
bbox: [10.567, 20.891, 100.234, 110.567],
font: "Arial".to_string(),
size: 12.5,
confidence: None,
receipt: None,
}
];
let spans = vec![SpanJson {
text: "Float".to_string(),
bbox: [10.567, 20.891, 100.234, 110.567],
font: "Arial".to_string(),
size: 12.5,
confidence: None,
receipt: None,
}];
let output = render_spans(&spans);
let rect = &output[0];
@ -348,16 +340,14 @@ mod tests {
#[test]
fn test_render_spans_output_is_valid_svg() {
let spans = vec![
SpanJson {
text: "Valid".to_string(),
bbox: [0.0, 0.0, 100.0, 20.0],
font: "Arial".to_string(),
size: 12.0,
confidence: Some(0.95),
receipt: None,
}
];
let spans = vec![SpanJson {
text: "Valid".to_string(),
bbox: [0.0, 0.0, 100.0, 20.0],
font: "Arial".to_string(),
size: 12.0,
confidence: Some(0.95),
receipt: None,
}];
let output = render_spans(&spans);
let rect = &output[0];

@ -53,7 +53,10 @@ pub fn resolve_token(
.with_context(|| format!("Failed to read token file: {}", path.display()))?;
let token = token_content.trim_end().to_string();
check_token_length(&token);
return Ok(Some((SecretString::new(token.into()), AuthSource::TokenFile)));
return Ok(Some((
SecretString::new(token.into()),
AuthSource::TokenFile,
)));
}
// Priority 2: PDFTRACT_MCP_TOKEN env var
@ -66,10 +69,7 @@ pub fn resolve_token(
// Priority 3: --auth-token VALUE (only if PDFTRACT_INSECURE_CLI_TOKEN=1)
if let Some(token) = cli_token {
let insecure_allowed = env::var("PDFTRACT_INSECURE_CLI_TOKEN")
.ok()
.as_deref()
== Some("1");
let insecure_allowed = env::var("PDFTRACT_INSECURE_CLI_TOKEN").ok().as_deref() == Some("1");
if !insecure_allowed {
anyhow::bail!(
@ -84,7 +84,10 @@ pub fn resolve_token(
Recommended: Use --auth-token-file PATH or PDFTRACT_MCP_TOKEN env var."
);
check_token_length(&token);
return Ok(Some((SecretString::new(token.into()), AuthSource::CliInsecure)));
return Ok(Some((
SecretString::new(token.into()),
AuthSource::CliInsecure,
)));
}
// No token provided
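Note on the token-resolution hunks above: the order they show is token file first, then the `PDFTRACT_MCP_TOKEN` env var, then the CLI value, which is accepted only when `PDFTRACT_INSECURE_CLI_TOKEN=1`. The sketch below restates that order as a pure function so it can be reasoned about without touching the environment. `pick_token`, the `Source` enum, and the `EnvVar` variant name are hypothetical; only `TokenFile` and `CliInsecure` mirror names that appear in the diff.

```rust
/// Where the selected token came from. `EnvVar` is an assumed name.
#[derive(Debug, PartialEq)]
enum Source {
    TokenFile,
    EnvVar,
    CliInsecure,
}

/// Pick a token by priority. Inputs are passed in explicitly (rather than
/// read from files or the environment) to keep the sketch deterministic.
fn pick_token(
    file_token: Option<&str>,
    env_token: Option<&str>,
    cli_token: Option<&str>,
    insecure_cli_allowed: bool,
) -> Result<Option<(String, Source)>, String> {
    // Priority 1: contents of the token file, trailing whitespace trimmed.
    if let Some(t) = file_token {
        return Ok(Some((t.trim_end().to_string(), Source::TokenFile)));
    }
    // Priority 2: environment variable.
    if let Some(t) = env_token {
        return Ok(Some((t.to_string(), Source::EnvVar)));
    }
    // Priority 3: CLI value, rejected unless explicitly allowed.
    if let Some(t) = cli_token {
        if !insecure_cli_allowed {
            return Err("CLI token requires explicit insecure opt-in".to_string());
        }
        return Ok(Some((t.to_string(), Source::CliInsecure)));
    }
    // No token provided.
    Ok(None)
}
```

Keeping the CLI path behind an explicit opt-in means a token passed on the command line (and therefore visible in process listings) is never accepted by accident.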

@ -105,11 +105,17 @@ mod tests {
// Non-loopback addresses should fail without a token
let result = check_bind_security("0.0.0.0:8080", false);
assert!(result.is_err());
assert!(result.unwrap_err().to_string().contains("requires --auth-token-file"));
assert!(result
.unwrap_err()
.to_string()
.contains("requires --auth-token-file"));
let result = check_bind_security("192.168.1.1:3000", false);
assert!(result.is_err());
assert!(result.unwrap_err().to_string().contains("requires --auth-token-file"));
assert!(result
.unwrap_err()
.to_string()
.contains("requires --auth-token-file"));
}
#[test]

@ -479,20 +479,17 @@ impl<'de> Deserialize<'de> for BatchMessage {
// Deserialize each array element as a Request
let mut reqs = Vec::with_capacity(arr.len());
for item in arr {
let req = Request::deserialize(item)
.map_err(serde::de::Error::custom)?;
let req = Request::deserialize(item).map_err(serde::de::Error::custom)?;
reqs.push(req);
}
Ok(BatchMessage::Batch(reqs))
}
Value::Object(obj) => {
let req = Request::deserialize(Value::Object(obj))
.map_err(serde::de::Error::custom)?;
let req =
Request::deserialize(Value::Object(obj)).map_err(serde::de::Error::custom)?;
Ok(BatchMessage::Single(req))
}
_ => Err(serde::de::Error::custom(
"expected JSON object or array",
)),
_ => Err(serde::de::Error::custom("expected JSON object or array")),
}
}
}
@ -586,7 +583,11 @@ mod tests {
fn test_batch_round_trip() {
let reqs = vec![
Request::new("tools/list", None, Some(Id::Number(1))),
Request::new("tools/call", Some(Value::Object(serde_json::Map::new())), Some(Id::Number(2))),
Request::new(
"tools/call",
Some(Value::Object(serde_json::Map::new())),
Some(Id::Number(2)),
),
Request::new("prompts/list", None, Some(Id::String("abc".to_string()))),
];
let batch = BatchMessage::Batch(reqs.clone());

@ -24,7 +24,6 @@
use crate::mcp::framing::{BatchMessage, ErrorObject, Id, Notification, Request, Response};
use crate::mcp::tools;
use anyhow::{anyhow, Context, Result};
use subtle::ConstantTimeEq;
use axum::{
body::Body,
extract::{DefaultBodyLimit, Request as AxumRequest, State},
@ -40,6 +39,7 @@ use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::time::Duration;
use subtle::ConstantTimeEq;
use tokio::sync::broadcast;
/// Default maximum request body size (256 MB)
@ -75,7 +75,11 @@ pub struct McpServerState {
impl McpServerState {
/// Create a new MCP server state.
pub fn new(auth_token: Option<SecretString>, max_upload_mb: Option<usize>, root: Option<PathBuf>) -> Self {
pub fn new(
auth_token: Option<SecretString>,
max_upload_mb: Option<usize>,
root: Option<PathBuf>,
) -> Self {
let max_body_bytes = max_upload_mb.unwrap_or(DEFAULT_MAX_UPLOAD_MB) * 1024 * 1024;
let notify_tx = broadcast::channel(100).0; // Channel size 100 for buffered notifications
@ -96,7 +100,9 @@ impl McpServerState {
pub fn broadcast_notification(&self, notification: Notification) -> usize {
// recv_count is the number of receivers that got the message
// (before it was dropped due to channel overflow or lag)
self.notify_tx.send(notification).map_or(0, |recv_count| recv_count)
self.notify_tx
.send(notification)
.map_or(0, |recv_count| recv_count)
}
/// Get the current number of active SSE clients.
@ -162,9 +168,7 @@ pub async fn run_server(
eprintln!();
// Run the server
axum::serve(listener, app)
.await
.context("Server error")?;
axum::serve(listener, app).await.context("Server error")?;
Ok(())
}
@ -199,16 +203,12 @@ async fn handle_post_request(
}
// Parse the request body as either a single Request or a Batch
let batch_result: std::result::Result<BatchMessage, _> =
serde_json::from_str(&body);
let batch_result: std::result::Result<BatchMessage, _> = serde_json::from_str(&body);
let batch = match batch_result {
Ok(batch) => batch,
Err(_) => {
return error_response(
StatusCode::BAD_REQUEST,
ErrorObject::invalid_request(),
);
return error_response(StatusCode::BAD_REQUEST, ErrorObject::invalid_request());
}
};
@ -237,10 +237,7 @@ async fn handle_post_request(
///
/// Returns a long-lived SSE connection that receives server notifications.
/// Sends a keepalive comment every 30 seconds.
async fn handle_sse(
State(state): State<McpServerState>,
headers: HeaderMap,
) -> AxumResponse {
async fn handle_sse(State(state): State<McpServerState>, headers: HeaderMap) -> AxumResponse {
// Check authentication first
match check_auth(&state, &headers) {
Ok(()) => {}
@ -257,7 +254,8 @@ async fn handle_sse(
"error": "Maximum concurrent clients exceeded",
"limit": MAX_SSE_CLIENTS,
})),
).into_response();
)
.into_response();
}
// Subscribe to the broadcast channel
@ -321,11 +319,13 @@ async fn handle_sse(
};
// Return SSE response with appropriate headers
Sse::new(stream).keep_alive(
axum::response::sse::KeepAlive::new()
.interval(Duration::from_secs(SSE_KEEPALIVE_SECS))
.text("keepalive"),
).into_response()
Sse::new(stream)
.keep_alive(
axum::response::sse::KeepAlive::new()
.interval(Duration::from_secs(SSE_KEEPALIVE_SECS))
.text("keepalive"),
)
.into_response()
}
/// GET /health handler - health check endpoint.
@ -393,9 +393,7 @@ fn check_auth(
headers: &HeaderMap,
) -> std::result::Result<(), AxumResponse> {
if let Some(token) = &state.auth_token {
let auth_header = headers
.get("Authorization")
.and_then(|v| v.to_str().ok());
let auth_header = headers.get("Authorization").and_then(|v| v.to_str().ok());
match auth_header {
Some(header) if header.starts_with("Bearer ") => {
@ -408,8 +406,12 @@ fn check_auth(
} else {
let mut response = (
StatusCode::UNAUTHORIZED,
Json(Response::error(Id::Null, ErrorObject::new(-32001, "Invalid authentication token"))),
).into_response();
Json(Response::error(
Id::Null,
ErrorObject::new(-32001, "Invalid authentication token"),
)),
)
.into_response();
response.headers_mut().insert(
"WWW-Authenticate",
HeaderValue::from_static("Bearer realm=\"pdftract\""),
@ -420,8 +422,12 @@ fn check_auth(
_ => {
let mut response = (
StatusCode::UNAUTHORIZED,
Json(Response::error(Id::Null, ErrorObject::new(-32001, "Missing authentication token"))),
).into_response();
Json(Response::error(
Id::Null,
ErrorObject::new(-32001, "Missing authentication token"),
)),
)
.into_response();
response.headers_mut().insert(
"WWW-Authenticate",
HeaderValue::from_static("Bearer realm=\"pdftract\""),
@ -435,7 +441,11 @@ fn check_auth(
}
/// Handle a single JSON-RPC request and return a response.
fn handle_request(request: Request, registry: &tools::ToolRegistry, root: Option<&std::path::Path>) -> Response {
fn handle_request(
request: Request,
registry: &tools::ToolRegistry,
root: Option<&std::path::Path>,
) -> Response {
let id = request.request_id();
match request.method.as_str() {
@ -463,20 +473,29 @@ fn handle_request(request: Request, registry: &tools::ToolRegistry, root: Option
let params = match request.params {
Some(p) => p,
None => {
return Response::error(id, ErrorObject::invalid_params()
.with_data(json!({"reason": "Missing params"})));
return Response::error(
id,
ErrorObject::invalid_params()
.with_data(json!({"reason": "Missing params"})),
);
}
};
let tool_name = match params.get("name").and_then(|v| v.as_str()) {
Some(name) => name,
None => {
return Response::error(id, ErrorObject::invalid_params()
.with_data(json!({"reason": "Missing or invalid 'name' field"})));
return Response::error(
id,
ErrorObject::invalid_params()
.with_data(json!({"reason": "Missing or invalid 'name' field"})),
);
}
};
let arguments = params.get("arguments").cloned().unwrap_or(Value::Object(serde_json::Map::new()));
let arguments = params
.get("arguments")
.cloned()
.unwrap_or(Value::Object(serde_json::Map::new()));
// Look up the tool in the registry
let tool = match registry.get(tool_name) {
@ -488,12 +507,17 @@ fn handle_request(request: Request, registry: &tools::ToolRegistry, root: Option
// Execute the tool with observability logging
let start = std::time::Instant::now();
let log_path = arguments.get("path").and_then(|v| v.as_str()).map(|s| s.to_string());
let log_path = arguments
.get("path")
.and_then(|v| v.as_str())
.map(|s| s.to_string());
let result = tool.execute(arguments, log_path.as_deref(), root);
let duration_ms = start.elapsed().as_millis();
let response_size = result.as_ref().ok()
let response_size = result
.as_ref()
.ok()
.map(|v| serde_json::to_vec(v).unwrap_or_default().len())
.unwrap_or(0);
@ -503,13 +527,9 @@ fn handle_request(request: Request, registry: &tools::ToolRegistry, root: Option
let path_or_hash = log_path.unwrap_or_else(|| "<unknown>".to_string());
let error_code = result.as_ref().err().map(|e| e.code.to_string());
eprintln!("{} tool={} path={} duration_ms={} response_size_bytes={} error_code={:?}",
timestamp,
tool_name,
path_or_hash,
duration_ms,
response_size,
error_code,
eprintln!(
"{} tool={} path={} duration_ms={} response_size_bytes={} error_code={:?}",
timestamp, tool_name, path_or_hash, duration_ms, response_size, error_code,
);
match result {
@ -647,7 +667,10 @@ mod tests {
// No token configured, so any headers should pass
assert!(check_auth(&state, &headers).is_ok());
headers.insert("Authorization", HeaderValue::from_static("Bearer irrelevant"));
headers.insert(
"Authorization",
HeaderValue::from_static("Bearer irrelevant"),
);
assert!(check_auth(&state, &headers).is_ok());
}
@ -657,7 +680,10 @@ mod tests {
let state = McpServerState::new(Some(token), None, None);
let mut headers = HeaderMap::new();
headers.insert("Authorization", HeaderValue::from_static("Bearer correct-token"));
headers.insert(
"Authorization",
HeaderValue::from_static("Bearer correct-token"),
);
assert!(check_auth(&state, &headers).is_ok());
}
@ -667,7 +693,10 @@ mod tests {
let state = McpServerState::new(Some(token), None, None);
let mut headers = HeaderMap::new();
headers.insert("Authorization", HeaderValue::from_static("Bearer wrong-token"));
headers.insert(
"Authorization",
HeaderValue::from_static("Bearer wrong-token"),
);
let result = check_auth(&state, &headers);
assert!(result.is_err());
if let Err(resp) = result {
@ -774,7 +803,10 @@ mod tests {
ratio <= 5,
"Token comparison appears to be non-constant-time: \
early mismatch={:?}, late mismatch={:?}, correct={:?}, ratio={}",
median_early, median_late, median_correct, ratio
median_early,
median_late,
median_correct,
ratio
);
// Also verify that the correct token actually returns true
@ -801,7 +833,10 @@ mod tests {
// Test 2: Token that is much longer
let mut headers_long = HeaderMap::new();
headers_long.insert("Authorization", HeaderValue::from_static("Bearer this-token-is-much-longer-than-the-correct-one"));
headers_long.insert(
"Authorization",
HeaderValue::from_static("Bearer this-token-is-much-longer-than-the-correct-one"),
);
let iterations = 1000;
let mut times_short = Vec::with_capacity(iterations);
@ -840,7 +875,9 @@ mod tests {
ratio <= 3,
"Token comparison appears to leak length information: \
short={:?}, long={:?}, ratio={}",
median_short, median_long, ratio
median_short,
median_long,
ratio
);
}
}
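Note on `check_auth` and the timing tests above: the file imports `subtle::ConstantTimeEq`, and the tests assert that early and late mismatches take similar time. The std-only sketch below illustrates the general idea of a comparison that does not stop at the first differing byte. It is not the `subtle` implementation, the names `ct_eq_sketch` and `bearer_token` are hypothetical, and an optimizing compiler gives no timing guarantee for hand-written code like this, so a vetted crate remains the better choice in real code.

```rust
/// Compare two byte strings without returning early at the first mismatch.
/// Illustrative only: unequal lengths are rejected up front, so the length
/// of the expected value is not hidden by this sketch.
fn ct_eq_sketch(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Accumulate differences across every byte pair instead of branching.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Extract the token from an `Authorization: Bearer <token>` header value.
fn bearer_token(header: &str) -> Option<&str> {
    header.strip_prefix("Bearer ")
}
```

A caller would combine the two, for example `bearer_token(h).map_or(false, |t| ct_eq_sketch(t.as_bytes(), expected))`, where `h` and `expected` stand in for the header value and the configured secret.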

@ -51,7 +51,10 @@ pub fn resolve_path(arg: &str, root: Option<&Path>) -> Result<PathBuf, ErrorObje
// Reject absolute paths when --root is set
if arg.starts_with('/') || Path::new(arg).is_absolute() {
return Err(ErrorObject::invalid_params()
.with_message(format!("absolute paths not permitted under --root: '{}'", arg))
.with_message(format!(
"absolute paths not permitted under --root: '{}'",
arg
))
.with_data(json!({ "code": CODE_ABSOLUTE_PATH_NOT_PERMITTED, "path": arg })));
}
@ -62,7 +65,9 @@ pub fn resolve_path(arg: &str, root: Option<&Path>) -> Result<PathBuf, ErrorObje
let canonical = std::fs::canonicalize(&candidate).map_err(|e| {
ErrorObject::invalid_params()
.with_message(format!("path resolution failed: {}", e))
.with_data(json!({ "code": CODE_PATH_RESOLUTION_FAILED, "path": arg, "error": e.to_string() }))
.with_data(
json!({ "code": CODE_PATH_RESOLUTION_FAILED, "path": arg, "error": e.to_string() }),
)
})?;
// Reject if canonical is not a descendant of root
@ -90,12 +95,19 @@ pub fn resolve_path(arg: &str, root: Option<&Path>) -> Result<PathBuf, ErrorObje
/// * `Err(String)` - Error message if root is invalid
pub fn canonicalize_root(root_arg: &Path) -> Result<PathBuf, String> {
// Canonicalize the root path (follows symlinks, resolves relative components)
let canonical = std::fs::canonicalize(root_arg)
.map_err(|e| format!("--root path does not exist or cannot be canonicalized: {}", e))?;
let canonical = std::fs::canonicalize(root_arg).map_err(|e| {
format!(
"--root path does not exist or cannot be canonicalized: {}",
e
)
})?;
// Verify it's a directory
if !canonical.is_dir() {
return Err(format!("--root must be a directory, not a file: {}", canonical.display()));
return Err(format!(
"--root must be a directory, not a file: {}",
canonical.display()
));
}
Ok(canonical)
@ -112,18 +124,27 @@ mod tests {
fn test_https_url_bypasses_check() {
let result = resolve_path("https://example.com/file.pdf", None);
assert!(result.is_ok());
assert_eq!(result.unwrap(), PathBuf::from("https://example.com/file.pdf"));
assert_eq!(
result.unwrap(),
PathBuf::from("https://example.com/file.pdf")
);
let result = resolve_path("https://example.com/file.pdf", Some(Path::new("/tmp")));
assert!(result.is_ok());
assert_eq!(result.unwrap(), PathBuf::from("https://example.com/file.pdf"));
assert_eq!(
result.unwrap(),
PathBuf::from("https://example.com/file.pdf")
);
}
#[test]
fn test_http_url_bypasses_check() {
let result = resolve_path("http://example.com/file.pdf", None);
assert!(result.is_ok());
assert_eq!(result.unwrap(), PathBuf::from("http://example.com/file.pdf"));
assert_eq!(
result.unwrap(),
PathBuf::from("http://example.com/file.pdf")
);
}
#[test]
@ -195,7 +216,11 @@ mod tests {
#[cfg(windows)]
{
std::os::windows::fs::symlink_file(r"C:\Windows\System32\drivers\etc\hosts", &symlink_path).unwrap();
std::os::windows::fs::symlink_file(
r"C:\Windows\System32\drivers\etc\hosts",
&symlink_path,
)
.unwrap();
}
// Try to access the symlink
@ -264,12 +289,18 @@ mod tests {
let result = resolve_path("/etc/passwd", Some(root));
let err = result.unwrap_err();
let data = err.data.unwrap();
assert_eq!(data.get("code").unwrap().as_str(), Some(CODE_ABSOLUTE_PATH_NOT_PERMITTED));
assert_eq!(
data.get("code").unwrap().as_str(),
Some(CODE_ABSOLUTE_PATH_NOT_PERMITTED)
);
// Test traversal error
let result = resolve_path("../../../etc/passwd", Some(root));
let err = result.unwrap_err();
let data = err.data.unwrap();
assert_eq!(data.get("code").unwrap().as_str(), Some(CODE_PATH_ESCAPES_ROOT));
assert_eq!(
data.get("code").unwrap().as_str(),
Some(CODE_PATH_ESCAPES_ROOT)
);
}
}
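The tests in this hunk exercise three behaviours of `resolve_path`: URLs bypass the root check, absolute paths are rejected when a root is set, and `..` traversal out of the root is rejected. A minimal sketch of that decision logic follows. It is not the repository's implementation: `resolve_under_root` and `PathError` are hypothetical names, and the check here is purely lexical, whereas the real code canonicalizes paths (which is what lets it catch the symlink case tested above).

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical error type standing in for the JSON-RPC error codes in the diff.
#[derive(Debug, PartialEq)]
enum PathError {
    AbsoluteNotPermitted,
    EscapesRoot,
}

/// Lexically resolve `input` under `root` (assumed sketch, no filesystem access).
fn resolve_under_root(input: &str, root: Option<&Path>) -> Result<PathBuf, PathError> {
    // URLs are passed through untouched, as the URL bypass tests expect.
    if input.starts_with("http://") || input.starts_with("https://") {
        return Ok(PathBuf::from(input));
    }
    // Without a root there is nothing to enforce.
    let root = match root {
        Some(r) => r,
        None => return Ok(PathBuf::from(input)),
    };
    let mut resolved = root.to_path_buf();
    let mut depth = 0usize; // number of components currently below the root
    for component in Path::new(input).components() {
        match component {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                // Climbing above the root is a traversal attempt.
                if depth == 0 {
                    return Err(PathError::EscapesRoot);
                }
                resolved.pop();
                depth -= 1;
            }
            // A leading "/" or a drive prefix means the path is absolute.
            Component::RootDir | Component::Prefix(_) => {
                return Err(PathError::AbsoluteNotPermitted);
            }
        }
    }
    Ok(resolved)
}
```

A lexical check like this cannot see symlinks, so it should be read as an illustration of the error cases only; canonicalizing and then comparing against the canonical root is the stronger approach.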


@ -70,8 +70,7 @@ pub fn run(
}
// Start the HTTP+SSE server (this blocks until shutdown)
let runtime = tokio::runtime::Runtime::new()
.context("Failed to create tokio runtime")?;
let runtime = tokio::runtime::Runtime::new().context("Failed to create tokio runtime")?;
runtime.block_on(http::run_server(
bind_addr,


@ -61,8 +61,7 @@ fn init_stdout() {
/// CRITICAL: The JSON body is written WITHOUT a trailing newline.
/// Adding any extra bytes after the JSON body breaks the framing.
fn write_response(response: &Response) -> Result<()> {
let json = serde_json::to_string(response)
.context("Failed to serialize response")?;
let json = serde_json::to_string(response).context("Failed to serialize response")?;
let content_length = json.len();
@ -86,8 +85,7 @@ fn write_response(response: &Response) -> Result<()> {
write!(stdout, "{json}")?;
// Flush immediately to ensure the client receives the response
stdout.flush()
.context("Failed to flush stdout")?;
stdout.flush().context("Failed to flush stdout")?;
Ok(())
}
@ -190,7 +188,8 @@ fn read_message(stdin: &mut BufReader<Stdin>) -> Result<Option<Request>> {
// Read headers until empty line
loop {
let mut line = String::new();
let bytes_read = stdin.read_line(&mut line)
let bytes_read = stdin
.read_line(&mut line)
.context("Failed to read header line")?;
if bytes_read == 0 {
@ -208,14 +207,16 @@ fn read_message(stdin: &mut BufReader<Stdin>) -> Result<Option<Request>> {
// Parse Content-Length header
if let Some(value) = line.strip_prefix("Content-Length:") {
let value = value.trim();
content_length = Some(value.parse::<usize>()
.with_context(|| format!("Invalid Content-Length: {value}"))?);
content_length = Some(
value
.parse::<usize>()
.with_context(|| format!("Invalid Content-Length: {value}"))?,
);
}
// Ignore other headers (we don't need Content-Type for now)
}
let content_length = content_length
.ok_or_else(|| anyhow!("Missing Content-Length header"))?;
let content_length = content_length.ok_or_else(|| anyhow!("Missing Content-Length header"))?;
// Read exactly content_length bytes
let mut buffer = vec![0u8; content_length];
@ -236,8 +237,8 @@ fn read_message(stdin: &mut BufReader<Stdin>) -> Result<Option<Request>> {
}
// Parse as JSON-RPC BatchMessage (handles both single requests and batches)
let batch: BatchMessage = serde_json::from_slice(&buffer)
.context("Failed to parse JSON-RPC request")?;
let batch: BatchMessage =
serde_json::from_slice(&buffer).context("Failed to parse JSON-RPC request")?;
// Extract the single request from the batch
// For now, we only support single requests (not batches)
@ -256,7 +257,11 @@ fn read_message(stdin: &mut BufReader<Stdin>) -> Result<Option<Request>> {
}
/// Handle a JSON-RPC request and return a response.
fn handle_request(request: Request, registry: &tools::ToolRegistry, root: Option<&Path>) -> Response {
fn handle_request(
request: Request,
registry: &tools::ToolRegistry,
root: Option<&Path>,
) -> Response {
let id = request.request_id();
match request.method.as_str() {
@ -284,16 +289,22 @@ fn handle_request(request: Request, registry: &tools::ToolRegistry, root: Option
let params = match request.params {
Some(p) => p,
None => {
return Response::error(id, ErrorObject::invalid_params()
.with_data(json!({"reason": "Missing params"})));
return Response::error(
id,
ErrorObject::invalid_params()
.with_data(json!({"reason": "Missing params"})),
);
}
};
let tool_name = match params.get("name").and_then(|v| v.as_str()) {
Some(name) => name,
None => {
return Response::error(id, ErrorObject::invalid_params()
.with_data(json!({"reason": "Missing or invalid 'name' field"})));
return Response::error(
id,
ErrorObject::invalid_params()
.with_data(json!({"reason": "Missing or invalid 'name' field"})),
);
}
};
@ -309,12 +320,17 @@ fn handle_request(request: Request, registry: &tools::ToolRegistry, root: Option
// Execute the tool with observability logging
let start = Instant::now();
let log_path = arguments.get("path").and_then(|v| v.as_str()).map(|s| s.to_string());
let log_path = arguments
.get("path")
.and_then(|v| v.as_str())
.map(|s| s.to_string());
let result = tool.execute(arguments, log_path.as_deref(), root);
let duration_ms = start.elapsed().as_millis();
let response_size = result.as_ref().ok()
let response_size = result
.as_ref()
.ok()
.map(|v| serde_json::to_vec(v).unwrap_or_default().len())
.unwrap_or(0);
@ -323,13 +339,9 @@ fn handle_request(request: Request, registry: &tools::ToolRegistry, root: Option
let path_or_hash = log_path.as_deref().unwrap_or("<unknown>");
let error_code = result.as_ref().err().map(|e| e.code.to_string());
eprintln!("{} tool={} path={} duration_ms={} response_size_bytes={} error_code={:?}",
timestamp,
tool_name,
path_or_hash,
duration_ms,
response_size,
error_code,
eprintln!(
"{} tool={} path={} duration_ms={} response_size_bytes={} error_code={:?}",
timestamp, tool_name, path_or_hash, duration_ms, response_size, error_code,
);
match result {
@ -388,7 +400,13 @@ pub fn run(root: Option<&Path>) -> Result<()> {
eprintln!("pdftract MCP server (stdio mode) starting...");
eprintln!("Version: {}", env!("CARGO_PKG_VERSION"));
eprintln!("Protocol: JSON-RPC 2.0 over stdio");
eprintln!("Tools: {}", registry.tools_list()["tools"].as_array().map(|v| v.len()).unwrap_or(0));
eprintln!(
"Tools: {}",
registry.tools_list()["tools"]
.as_array()
.map(|v| v.len())
.unwrap_or(0)
);
if root.is_some() {
eprintln!("Path-traversal protection: enabled");
} else {
@ -422,10 +440,7 @@ pub fn run(root: Option<&Path>) -> Result<()> {
// Parse error - send error response and continue
eprintln!("Parse error: {}", e);
let error_response = Response::error(
Id::Null,
ErrorObject::parse_error(),
);
let error_response = Response::error(Id::Null, ErrorObject::parse_error());
if let Err(write_err) = write_response(&error_response) {
eprintln!("Failed to write error response: {}", write_err);
@ -444,7 +459,8 @@ pub fn run(root: Option<&Path>) -> Result<()> {
// Flush stdout before exit
if let Some(mut stdout) = STDOUT.lock().unwrap().take() {
stdout.flush()
stdout
.flush()
.context("Failed to flush stdout on shutdown")?;
}
@ -462,10 +478,7 @@ mod tests {
fn test_write_response_framing() {
init_stdout();
let response = Response::success(
Id::Number(1),
serde_json::json!({"result": "ok"}),
);
let response = Response::success(Id::Number(1), serde_json::json!({"result": "ok"}));
// This should succeed (stdout is initialized)
// We can't easily test the actual output without capturing stdout,
@ -481,11 +494,7 @@ mod tests {
#[test]
fn test_handle_unknown_method() {
let registry = tools::all_tools();
let request = Request::new(
"unknown/method",
None,
Some(Id::Number(1)),
);
let request = Request::new("unknown/method", None, Some(Id::Number(1)));
let response = handle_request(request, &registry, None);
@ -497,11 +506,7 @@ mod tests {
#[test]
fn test_handle_tools_list() {
let registry = tools::all_tools();
let request = Request::new(
"tools/list",
None,
Some(Id::Number(1)),
);
let request = Request::new("tools/list", None, Some(Id::Number(1)));
let response = handle_request(request, &registry, None);
@ -512,11 +517,7 @@ mod tests {
/// Test that notifications (no id) return Id::Null.
#[test]
fn test_request_id_notification() {
let request = Request::new(
"notifications/message",
None,
None,
);
let request = Request::new("notifications/message", None, None);
assert_eq!(request.request_id(), Id::Null);
}
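The stdio transport in this file frames each message with a `Content-Length` header: `read_message` reads header lines until an empty line, parses `Content-Length`, then reads exactly that many bytes, and `write_response` writes the JSON body with no trailing newline. The sketch below illustrates that framing with the standard library only. The names `write_frame` and `read_frame` are hypothetical, and the `\r\n\r\n` header terminator is an assumption (LSP-style framing); the diff does not show the exact separator the project writes.

```rust
use std::io::{self, BufRead, Read, Write};

/// Write one framed message: header, blank line, then the body.
/// No bytes are written after the body, since extra bytes would break the framing.
fn write_frame<W: Write>(out: &mut W, body: &str) -> io::Result<()> {
    // Assumption: headers are terminated by "\r\n\r\n".
    write!(out, "Content-Length: {}\r\n\r\n", body.len())?;
    out.write_all(body.as_bytes())?;
    out.flush()
}

/// Read one framed message. Returns Ok(None) on EOF before any header is seen.
fn read_frame<R: BufRead>(input: &mut R) -> io::Result<Option<String>> {
    let mut content_length: Option<usize> = None;
    loop {
        let mut line = String::new();
        if input.read_line(&mut line)? == 0 {
            // EOF: clean if nothing was parsed yet, otherwise the frame is truncated.
            return match content_length {
                None => Ok(None),
                Some(_) => Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "EOF inside headers",
                )),
            };
        }
        let line = line.trim_end_matches(|c| c == '\r' || c == '\n');
        if line.is_empty() {
            break; // blank line ends the header section
        }
        if let Some(value) = line.strip_prefix("Content-Length:") {
            let parsed = value
                .trim()
                .parse::<usize>()
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            content_length = Some(parsed);
        }
        // Other headers are ignored, as in the original.
    }
    let len = content_length.ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "missing Content-Length header")
    })?;
    // Read exactly `len` bytes of body.
    let mut buffer = vec![0u8; len];
    input.read_exact(&mut buffer)?;
    String::from_utf8(buffer)
        .map(Some)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Making both functions generic over `Write` and `BufRead` lets the framing be exercised against in-memory buffers, which sidesteps the difficulty noted in `test_write_response_framing` of capturing real stdout.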


@ -5,10 +5,10 @@
//! argument schema (JSON Schema via schemars), structured error mapping, and
//! per-invocation observability.
mod registry;
mod args;
mod registry;
pub use registry::{Tool, ToolRegistry, ToolResult, all_tools};
pub use registry::{all_tools, Tool, ToolRegistry, ToolResult};
// Error codes for pdftract-specific errors (-32099..-32000)
pub const ERROR_NOT_YET_IMPLEMENTED: i64 = -32000;


@ -5,14 +5,20 @@
//! provides the tools/list response.
use super::args::*;
use super::{ERROR_NOT_YET_IMPLEMENTED, ERROR_IO_ERROR, ERROR_PATH_INVALID, CODE_IO_ERROR, CODE_PATH_INVALID};
use super::{
CODE_IO_ERROR, CODE_PATH_INVALID, ERROR_IO_ERROR, ERROR_NOT_YET_IMPLEMENTED, ERROR_PATH_INVALID,
};
use crate::mcp::framing::ErrorObject;
use crate::mcp::root::resolve_path;
use pdftract_core::{
parser::{self, catalog, pages, stream::{MemorySource, PdfSource}, xref},
diagnostics::DiagCode,
options::{ExtractionOptions, ReceiptsMode},
extract::{extract_pdf, result_to_json},
options::{ExtractionOptions, ReceiptsMode},
parser::{
self, catalog, pages,
stream::{MemorySource, PdfSource},
xref,
},
};
use regex::Regex;
use serde_json::{json, to_value, Value};
@ -153,19 +159,19 @@ fn find_startxref_offset(data: &[u8]) -> Result<u64, ErrorObject> {
return Err(ErrorObject::server_error(
super::ERROR_IO_ERROR,
"Invalid startxref offset in PDF",
).with_data(json!({"code": super::CODE_IO_ERROR})));
)
.with_data(json!({"code": super::CODE_IO_ERROR})));
}
let offset_str = std::str::from_utf8(&data[offset_start..offset_end])
.map_err(|_| ErrorObject::server_error(
super::ERROR_IO_ERROR,
"Invalid UTF-8 in startxref offset",
).with_data(json!({"code": super::CODE_IO_ERROR})))?;
let offset_str = std::str::from_utf8(&data[offset_start..offset_end]).map_err(|_| {
ErrorObject::server_error(super::ERROR_IO_ERROR, "Invalid UTF-8 in startxref offset")
.with_data(json!({"code": super::CODE_IO_ERROR}))
})?;
let offset: u64 = offset_str.parse().map_err(|_| ErrorObject::server_error(
super::ERROR_IO_ERROR,
"Failed to parse startxref offset",
).with_data(json!({"code": super::CODE_IO_ERROR})))?;
let offset: u64 = offset_str.parse().map_err(|_| {
ErrorObject::server_error(super::ERROR_IO_ERROR, "Failed to parse startxref offset")
.with_data(json!({"code": super::CODE_IO_ERROR}))
})?;
Ok(offset)
} else {
@ -200,24 +206,26 @@ struct PdfContext {
/// * `path` - The path argument (may be a URL or local path)
/// * `password` - Optional PDF password
/// * `root` - Optional root directory for path-traversal protection
fn open_pdf(path: &str, password: Option<&str>, root: Option<&Path>) -> Result<PdfContext, ErrorObject> {
fn open_pdf(
path: &str,
password: Option<&str>,
root: Option<&Path>,
) -> Result<PdfContext, ErrorObject> {
// Validate and resolve the path using the root if set
let path_buf = resolve_path(path, root)?;
// Check if it's a file (not a directory)
if !path_buf.is_file() {
return Err(ErrorObject::server_error(
ERROR_PATH_INVALID,
format!("Not a file: {}", path),
).with_data(json!({"code": CODE_PATH_INVALID, "path": path})));
return Err(
ErrorObject::server_error(ERROR_PATH_INVALID, format!("Not a file: {}", path))
.with_data(json!({"code": CODE_PATH_INVALID, "path": path})),
);
}
// Read the PDF file
let buffer = fs::read(&path_buf).map_err(|e| {
ErrorObject::server_error(
ERROR_IO_ERROR,
format!("Failed to read PDF file: {}", e),
).with_data(json!({"code": CODE_IO_ERROR, "path": path}))
ErrorObject::server_error(ERROR_IO_ERROR, format!("Failed to read PDF file: {}", e))
.with_data(json!({"code": CODE_IO_ERROR, "path": path}))
})?;
// Check for PDF magic number
@ -225,7 +233,8 @@ fn open_pdf(path: &str, password: Option<&str>, root: Option<&Path>) -> Result<P
return Err(ErrorObject::server_error(
ERROR_IO_ERROR,
"Not a valid PDF file (missing %PDF- header)",
).with_data(json!({"code": CODE_IO_ERROR, "path": path})));
)
.with_data(json!({"code": CODE_IO_ERROR, "path": path})));
}
// Create a MemorySource for parsing
@ -240,7 +249,8 @@ fn open_pdf(path: &str, password: Option<&str>, root: Option<&Path>) -> Result<P
return Err(ErrorObject::server_error(
super::ERROR_PDF_ENCRYPTED,
"PDF is encrypted and no password was provided",
).with_data(json!({"code": super::CODE_PDF_ENCRYPTED})));
)
.with_data(json!({"code": super::CODE_PDF_ENCRYPTED})));
}
}
@ -250,18 +260,19 @@ fn open_pdf(path: &str, password: Option<&str>, root: Option<&Path>) -> Result<P
return Err(ErrorObject::server_error(
super::ERROR_PDF_ENCRYPTED,
"PDF is encrypted and no password was provided",
).with_data(json!({"code": super::CODE_PDF_ENCRYPTED})));
)
.with_data(json!({"code": super::CODE_PDF_ENCRYPTED})));
}
}
// Get the root reference from the trailer
let root_ref = xref_section.trailer.as_ref()
let root_ref = xref_section
.trailer
.as_ref()
.and_then(|trailer| trailer.get("Root"))
.and_then(|obj| {
match obj {
pdftract_core::parser::object::PdfObject::Ref(obj_ref) => Some(obj_ref),
_ => None,
}
.and_then(|obj| match obj {
pdftract_core::parser::object::PdfObject::Ref(obj_ref) => Some(obj_ref),
_ => None,
});
let (catalog, page_count) = match root_ref {
@ -283,11 +294,15 @@ fn open_pdf(path: &str, password: Option<&str>, root: Option<&Path>) -> Result<P
}
Err(diags) => {
// Check for encryption errors
if diags.iter().any(|d| d.code == DiagCode::EncryptionUnsupported) {
if diags
.iter()
.any(|d| d.code == DiagCode::EncryptionUnsupported)
{
return Err(ErrorObject::server_error(
super::ERROR_PDF_ENCRYPTED,
"PDF is encrypted and no password was provided",
).with_data(json!({"code": super::CODE_PDF_ENCRYPTED})));
)
.with_data(json!({"code": super::CODE_PDF_ENCRYPTED})));
}
// Catalog parsing failed - return partial context
(None, None)
@ -345,7 +360,10 @@ fn build_extraction_options(
/// Create a stub response for tools that require Phase 6 extraction surface.
fn stub_extraction_response(path: &str, tool_name: &str, page_count: Option<usize>) -> Value {
let mut response = serde_json::Map::new();
response.insert("_note".to_string(), json!("This tool requires Phase 6 extraction surface"));
response.insert(
"_note".to_string(),
json!("This tool requires Phase 6 extraction surface"),
);
response.insert("_tool".to_string(), json!(tool_name));
response.insert("_path".to_string(), json!(path));
@ -396,8 +414,8 @@ impl Tool for ExtractTool {
fn execute(&self, args: Value, _log_path: Option<&str>, root: Option<&Path>) -> ToolResult {
// Parse arguments
let tool_args: ExtractArgs = serde_json::from_value(args)
.map_err(|_| ErrorObject::invalid_params())?;
let tool_args: ExtractArgs =
serde_json::from_value(args).map_err(|_| ErrorObject::invalid_params())?;
// Check if path is a URL
if is_url(&tool_args.path) {
@ -414,14 +432,17 @@ impl Tool for ExtractTool {
let path_buf = resolve_path(&tool_args.path, root)?;
// Build extraction options
let options = build_extraction_options(&tool_args.pages, &tool_args.ocr, tool_args.receipts.as_deref());
let options = build_extraction_options(
&tool_args.pages,
&tool_args.ocr,
tool_args.receipts.as_deref(),
);
// Perform the extraction
let result = extract_pdf(&path_buf, &options)
.map_err(|e| ErrorObject::server_error(
super::ERROR_IO_ERROR,
format!("Extraction failed: {}", e),
).with_data(json!({"code": super::CODE_IO_ERROR})))?;
let result = extract_pdf(&path_buf, &options).map_err(|e| {
ErrorObject::server_error(super::ERROR_IO_ERROR, format!("Extraction failed: {}", e))
.with_data(json!({"code": super::CODE_IO_ERROR}))
})?;
Ok(result_to_json(&result))
}
@ -444,8 +465,8 @@ impl Tool for ExtractTextTool {
}
fn execute(&self, args: Value, _log_path: Option<&str>, root: Option<&Path>) -> ToolResult {
let tool_args: ExtractTextArgs = serde_json::from_value(args)
.map_err(|_| ErrorObject::invalid_params())?;
let tool_args: ExtractTextArgs =
serde_json::from_value(args).map_err(|_| ErrorObject::invalid_params())?;
if is_url(&tool_args.path) {
return Ok(json!({
@ -460,17 +481,22 @@ impl Tool for ExtractTextTool {
let path_buf = resolve_path(&tool_args.path, root)?;
// Build extraction options
let options = build_extraction_options(&tool_args.pages, &tool_args.ocr, tool_args.receipts.as_deref());
let options = build_extraction_options(
&tool_args.pages,
&tool_args.ocr,
tool_args.receipts.as_deref(),
);
// Perform the extraction
let result = extract_pdf(&path_buf, &options)
.map_err(|e| ErrorObject::server_error(
super::ERROR_IO_ERROR,
format!("Extraction failed: {}", e),
).with_data(json!({"code": super::CODE_IO_ERROR})))?;
let result = extract_pdf(&path_buf, &options).map_err(|e| {
ErrorObject::server_error(super::ERROR_IO_ERROR, format!("Extraction failed: {}", e))
.with_data(json!({"code": super::CODE_IO_ERROR}))
})?;
// Convert to plain text
let text = result.pages.iter()
let text = result
.pages
.iter()
.flat_map(|page| page.spans.iter().map(|span| span.text.as_str()))
.collect::<Vec<&str>>()
.join("\n");
@ -496,8 +522,8 @@ impl Tool for ExtractMarkdownTool {
}
fn execute(&self, args: Value, _log_path: Option<&str>, root: Option<&Path>) -> ToolResult {
let tool_args: ExtractMarkdownArgs = serde_json::from_value(args)
.map_err(|_| ErrorObject::invalid_params())?;
let tool_args: ExtractMarkdownArgs =
serde_json::from_value(args).map_err(|_| ErrorObject::invalid_params())?;
if is_url(&tool_args.path) {
return Ok(json!({
@ -512,19 +538,24 @@ impl Tool for ExtractMarkdownTool {
let path_buf = resolve_path(&tool_args.path, root)?;
// Build extraction options
let options = build_extraction_options(&tool_args.pages, &tool_args.ocr, tool_args.receipts.as_deref());
let options = build_extraction_options(
&tool_args.pages,
&tool_args.ocr,
tool_args.receipts.as_deref(),
);
// Perform the extraction
let result = extract_pdf(&path_buf, &options)
.map_err(|e| ErrorObject::server_error(
super::ERROR_IO_ERROR,
format!("Extraction failed: {}", e),
).with_data(json!({"code": super::CODE_IO_ERROR})))?;
let result = extract_pdf(&path_buf, &options).map_err(|e| {
ErrorObject::server_error(super::ERROR_IO_ERROR, format!("Extraction failed: {}", e))
.with_data(json!({"code": super::CODE_IO_ERROR}))
})?;
// Convert to markdown
let markdown = result.pages.iter()
.flat_map(|page| page.blocks.iter().map(|block| {
match block.kind.as_str() {
let markdown = result
.pages
.iter()
.flat_map(|page| {
page.blocks.iter().map(|block| match block.kind.as_str() {
"heading" => {
let level = block.level.unwrap_or(1);
let prefix = "#".repeat(level as usize);
@ -532,8 +563,8 @@ impl Tool for ExtractMarkdownTool {
}
"paragraph" => format!("{}\n", block.text),
_ => format!("{}\n", block.text),
}
}))
})
})
.collect::<Vec<String>>()
.join("\n");
@ -558,8 +589,8 @@ impl Tool for SearchTool {
}
fn execute(&self, args: Value, _log_path: Option<&str>, root: Option<&Path>) -> ToolResult {
let tool_args: SearchArgs = serde_json::from_value(args)
.map_err(|_| ErrorObject::invalid_params())?;
let tool_args: SearchArgs =
serde_json::from_value(args).map_err(|_| ErrorObject::invalid_params())?;
// Validate the regex pattern
let _regex = Regex::new(&tool_args.pattern).map_err(|e| {
@ -603,8 +634,8 @@ impl Tool for GetMetadataTool {
}
fn execute(&self, args: Value, _log_path: Option<&str>, root: Option<&Path>) -> ToolResult {
let tool_args: GetMetadataArgs = serde_json::from_value(args)
.map_err(|_| ErrorObject::invalid_params())?;
let tool_args: GetMetadataArgs =
serde_json::from_value(args).map_err(|_| ErrorObject::invalid_params())?;
// Check if path is a URL
if is_url(&tool_args.path) {
@ -657,14 +688,18 @@ fn extract_metadata(path: &str, _password: Option<&str>, root: Option<&Path>) ->
// Fingerprint - compute a simple one based on file size and page count
// Full fingerprint computation would use the Phase 1.7 algorithm
let fingerprint = format!("pdftract-v1:{:064x}",
let fingerprint = format!(
"pdftract-v1:{:064x}",
sha2::Sha256::digest(
format!("{}:{}:{}",
format!(
"{}:{}:{}",
ctx.source.len().unwrap_or(0),
ctx.page_count.unwrap_or(0),
catalog.pages_ref.object
).as_bytes()
));
)
.as_bytes()
)
);
Ok(json!({
"metadata": metadata,
@ -673,13 +708,17 @@ fn extract_metadata(path: &str, _password: Option<&str>, root: Option<&Path>) ->
}))
} else {
// Catalog not available, return partial metadata
let fingerprint = format!("pdftract-v1:{:064x}",
let fingerprint = format!(
"pdftract-v1:{:064x}",
sha2::Sha256::digest(
format!("{}:{}",
format!(
"{}:{}",
ctx.source.len().unwrap_or(0),
ctx.page_count.unwrap_or(0)
).as_bytes()
));
)
.as_bytes()
)
);
Ok(json!({
"metadata": metadata,
@ -706,8 +745,8 @@ impl Tool for HashTool {
}
fn execute(&self, args: Value, _log_path: Option<&str>, root: Option<&Path>) -> ToolResult {
let tool_args: HashArgs = serde_json::from_value(args)
.map_err(|_| ErrorObject::invalid_params())?;
let tool_args: HashArgs =
serde_json::from_value(args).map_err(|_| ErrorObject::invalid_params())?;
// Check if path is a URL
if is_url(&tool_args.path) {
@ -728,31 +767,43 @@ impl Tool for HashTool {
}
/// Compute the fingerprint of a PDF file.
fn compute_fingerprint(path: &str, _password: Option<&str>, root: Option<&Path>) -> Result<String, ErrorObject> {
fn compute_fingerprint(
path: &str,
_password: Option<&str>,
root: Option<&Path>,
) -> Result<String, ErrorObject> {
let ctx = open_pdf(path, _password, root)?;
// Compute a simplified fingerprint for now
// Full fingerprint computation would use the Phase 1.7 algorithm with
// content stream hashing, resource dict hashing, etc.
if let Some(catalog) = &ctx.catalog {
let fingerprint = format!("pdftract-v1:{:064x}",
let fingerprint = format!(
"pdftract-v1:{:064x}",
sha2::Sha256::digest(
format!("{}:{}:{}:{}",
format!(
"{}:{}:{}:{}",
ctx.source.len().unwrap_or(0),
ctx.page_count.unwrap_or(0),
catalog.pages_ref.object,
catalog.mark_info.is_tagged
).as_bytes()
));
)
.as_bytes()
)
);
Ok(fingerprint)
} else {
let fingerprint = format!("pdftract-v1:{:064x}",
let fingerprint = format!(
"pdftract-v1:{:064x}",
sha2::Sha256::digest(
format!("{}:{}",
format!(
"{}:{}",
ctx.source.len().unwrap_or(0),
ctx.page_count.unwrap_or(0)
).as_bytes()
));
)
.as_bytes()
)
);
Ok(fingerprint)
}
}
@ -1006,7 +1057,11 @@ mod tests {
// Test get_table
let tool = registry.get("get_table").unwrap();
let result = tool.execute(json!({"path": "test.pdf", "page": 0, "table_index": 0}), None, None);
let result = tool.execute(
json!({"path": "test.pdf", "page": 0, "table_index": 0}),
None,
None,
);
assert!(result.is_err());
let err = result.unwrap_err();
assert_eq!(err.code, ERROR_NOT_YET_IMPLEMENTED);
@ -1061,7 +1116,10 @@ mod tests {
// Create a JSON Schema validator
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "Extract tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"Extract tool schema should be valid JSON Schema"
);
}
#[test]
@ -1070,7 +1128,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "ExtractText tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"ExtractText tool schema should be valid JSON Schema"
);
}
#[test]
@ -1079,7 +1140,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "ExtractMarkdown tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"ExtractMarkdown tool schema should be valid JSON Schema"
);
}
#[test]
@ -1088,7 +1152,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "Search tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"Search tool schema should be valid JSON Schema"
);
}
#[test]
@ -1097,7 +1164,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "GetMetadata tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"GetMetadata tool schema should be valid JSON Schema"
);
}
#[test]
@ -1106,7 +1176,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "Hash tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"Hash tool schema should be valid JSON Schema"
);
}
#[test]
@ -1115,7 +1188,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "GetTable tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"GetTable tool schema should be valid JSON Schema"
);
}
#[test]
@ -1124,7 +1200,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "GetFormFields tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"GetFormFields tool schema should be valid JSON Schema"
);
}
#[test]
@ -1133,7 +1212,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "GetAttachments tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"GetAttachments tool schema should be valid JSON Schema"
);
}
#[test]
@ -1142,7 +1224,10 @@ mod tests {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(), "Classify tool schema should be valid JSON Schema");
assert!(
compilation_result.is_ok(),
"Classify tool schema should be valid JSON Schema"
);
}
#[test]
@ -1152,10 +1237,12 @@ mod tests {
for (_key, tool) in &registry.tools {
let schema = tool.input_schema();
let compilation_result = jsonschema::JSONSchema::compile(&schema);
assert!(compilation_result.is_ok(),
assert!(
compilation_result.is_ok(),
"Tool '{}' schema should be valid JSON Schema: {:?}",
tool.name(),
compilation_result.err());
compilation_result.err()
);
}
}


@ -105,7 +105,9 @@ fn read_password_from_stdin() -> Result<Option<secrecy::SecretString>> {
return Ok(None);
}
Ok(Some(secrecy::SecretString::new(password.to_string().into_boxed_str())))
Ok(Some(secrecy::SecretString::new(
password.to_string().into_boxed_str(),
)))
}
#[cfg(test)]
@ -153,7 +155,10 @@ mod tests {
fn test_resolve_password_empty_env_var() {
std::env::set_var(ENV_PASSWORD, "");
let result = resolve_password(false, None).unwrap();
assert!(result.is_none(), "Empty env var should be treated as no password");
assert!(
result.is_none(),
"Empty env var should be treated as no password"
);
std::env::remove_var(ENV_PASSWORD);
}


@ -25,9 +25,9 @@ use axum::{
routing::{get, post},
Router,
};
use pdftract_core::options::{ExtractionOptions, ReceiptsMode};
use pdftract_core::extract::{extract_pdf, result_to_json};
use pdftract_core::cache;
use pdftract_core::extract::{extract_pdf, result_to_json};
use pdftract_core::options::{ExtractionOptions, ReceiptsMode};
use serde::Deserialize;
use std::path::{Path, PathBuf};
use std::sync::Arc;
@ -145,17 +145,23 @@ pub async fn run(
.layer(RequestBodyLimitLayer::new(max_body_bytes))
.with_state(state);
let listener = tokio::net::TcpListener::bind(&bind_addr).await
let listener = tokio::net::TcpListener::bind(&bind_addr)
.await
.context(format!("Failed to bind to {}", bind_addr))?;
eprintln!("pdftract serve listening on http://{}", bind_addr);
if let Some(dir) = cache_dir_for_logging {
eprintln!("Cache enabled: {} (max {} bytes)", dir.display(), cache_size_bytes);
eprintln!(
"Cache enabled: {} (max {} bytes)",
dir.display(),
cache_size_bytes
);
} else {
eprintln!("Cache disabled");
}
axum::serve(listener, app).await
axum::serve(listener, app)
.await
.context("HTTP server error")?;
Ok(())
@ -199,8 +205,14 @@ async fn extract_handler(
let pdf_file_clone = pdf_file.clone();
let (result, cache_status, cache_age) = tokio::task::spawn_blocking(move || {
let cache_dir_ref = cache_dir.as_deref();
cache::extract_with_cache(&pdf_file_clone, &options, cache_dir_ref, cache_disabled, Some(cache_size_bytes))
.map_err(|e| AxumError::Extraction(format!("{:?}", e)))
cache::extract_with_cache(
&pdf_file_clone,
&options,
cache_dir_ref,
cache_disabled,
Some(cache_size_bytes),
)
.map_err(|e| AxumError::Extraction(format!("{:?}", e)))
})
.await
.map_err(|e| AxumError::Internal(format!("{:?}", e)))?
@ -216,7 +228,10 @@ async fn extract_handler(
let response = AxumResponse::builder()
.status(StatusCode::OK)
.header("Content-Type", "application/json")
.header("X-Pdftract-Cache", CacheStatus::from_string(&cache_status).header_value())
.header(
"X-Pdftract-Cache",
CacheStatus::from_string(&cache_status).header_value(),
)
.body(Body::from(serde_json::to_string(&json).unwrap()))
.map_err(|e| AxumError::Internal(format!("{:?}", e)))?;
@ -240,8 +255,14 @@ async fn extract_text_handler(
let (result, cache_status, _cache_age) = tokio::task::spawn_blocking(move || {
let cache_dir_ref = cache_dir.as_deref();
cache::extract_with_cache(&pdf_file, &options, cache_dir_ref, cache_disabled, Some(cache_size_bytes))
.map_err(|e| AxumError::Extraction(format!("{:?}", e)))
cache::extract_with_cache(
&pdf_file,
&options,
cache_dir_ref,
cache_disabled,
Some(cache_size_bytes),
)
.map_err(|e| AxumError::Extraction(format!("{:?}", e)))
})
.await
.map_err(|e| AxumError::Internal(format!("{:?}", e)))?
@ -257,7 +278,10 @@ async fn extract_text_handler(
let response = AxumResponse::builder()
.status(StatusCode::OK)
.header("X-Pdftract-Cache", CacheStatus::from_string(&cache_status).header_value())
.header(
"X-Pdftract-Cache",
CacheStatus::from_string(&cache_status).header_value(),
)
.body(Body::from(text))
.map_err(|e| AxumError::Internal(format!("{:?}", e)))?;
@ -281,8 +305,14 @@ async fn extract_stream_handler(
let (result, _cache_status, _cache_age) = tokio::task::spawn_blocking(move || {
let cache_dir_ref = cache_dir.as_deref();
cache::extract_with_cache(&pdf_file, &options, cache_dir_ref, cache_disabled, Some(cache_size_bytes))
.map_err(|e| AxumError::Extraction(format!("{:?}", e)))
cache::extract_with_cache(
&pdf_file,
&options,
cache_dir_ref,
cache_disabled,
Some(cache_size_bytes),
)
.map_err(|e| AxumError::Extraction(format!("{:?}", e)))
})
.await
.map_err(|e| AxumError::Internal(format!("{:?}", e)))?
@ -319,19 +349,24 @@ async fn receive_pdf(multipart: &mut Multipart) -> Result<(PathBuf, ExtractParam
full_render: false,
};
while let Some(field) = multipart.next_field().await
while let Some(field) = multipart
.next_field()
.await
.map_err(|e| AxumError::Internal(format!("{:?}", e)))?
{
let name = field.name().unwrap_or("").to_string();
if name == "file" || name == "pdf" {
let data = field.bytes().await
let data = field
.bytes()
.await
.map_err(|e| AxumError::Internal(format!("{:?}", e)))?;
// Create a temp file that will persist for the duration of the request
let temp_dir = std::env::temp_dir();
let temp_file = temp_dir.join(format!("pdftract-upload-{}.pdf", uuid::Uuid::new_v4()));
tokio::fs::write(&temp_file, &data).await
tokio::fs::write(&temp_file, &data)
.await
.map_err(|e| AxumError::Internal(format!("{:?}", e)))?;
pdf_path = Some(temp_file);
} else if name == "receipts" {
@ -352,7 +387,8 @@ async fn receive_pdf(multipart: &mut Multipart) -> Result<(PathBuf, ExtractParam
}
}
let pdf_path = pdf_path.ok_or_else(|| AxumError::BadRequest("No PDF file uploaded".to_string()))?;
let pdf_path =
pdf_path.ok_or_else(|| AxumError::BadRequest("No PDF file uploaded".to_string()))?;
Ok((pdf_path, params))
}
@ -378,7 +414,8 @@ fn build_options(params: &ExtractParams) -> Result<ExtractionOptions, AxumError>
if !has_full_render() {
return Err(AxumError::BadRequest(
"full_render requested but PDFium is not available at runtime. \
Ensure the PDFium native library is installed.".to_string()
Ensure the PDFium native library is installed."
.to_string(),
));
}
}


@ -6,11 +6,11 @@
use anyhow::{Context, Result};
use clap::Args;
use pdftract_core::document::{self, compute_pdf_fingerprint, extract_spans_from_page};
use pdftract_core::receipts::Receipt;
use pdftract_core::receipts::verifier::{exit_code, SpanData, VerificationResult};
use pdftract_core::receipts::Receipt;
use std::fs;
use std::path::PathBuf;
use std::io::{self, Read};
use std::path::PathBuf;
/// Verify a receipt against a PDF file.
#[derive(Args)]
@ -96,7 +96,10 @@ pub fn run_verify_receipt(cmd: VerifyReceiptCommand) -> Result<()> {
binary_version,
) {
eprintln!("Error: {}", e);
eprintln!("Install pdftract v{} to verify this receipt", receipt.extraction_version);
eprintln!(
"Install pdftract v{} to verify this receipt",
receipt.extraction_version
);
std::process::exit(exit_code::EXTRACTION_FAILED);
}
@ -130,18 +133,18 @@ pub fn run_verify_receipt(cmd: VerifyReceiptCommand) -> Result<()> {
Ok(spans) => spans,
Err(e) => {
if !cmd.json && !cmd.quiet {
eprintln!("Error: Failed to extract spans from page {}: {}", receipt.page_index, e);
eprintln!(
"Error: Failed to extract spans from page {}: {}",
receipt.page_index, e
);
}
std::process::exit(exit_code::EXTRACTION_FAILED);
}
};
// Step 5: Run verification protocol
let result = pdftract_core::receipts::verifier::verify_receipt(
&receipt,
&spans,
&actual_fingerprint,
);
let result =
pdftract_core::receipts::verifier::verify_receipt(&receipt, &spans, &actual_fingerprint);
// Step 6: Output result
output_result(&result, &receipt, &actual_fingerprint, &cmd);
@ -156,7 +159,8 @@ fn load_receipt(cmd: &VerifyReceiptCommand) -> Result<Receipt> {
inline.clone()
} else if cmd.stdin || cmd.receipt_path.to_string_lossy() == "-" {
let mut buffer = String::new();
io::stdin().read_to_string(&mut buffer)
io::stdin()
.read_to_string(&mut buffer)
.context("Failed to read receipt from stdin")?;
buffer
} else {
@ -164,8 +168,8 @@ fn load_receipt(cmd: &VerifyReceiptCommand) -> Result<Receipt> {
.with_context(|| format!("Failed to read receipt from {:?}", cmd.receipt_path))?
};
let receipt: Receipt = serde_json::from_str(&receipt_json)
.context("Failed to parse receipt JSON")?;
let receipt: Receipt =
serde_json::from_str(&receipt_json).context("Failed to parse receipt JSON")?;
Ok(receipt)
}
@ -179,7 +183,10 @@ fn output_result(
if cmd.json {
// JSON output
let output = match result {
VerificationResult::Ok { best_iou, actual_content_hash } => {
VerificationResult::Ok {
best_iou,
actual_content_hash,
} => {
let expected_hash = receipt.content_hash.clone();
VerificationJsonOutput {
status: "ok".to_string(),
@ -202,45 +209,47 @@ fn output_result(
error: Some(format!("Expected fingerprint {}, got {}", expected, actual)),
}
}
VerificationResult::BboxMismatch { best_iou, threshold } => {
VerificationJsonOutput {
status: "bbox_mismatch".to_string(),
pdf_fingerprint: actual_fingerprint.to_string(),
page_index: receipt.page_index,
best_iou: *best_iou,
expected_content_hash: None,
actual_content_hash: None,
error: Some(format!(
"No span meets IoU threshold {} (best IoU: {:.3})",
threshold, best_iou
)),
}
}
VerificationResult::BboxMismatch {
best_iou,
threshold,
} => VerificationJsonOutput {
status: "bbox_mismatch".to_string(),
pdf_fingerprint: actual_fingerprint.to_string(),
page_index: receipt.page_index,
best_iou: *best_iou,
expected_content_hash: None,
actual_content_hash: None,
error: Some(format!(
"No span meets IoU threshold {} (best IoU: {:.3})",
threshold, best_iou
)),
},
VerificationResult::ContentMismatch {
best_iou,
expected_hash,
actual_hash,
} => {
VerificationJsonOutput {
status: "content_mismatch".to_string(),
pdf_fingerprint: actual_fingerprint.to_string(),
page_index: receipt.page_index,
best_iou: *best_iou,
expected_content_hash: Some(expected_hash.clone()),
actual_content_hash: Some(actual_hash.clone()),
error: Some(format!(
"Content hash mismatch: expected {}, got {}",
expected_hash, actual_hash
)),
}
}
} => VerificationJsonOutput {
status: "content_mismatch".to_string(),
pdf_fingerprint: actual_fingerprint.to_string(),
page_index: receipt.page_index,
best_iou: *best_iou,
expected_content_hash: Some(expected_hash.clone()),
actual_content_hash: Some(actual_hash.clone()),
error: Some(format!(
"Content hash mismatch: expected {}, got {}",
expected_hash, actual_hash
)),
},
};
println!("{}", serde_json::to_string(&output).unwrap());
} else if !cmd.quiet {
// Human-readable output
match result {
VerificationResult::Ok { best_iou, actual_content_hash } => {
VerificationResult::Ok {
best_iou,
actual_content_hash,
} => {
println!(
"Receipt verified: {} page {} bbox [{}, {}, {}, {}]",
receipt.pdf_fingerprint,
@ -250,7 +259,10 @@ fn output_result(
receipt.bbox[2],
receipt.bbox[3]
);
println!("Best-match span IoU: {:.3}, content_hash: {}", best_iou, actual_content_hash);
println!(
"Best-match span IoU: {:.3}, content_hash: {}",
best_iou, actual_content_hash
);
}
VerificationResult::FingerprintMismatch { expected, actual } => {
eprintln!("Error: PDF fingerprint mismatch");
@ -259,14 +271,24 @@ fn output_result(
eprintln!();
eprintln!("The receipt was created for a different PDF file.");
}
VerificationResult::BboxMismatch { best_iou, threshold } => {
eprintln!("Error: Bbox mismatch (no span meets {}% IoU threshold)", threshold * 100.0);
VerificationResult::BboxMismatch {
best_iou,
threshold,
} => {
eprintln!(
"Error: Bbox mismatch (no span meets {}% IoU threshold)",
threshold * 100.0
);
eprintln!(" Best IoU: {:.3}%", best_iou * 100.0);
eprintln!(" Receipt bbox: [{}, {}, {}, {}]",
receipt.bbox[0], receipt.bbox[1], receipt.bbox[2], receipt.bbox[3]);
eprintln!(
" Receipt bbox: [{}, {}, {}, {}]",
receipt.bbox[0], receipt.bbox[1], receipt.bbox[2], receipt.bbox[3]
);
eprintln!();
eprintln!("No text span on page {} matches the receipt's bounding box.",
receipt.page_index);
eprintln!(
"No text span on page {} matches the receipt's bounding box.",
receipt.page_index
);
}
VerificationResult::ContentMismatch {
best_iou,
@ -278,7 +300,9 @@ fn output_result(
eprintln!(" Expected hash: {}", expected_hash);
eprintln!(" Actual hash: {}", actual_hash);
eprintln!();
eprintln!("The text at the receipt's location has changed since the receipt was created.");
eprintln!(
"The text at the receipt's location has changed since the receipt was created."
);
}
}
}

@ -19,14 +19,8 @@ const SDK_VERSION: &str = env!("CARGO_PKG_VERSION");
/// Simple semver comparison - returns Less if v1 < v2
fn compare_versions(v1: &str, v2: &str) -> std::cmp::Ordering {
let v1_parts: Vec<u32> = v1
.split('.')
.filter_map(|s| s.parse().ok())
.collect();
let v2_parts: Vec<u32> = v2
.split('.')
.filter_map(|s| s.parse().ok())
.collect();
let v1_parts: Vec<u32> = v1.split('.').filter_map(|s| s.parse().ok()).collect();
let v2_parts: Vec<u32> = v2.split('.').filter_map(|s| s.parse().ok()).collect();
for (a, b) in v1_parts.iter().zip(v2_parts.iter()) {
match a.cmp(b) {
@ -181,8 +175,8 @@ fn run_conformance(suite_path: &str, output_path: &str) -> Result<()> {
}
fn load_suite(path: &str) -> Result<Value> {
let suite_json = fs::read_to_string(path)
.context(format!("Failed to read suite from {}", path))?;
let suite_json =
fs::read_to_string(path).context(format!("Failed to read suite from {}", path))?;
serde_json::from_str(&suite_json).context("Failed to parse suite as JSON")
}
@ -212,8 +206,14 @@ fn run_test_case(case: &Value, schema_version: &str) -> Result<TestResult> {
let fixture = case["fixture"].as_str().unwrap_or("");
let method = case["method"].as_str().unwrap_or("extract");
let options = case.get("options").cloned().unwrap_or(Value::Object(Default::default()));
let expected = case.get("expected").cloned().unwrap_or(Value::Object(Default::default()));
let options = case
.get("options")
.cloned()
.unwrap_or(Value::Object(Default::default()));
let expected = case
.get("expected")
.cloned()
.unwrap_or(Value::Object(Default::default()));
let tolerances = case.get("tolerances").cloned();
let fixture_path = if fixture.starts_with("http://") || fixture.starts_with("https://") {
@ -283,10 +283,10 @@ fn execute_method(method: &str, fixture: &str, options: &Value) -> Result<Value>
}))
}
"extract_text" => Ok(Value::String("Sample text content".to_string())),
"extract_markdown" => Ok(Value::String("# Sample Markdown\n\nContent here".to_string())),
"extract_stream" => {
Ok(serde_json::json!({"output_type": "iterator", "frame_count": 3}))
}
"extract_markdown" => Ok(Value::String(
"# Sample Markdown\n\nContent here".to_string(),
)),
"extract_stream" => Ok(serde_json::json!({"output_type": "iterator", "frame_count": 3})),
"search" => Ok(serde_json::json!({
"output_type": "iterator",
"matches": [{"page": 0, "text": "found"}]
@ -346,7 +346,10 @@ fn compare_recursive(
}
}
(Value::String(act), Value::Object(exp)) => {
if let Some(min_len) = exp.get("min_length").and_then(|v| v.as_u64().map(|v| v as usize)) {
if let Some(min_len) = exp
.get("min_length")
.and_then(|v| v.as_u64().map(|v| v as usize))
{
if act.len() < min_len {
return Err(format!(
"[{}]: string length {} is less than minimum {}",
@ -428,14 +431,14 @@ fn compare_number(
tolerance: Option<&Value>,
path: &str,
) -> Result<(), String> {
let act_val = actual.as_f64().ok_or_else(|| {
format!("[{}]: actual number is not f64-representable", path)
})?;
let act_val = actual
.as_f64()
.ok_or_else(|| format!("[{}]: actual number is not f64-representable", path))?;
let exp_val = match expected {
Value::Number(n) => n.as_f64().ok_or_else(|| {
format!("[{}]: expected number is not f64-representable", path)
})?,
Value::Number(n) => n
.as_f64()
.ok_or_else(|| format!("[{}]: expected number is not f64-representable", path))?,
_ => {
return Err(format!("[{}]: expected value is not a number", path));
}
@ -532,13 +535,15 @@ fn write_report(report: &ConformanceReport, path: &str) -> Result<()> {
obj.insert("id".to_string(), Value::String(r.id.clone()));
obj.insert(
"status".to_string(),
Value::String(match r.status {
TestStatus::Pass => "pass",
TestStatus::Fail => "fail",
TestStatus::Skip => "skip",
TestStatus::Error => "error",
}
.to_string()),
Value::String(
match r.status {
TestStatus::Pass => "pass",
TestStatus::Fail => "fail",
TestStatus::Skip => "skip",
TestStatus::Error => "error",
}
.to_string(),
),
);
if let Some(actual) = &r.actual {
obj.insert("actual".to_string(), actual.clone());
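
The `compare_versions` helper reformatted earlier in this file parses each dotted component with `filter_map(|s| s.parse().ok())`, which silently drops any component that is not a plain integer. The following is a minimal sketch of that parsing step, not the file's actual implementation: `parse_parts` and `compare_versions_sketch` are hypothetical names, and falling back to `Vec` ordering when one version is a prefix of the other is an assumption, since the diff does not show how the original handles versions of different lengths.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: split "1.2.3" into numeric components.
/// Components that do not parse as u32 are skipped, mirroring the
/// `filter_map(|s| s.parse().ok())` call shown in the diff.
fn parse_parts(version: &str) -> Vec<u32> {
    version.split('.').filter_map(|s| s.parse().ok()).collect()
}

/// Assumed comparison rule: Vec<u32> orders lexicographically, so
/// components are compared left to right and a strict prefix sorts first.
fn compare_versions_sketch(v1: &str, v2: &str) -> Ordering {
    parse_parts(v1).cmp(&parse_parts(v2))
}
```

Under this sketch a pre-release suffix such as `1.0.0-beta` loses its last component, because `0-beta` is not an integer, so it parses to the same parts as `1.0`. Whether that is acceptable depends on the version strings the conformance suite actually sees.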

@ -24,13 +24,27 @@ fn test_stdio_and_bind_mutually_exclusive() {
.expect("Failed to execute pdftract mcp --stdio --bind");
// Should fail with exit code 2 (clap's error exit code)
assert_eq!(output.status.code(), Some(2), "Expected exit code 2, got {:?}", output.status.code());
assert_eq!(
output.status.code(),
Some(2),
"Expected exit code 2, got {:?}",
output.status.code()
);
// Error message should mention both flags
let stderr = String::from_utf8_lossy(&output.stderr);
assert!(stderr.contains("--stdio"), "Error message should mention --stdio");
assert!(stderr.contains("--bind"), "Error message should mention --bind");
assert!(stderr.contains("cannot be used"), "Error message should mention conflict");
assert!(
stderr.contains("--stdio"),
"Error message should mention --stdio"
);
assert!(
stderr.contains("--bind"),
"Error message should mention --bind"
);
assert!(
stderr.contains("cannot be used"),
"Error message should mention conflict"
);
}
/// Test that `pdftract mcp` (no flags) parses successfully.
@ -45,12 +59,21 @@ fn test_default_to_stdio() {
.expect("Failed to execute pdftract mcp --help");
// Should succeed
assert!(output.status.success(), "pdftract mcp --help should succeed");
assert!(
output.status.success(),
"pdftract mcp --help should succeed"
);
// Help text should mention the default behavior
let stdout = String::from_utf8_lossy(&output.stdout);
assert!(stdout.contains("default"), "Help should mention default transport mode");
assert!(stdout.contains("stdio"), "Help should mention stdio transport");
assert!(
stdout.contains("default"),
"Help should mention default transport mode"
);
assert!(
stdout.contains("stdio"),
"Help should mention stdio transport"
);
}
/// Test that `pdftract mcp --stdio` parses successfully.
@ -67,7 +90,10 @@ fn test_stdio_flag_valid() {
// Note: --help overrides the subcommand, so this succeeds
// In actual use, --stdio would start the stdio server
assert!(output.status.success(), "pdftract mcp --stdio --help should succeed");
assert!(
output.status.success(),
"pdftract mcp --stdio --help should succeed"
);
}
/// Test that `pdftract mcp --bind ADDR` parses successfully.
@ -85,7 +111,10 @@ fn test_bind_flag_valid() {
// Note: --help overrides the subcommand, so this succeeds
// In actual use, --bind would start the HTTP server
assert!(output.status.success(), "pdftract mcp --bind ADDR --help should succeed");
assert!(
output.status.success(),
"pdftract mcp --bind ADDR --help should succeed"
);
}
/// Test that the help text mentions ADR-006 and the mutual exclusion rationale.
@ -99,10 +128,16 @@ fn test_help_mentions_adr_006() {
.output()
.expect("Failed to execute pdftract mcp --help");
assert!(output.status.success(), "pdftract mcp --help should succeed");
assert!(
output.status.success(),
"pdftract mcp --help should succeed"
);
let stdout = String::from_utf8_lossy(&output.stdout);
// Help text should mention ADR-006 and the rationale
assert!(stdout.contains("ADR-006"), "Help should mention ADR-006");
assert!(stdout.contains("mutually exclusive"), "Help should mention mutual exclusion");
assert!(
stdout.contains("mutually exclusive"),
"Help should mention mutual exclusion"
);
}

@ -10,13 +10,13 @@
//! - Batch request handling
//! - Concurrent client handling (50 clients)
use std::process::{Command, Stdio, Child};
use std::thread;
use std::time::Duration;
use std::io::{BufRead, BufReader};
use std::net::TcpListener;
use reqwest::blocking::Client;
use serde_json::Value;
use std::io::{BufRead, BufReader};
use std::net::TcpListener;
use std::process::{Child, Command, Stdio};
use std::thread;
use std::time::Duration;
/// Find an available port for testing.
fn find_available_port() -> u16 {
@ -61,7 +61,8 @@ fn wait_for_server(port: u16, max_wait_ms: u64) -> bool {
let start = std::time::Instant::now();
while start.elapsed() < Duration::from_millis(max_wait_ms) {
if client.get(&format!("http://127.0.0.1:{}/health", port))
if client
.get(&format!("http://127.0.0.1:{}/health", port))
.send()
.map_or(false, |r| r.status().is_success())
{
@ -79,7 +80,10 @@ fn test_post_tools_list() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = Client::new();
let request_body = serde_json::json!({
@ -112,7 +116,10 @@ fn test_post_batch_request() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = Client::new();
let request_body = serde_json::json!([
@ -153,7 +160,10 @@ fn test_post_single_request_returns_single_response() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = Client::new();
let request_body = serde_json::json!({
@ -187,7 +197,10 @@ fn test_post_payload_too_large() {
let mut child = spawn_mcp_http_with_limit(port, 1);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = Client::new();
// Create a payload larger than 1 MB
@ -209,7 +222,10 @@ fn test_post_payload_too_large() {
let json: Value = response.json().expect("Response is not valid JSON");
assert_eq!(json["error"]["code"], -32002);
assert!(json["error"]["message"].as_str().unwrap().contains("too large"));
assert!(json["error"]["message"]
.as_str()
.unwrap()
.contains("too large"));
// Clean shutdown
child.kill().ok();
@ -222,7 +238,10 @@ fn test_get_health() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = Client::new();
let response = client
@ -247,7 +266,10 @@ fn test_get_sse_stream() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = reqwest::blocking::Client::builder()
.timeout(None)
@ -260,8 +282,15 @@ fn test_get_sse_stream() {
.expect("Failed to send request");
assert_eq!(response.status(), reqwest::StatusCode::OK);
assert_eq!(response.headers().get("content-type").unwrap().to_str().unwrap(),
"text/event-stream");
assert_eq!(
response
.headers()
.get("content-type")
.unwrap()
.to_str()
.unwrap(),
"text/event-stream"
);
// Read the initial connection message
let reader = BufReader::new(response);
@ -269,7 +298,11 @@ fn test_get_sse_stream() {
// First line should be a comment (connected)
if let Some(Ok(line)) = lines.next() {
assert!(line.starts_with(": connected"), "Expected ': connected', got: {}", line);
assert!(
line.starts_with(": connected"),
"Expected ': connected', got: {}",
line
);
}
// Clean shutdown
@ -286,7 +319,10 @@ fn test_auth_required_for_non_loopback() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = Client::new();
let request_body = serde_json::json!({
@ -316,7 +352,10 @@ fn test_unknown_method() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = Client::new();
let request_body = serde_json::json!({
@ -351,7 +390,10 @@ fn test_50_concurrent_clients() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = reqwest::blocking::Client::builder()
.timeout(Duration::from_secs(5))
@ -372,10 +414,7 @@ fn test_50_concurrent_clients() {
let url = format!("http://127.0.0.1:{}/", port);
thread::spawn(move || {
let response = client
.post(&url)
.json(&request_body)
.send();
let response = client.post(&url).json(&request_body).send();
(i, response)
})
@ -413,7 +452,11 @@ fn test_50_concurrent_clients() {
// All 50 clients should succeed without 5xx errors
assert_eq!(five_xx_count, 0, "Got {} 5xx errors", five_xx_count);
assert_eq!(error_count, 0, "Got {} errors", error_count);
assert_eq!(success_count, 50, "Got {} successes, expected 50", success_count);
assert_eq!(
success_count, 50,
"Got {} successes, expected 50",
success_count
);
// Clean shutdown
child.kill().ok();
@ -426,7 +469,10 @@ fn test_health_during_load() {
let mut child = spawn_mcp_http(port);
// Wait for server to be ready
assert!(wait_for_server(port, 2000), "Server did not start within 2 seconds");
assert!(
wait_for_server(port, 2000),
"Server did not start within 2 seconds"
);
let client = reqwest::blocking::Client::builder()
.timeout(Duration::from_secs(5))
@ -446,9 +492,7 @@ fn test_health_during_load() {
let request_body = request_body.clone();
let url = format!("http://127.0.0.1:{}/", port);
thread::spawn(move || {
client.post(&url).json(&request_body).send()
})
thread::spawn(move || client.post(&url).json(&request_body).send())
})
.collect();

@ -25,7 +25,10 @@ fn spawn_mcp_stdio() -> std::process::Child {
}
/// Helper to write a framed JSON-RPC message to stdin.
fn write_framed_message(stdin: &mut std::process::ChildStdin, json_body: &str) -> std::io::Result<()> {
fn write_framed_message(
stdin: &mut std::process::ChildStdin,
json_body: &str,
) -> std::io::Result<()> {
let header = format!("Content-Length: {}\r\n\r\n", json_body.len());
stdin.write_all(header.as_bytes())?;
stdin.write_all(json_body.as_bytes())?;
@ -52,13 +55,20 @@ fn read_framed_response<R: Read>(reader: &mut BufReader<R>) -> std::io::Result<O
}
if let Some(value) = line.strip_prefix("Content-Length:") {
content_length = Some(value.trim().parse::<usize>()
.map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?);
content_length = Some(
value
.trim()
.parse::<usize>()
.map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e))?,
);
}
}
let content_length = content_length.ok_or_else(|| {
std::io::Error::new(std::io::ErrorKind::InvalidData, "Missing Content-Length header")
std::io::Error::new(
std::io::ErrorKind::InvalidData,
"Missing Content-Length header",
)
})?;
let mut buffer = vec![0u8; content_length];
@ -98,8 +108,8 @@ fn test_tools_list_roundtrip() {
assert!(response.contains(r#""result""#));
// Verify it's valid JSON
let parsed: serde_json::Value = serde_json::from_str(&response)
.expect("Response is not valid JSON");
let parsed: serde_json::Value =
serde_json::from_str(&response).expect("Response is not valid JSON");
assert_eq!(parsed["jsonrpc"], "2.0");
assert_eq!(parsed["id"], 1);
@ -135,7 +145,11 @@ fn test_eof_clean_shutdown() {
}
};
assert!(status.success(), "Process did not exit cleanly: {:?}", status);
assert!(
status.success(),
"Process did not exit cleanly: {:?}",
status
);
}
/// Test that a parse error returns -32700 with id: null.
@ -186,8 +200,7 @@ fn test_parse_error_recovery() {
{
let stdout = child.stdout.as_mut().expect("Failed to open stdout");
let mut reader = BufReader::new(stdout);
read_framed_response(&mut reader)
.expect("Failed to read error response");
read_framed_response(&mut reader).expect("Failed to read error response");
}
// Now send a valid request
@ -253,18 +266,24 @@ fn test_stdout_json_rpc_only() {
child.kill().ok();
// Verify stdout is valid framed JSON-RPC
assert!(response.contains(r#"{"jsonrpc":"2.0""#), "Missing JSON-RPC response");
assert!(
response.contains(r#"{"jsonrpc":"2.0""#),
"Missing JSON-RPC response"
);
assert!(response.contains(r#""result""#), "Missing result field");
// Verify stderr contains logs (logs go to stderr, not stdout)
// The startup banner or other logs should be in stderr
let stderr_has_logs = !stderr_output.is_empty() ||
stderr_output.contains("pdftract") ||
stderr_output.contains("stdio") ||
stderr_output.contains("MCP") ||
stderr_output.contains("Signal");
assert!(stderr_has_logs || stderr_output.is_empty(),
"Stderr should contain logs, got: {}", stderr_output);
let stderr_has_logs = !stderr_output.is_empty()
|| stderr_output.contains("pdftract")
|| stderr_output.contains("stdio")
|| stderr_output.contains("MCP")
|| stderr_output.contains("Signal");
assert!(
stderr_has_logs || stderr_output.is_empty(),
"Stderr should contain logs, got: {}",
stderr_output
);
}
/// Test timing: request-response should complete within 50ms.
@ -291,8 +310,11 @@ fn test_request_response_timing() {
}
let elapsed = start.elapsed();
assert!(elapsed < Duration::from_millis(100),
"Request-response took {:?}, expected < 50ms", elapsed);
assert!(
elapsed < Duration::from_millis(100),
"Request-response took {:?}, expected < 50ms",
elapsed
);
// Clean shutdown
drop(child.stdin.take());
@ -362,7 +384,10 @@ fn test_notification_no_response() {
// Notifications don't get responses, so we shouldn't see data immediately
// (unless there's buffering from a previous request)
// For this test, we just verify the process is still alive
assert!(child.try_wait().unwrap().is_none(), "Process died unexpectedly");
assert!(
child.try_wait().unwrap().is_none(),
"Process died unexpectedly"
);
// Clean shutdown
drop(child.stdin.take());
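
The stdio tests above talk to the server using `Content-Length` framing: a header line, a blank line, then exactly that many bytes of JSON. The following is a minimal, in-memory sketch of that framing. `frame_message` mirrors the header format used by `write_framed_message`; `parse_frame` is a hypothetical inverse written for illustration and is not the test file's `read_framed_response`.

```rust
/// Wrap a JSON body in a Content-Length header.
fn frame_message(json_body: &str) -> Vec<u8> {
    // str::len() is the UTF-8 byte length, which is what the header must carry.
    let mut out = format!("Content-Length: {}\r\n\r\n", json_body.len()).into_bytes();
    out.extend_from_slice(json_body.as_bytes());
    out
}

/// Hypothetical inverse: extract the body from a single complete frame.
/// Returns None if the header is missing, malformed, or the body is short.
fn parse_frame(frame: &[u8]) -> Option<String> {
    let sep: &[u8] = b"\r\n\r\n";
    // Locate the blank line that ends the header section.
    let header_end = frame.windows(sep.len()).position(|w| w == sep)?;
    let header = std::str::from_utf8(&frame[..header_end]).ok()?;
    let len: usize = header
        .lines()
        .find_map(|line| line.strip_prefix("Content-Length:"))?
        .trim()
        .parse()
        .ok()?;
    let body_start = header_end + sep.len();
    // `get` returns None instead of panicking when the frame is truncated.
    let body = frame.get(body_start..body_start + len)?;
    String::from_utf8(body.to_vec()).ok()
}
```

Counting bytes rather than characters matters once a body contains non-ASCII text: a reader that allocates `content_length` bytes, as the test helper does, would otherwise stop short or over-read.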

@ -105,7 +105,10 @@ fn test_phase_7_stub_tools_return_not_implemented() {
let registry = tools::all_tools();
let stub_tools = [
("get_table", serde_json::json!({"path": "test.pdf", "page": 0, "table_index": 0})),
(
"get_table",
serde_json::json!({"path": "test.pdf", "page": 0, "table_index": 0}),
),
("get_form_fields", serde_json::json!({"path": "test.pdf"})),
("get_attachments", serde_json::json!({"path": "test.pdf"})),
("classify", serde_json::json!({"path": "test.pdf"})),
@ -161,7 +164,10 @@ fn test_extract_tool_with_real_pdf() {
let result = tool.execute(args, None, None);
if let Err(ref e) = result {
eprintln!("Error from tool: code={}, message={}, data={:?}", e.code, e.message, e.data);
eprintln!(
"Error from tool: code={}, message={}, data={:?}",
e.code, e.message, e.data
);
}
assert!(result.is_ok(), "Tool should succeed: {:?}", result);
@ -210,7 +216,10 @@ fn test_path_resolution() {
// Also check using CARGO_MANIFEST_DIR
if let Ok(manifest_dir) = std::env::var("CARGO_MANIFEST_DIR") {
let abs_path = format!("{}/{}", manifest_dir, "../../tests/sdk-conformance/fixtures/large/100pages.pdf");
let abs_path = format!(
"{}/{}",
manifest_dir, "../../tests/sdk-conformance/fixtures/large/100pages.pdf"
);
let exists = std::path::Path::new(&abs_path).exists();
println!("Absolute path '{}' exists: {}", abs_path, exists);
}
@ -252,7 +261,10 @@ fn test_encrypted_pdf_returns_pdf_encrypted_error() {
// Debug: print the result if it succeeds unexpectedly
if let Ok(ref response) = result {
eprintln!("Unexpected success on encrypted PDF: {}", serde_json::to_string_pretty(response).unwrap());
eprintln!(
"Unexpected success on encrypted PDF: {}",
serde_json::to_string_pretty(response).unwrap()
);
}
assert!(result.is_err(), "Encrypted PDF should return error");

@ -25,7 +25,10 @@ fn test_acceptance_criteria_path_traversal_rejected() {
let result = resolve_path("../../../etc/passwd", Some(root));
assert!(result.is_err());
let err = result.unwrap_err();
assert_eq!(err.code, -32602, "Should return -32602 (Invalid params) for path traversal");
assert_eq!(
err.code, -32602,
"Should return -32602 (Invalid params) for path traversal"
);
assert!(err.message.contains("escapes root"));
}
@ -67,7 +70,10 @@ fn test_acceptance_criteria_https_url_bypasses_check() {
let result = resolve_path("https://example.com/file.pdf", Some(root));
assert!(result.is_ok());
assert_eq!(result.unwrap(), std::path::PathBuf::from("https://example.com/file.pdf"));
assert_eq!(
result.unwrap(),
std::path::PathBuf::from("https://example.com/file.pdf")
);
}
#[test]
@ -75,7 +81,10 @@ fn test_acceptance_criteria_no_root_trust_the_caller() {
// Without --root, paths should be returned as-is (trust-the-caller mode)
let result = resolve_path("../../../etc/passwd", None);
assert!(result.is_ok());
assert_eq!(result.unwrap(), std::path::PathBuf::from("../../../etc/passwd"));
assert_eq!(
result.unwrap(),
std::path::PathBuf::from("../../../etc/passwd")
);
}
#[test]
@ -92,10 +101,8 @@ fn test_acceptance_criteria_symlink_escape_rejected() {
#[cfg(windows)]
{
std::os::windows::fs::symlink_file(
r"C:\Windows\System32\drivers\etc\hosts",
&symlink_path
).unwrap();
std::os::windows::fs::symlink_file(r"C:\Windows\System32\drivers\etc\hosts", &symlink_path)
.unwrap();
}
// Try to access the symlink
@ -134,7 +141,10 @@ fn test_plan_critical_test_path_traversal_with_root() {
let result = resolve_path("../../etc/passwd", Some(root));
assert!(result.is_err());
let err = result.unwrap_err();
assert_eq!(err.code, -32602, "Critical test: path traversal must return -32602");
assert_eq!(
err.code, -32602,
"Critical test: path traversal must return -32602"
);
assert!(err.message.contains("escapes root"));
// Verify the error data contains the expected code
@ -152,7 +162,10 @@ fn test_http_url_bypasses_check() {
let result = resolve_path("http://example.com/file.pdf", Some(root));
assert!(result.is_ok());
assert_eq!(result.unwrap(), std::path::PathBuf::from("http://example.com/file.pdf"));
assert_eq!(
result.unwrap(),
std::path::PathBuf::from("http://example.com/file.pdf")
);
}
#[test]
@ -205,6 +218,10 @@ fn test_complex_path_traversal_patterns() {
let result = resolve_path(pattern, Some(root));
assert!(result.is_err(), "Pattern '{}' should be rejected", pattern);
let err = result.unwrap_err();
assert_eq!(err.code, -32602, "Pattern '{}' should return -32602", pattern);
assert_eq!(
err.code, -32602,
"Pattern '{}' should return -32602",
pattern
);
}
}
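
The tests above pin down three behaviours of `resolve_path`: with a root, traversal outside it is rejected with an "escapes root" error; `http://` and `https://` URLs bypass the check; without a root, the path is returned unchanged. The following is a purely lexical sketch of those rules, not the project's implementation. `resolve_path_sketch` is a hypothetical name, it returns a `String` error rather than the JSON-RPC error with code -32602, it rejects absolute paths outright (an assumption), and it does not follow symlinks, so it would not catch the symlink escape that the real function is tested against.

```rust
use std::path::{Component, Path, PathBuf};

/// Hypothetical, lexical-only approximation of the rules the tests exercise.
fn resolve_path_sketch(path: &str, root: Option<&Path>) -> Result<PathBuf, String> {
    // URLs are passed through untouched.
    if path.starts_with("http://") || path.starts_with("https://") {
        return Ok(PathBuf::from(path));
    }
    // No root configured: trust the caller.
    let Some(root) = root else {
        return Ok(PathBuf::from(path));
    };
    let mut resolved = root.to_path_buf();
    let mut depth = 0usize; // how many components we are below the root
    for component in Path::new(path).components() {
        match component {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    return Err(format!("path '{}' escapes root", path));
                }
                resolved.pop();
                depth -= 1;
            }
            // Assumption: absolute paths are treated as escaping the root.
            Component::RootDir | Component::Prefix(_) => {
                return Err(format!("path '{}' escapes root", path));
            }
        }
    }
    Ok(resolved)
}
```

A lexical check like this is cheap and works for paths that do not exist yet, but only canonicalizing the final path against the canonical root can detect a symlink that points outside it.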

@ -3,12 +3,12 @@
// Tests the performance of line-based and borderless table detection
// on pages with varying numbers of path segments and text positions.
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
use pdftract_core::table::{TableDetector, PageContext};
use pdftract_core::parser::pages::PageDict;
use std::sync::Arc;
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
use pdftract_core::parser::object::ObjRef;
use pdftract_core::parser::pages::PageDict;
use pdftract_core::parser::resources::ResourceDict;
use pdftract_core::table::{PageContext, TableDetector};
use std::sync::Arc;
fn make_page() -> PageDict {
PageDict {
@ -99,9 +99,7 @@ fn bench_table_detection(c: &mut Criterion) {
let content = generate_grid_content(num_horiz, num_vert);
let ctx = PageContext::new(&page, &content);
b.iter(|| {
black_box(detector.detect_line_based(black_box(&ctx)))
});
b.iter(|| black_box(detector.detect_line_based(black_box(&ctx))));
},
);
}
@ -111,9 +109,7 @@ fn bench_table_detection(c: &mut Criterion) {
let content = generate_grid_content(500, 500);
let ctx = PageContext::new(&page, &content);
b.iter(|| {
black_box(detector.detect_line_based(black_box(&ctx)))
});
b.iter(|| black_box(detector.detect_line_based(black_box(&ctx))));
});
group.finish();
@ -135,9 +131,7 @@ fn bench_borderless_detection(c: &mut Criterion) {
let content = generate_borderless_content(num_rows, num_cols);
let ctx = PageContext::new(&page, &content);
b.iter(|| {
black_box(detector.detect_borderless(black_box(&ctx)))
});
b.iter(|| black_box(detector.detect_borderless(black_box(&ctx))));
},
);
}

@ -33,37 +33,42 @@ fn main() {
}
fn generate_std14_metrics(out_dir: &Path, metrics_path: &Path) {
let json_content = fs::read_to_string(metrics_path).expect("Failed to read std14-metrics.json");
let json_content = fs::read_to_string(metrics_path)
.expect("Failed to read std14-metrics.json");
let data: serde_json::Value =
serde_json::from_str(&json_content).expect("Failed to parse std14-metrics.json");
let data: serde_json::Value = serde_json::from_str(&json_content)
.expect("Failed to parse std14-metrics.json");
let fonts = data["fonts"].as_object()
.expect("fonts object missing");
let fonts = data["fonts"].as_object().expect("fonts object missing");
let mut metrics_structs = String::new();
for (font_name, font_data) in fonts {
let font_ident = font_name.replace("-", "_");
let weights = font_data["weights"].as_array()
let weights = font_data["weights"]
.as_array()
.expect("weights array missing");
let weights_array: Vec<String> = weights.iter()
let weights_array: Vec<String> = weights
.iter()
.map(|v| v.as_u64().unwrap_or(0).to_string())
.collect();
let font_bbox = font_data["font_bbox"].as_array()
let font_bbox = font_data["font_bbox"]
.as_array()
.expect("font_bbox array missing");
let font_bbox: Vec<String> = font_bbox.iter()
let font_bbox: Vec<String> = font_bbox
.iter()
.map(|v| v.as_i64().unwrap_or(0).to_string())
.collect();
let ascent = font_data["ascent"].as_i64().expect("ascent missing");
let descent = font_data["descent"].as_i64().expect("descent missing");
let italic_angle = font_data["italic_angle"].as_f64().expect("italic_angle missing");
let cap_height = font_data["cap_height"].as_i64().expect("cap_height missing");
let italic_angle = font_data["italic_angle"]
.as_f64()
.expect("italic_angle missing");
let cap_height = font_data["cap_height"]
.as_i64()
.expect("cap_height missing");
let stem_v = font_data["stem_v"].as_i64().expect("stem_v missing");
let encoding_str = font_data["encoding"].as_str().expect("encoding missing");
@ -74,7 +79,8 @@ fn generate_std14_metrics(out_dir: &Path, metrics_path: &Path) {
_ => "NamedEncoding::Standard",
};
metrics_structs.push_str(&format!(r#"
metrics_structs.push_str(&format!(
r#"
static {}_WIDTHS: &[u16; 256] = &[{}];
static {}_METRICS: Std14Metrics = Std14Metrics {{
widths: &{}_WIDTHS,
@ -106,10 +112,14 @@ static {}_METRICS: Std14Metrics = Std14Metrics {{
for font_name in fonts.keys() {
let ident = font_name.replace("-", "_");
map_builder.entry(font_name.as_str(), &format!("&{}_METRICS", ident.to_uppercase()));
map_builder.entry(
font_name.as_str(),
&format!("&{}_METRICS", ident.to_uppercase()),
);
}
let rust_code = format!(r#"
let rust_code = format!(
r#"
// Auto-generated Standard 14 font metrics.
// Do not edit manually.
@ -129,14 +139,13 @@ pub fn get_std14_metrics(name: &str) -> Option<&'static Std14Metrics> {{
}
fn generate_named_encodings(out_dir: &Path, encodings_path: &Path) {
let json_content = fs::read_to_string(encodings_path)
.expect("Failed to read named-encodings.json");
let json_content =
fs::read_to_string(encodings_path).expect("Failed to read named-encodings.json");
let data: serde_json::Value = serde_json::from_str(&json_content)
.expect("Failed to parse named-encodings.json");
let data: serde_json::Value =
serde_json::from_str(&json_content).expect("Failed to parse named-encodings.json");
let encodings = data.as_object()
.expect("encodings object missing");
let encodings = data.as_object().expect("encodings object missing");
let mut encoding_arrays = String::new();
@ -151,7 +160,8 @@ fn generate_named_encodings(out_dir: &Path, encodings_path: &Path) {
_ => continue,
};
let entries = encoding_data.as_object()
let entries = encoding_data
.as_object()
.expect("encoding data is not an object");
let mut array_values = Vec::new();
@ -165,7 +175,8 @@ fn generate_named_encodings(out_dir: &Path, encodings_path: &Path) {
array_values.push(rust_value);
}
encoding_arrays.push_str(&format!(r#"
encoding_arrays.push_str(&format!(
r#"
pub static {}: [Option<&'static str>; 256] = [
{}];
"#,
@ -174,7 +185,8 @@ pub static {}: [Option<&'static str>; 256] = [
));
}
let rust_code = format!(r#"
let rust_code = format!(
r#"
// Auto-generated named encoding tables.
// Do not edit manually.
// Source: ISO 32000-1 Annex D
@ -200,39 +212,39 @@ pub fn get_named_encoding_table(encoding: NamedEncoding) -> &'static [Option<&'s
}
fn generate_agl_maps(out_dir: &Path, agl_path: &Path) {
let json_content = fs::read_to_string(agl_path)
.expect("Failed to read agl.json");
let json_content = fs::read_to_string(agl_path).expect("Failed to read agl.json");
let data: serde_json::Value = serde_json::from_str(&json_content)
.expect("Failed to parse agl.json");
let data: serde_json::Value =
serde_json::from_str(&json_content).expect("Failed to parse agl.json");
// Single-codepoint map
let single = data["merged_single"].as_object()
let single = data["merged_single"]
.as_object()
.expect("merged_single object missing");
let mut single_map_builder = phf_codegen::Map::new();
for (name, uvalue) in single {
let uvalue_str = uvalue.as_str()
.expect("unicode value is not a string");
let uvalue_str = uvalue.as_str().expect("unicode value is not a string");
// Parse the JSON unicode escape like "\u0041" into a Rust char literal
let unicode_char = decode_json_unicode(uvalue_str);
single_map_builder.entry(name.as_str(), &format!("'\\u{{{}}}'", unicode_char));
}
// Multi-codepoint map
let multi = data["merged_multi"].as_object()
let multi = data["merged_multi"]
.as_object()
.expect("merged_multi object missing");
let mut multi_arrays = String::new();
let mut multi_map_builder = phf_codegen::Map::new();
for (name, uvalues) in multi {
let uvalues_arr = uvalues.as_array()
.expect("multi value is not an array");
let uvalues_arr = uvalues.as_array().expect("multi value is not an array");
let ident = name.to_uppercase().replace("-", "_").replace(".", "_");
let chars: Vec<String> = uvalues_arr.iter()
let chars: Vec<String> = uvalues_arr
.iter()
.map(|v| {
let uvalue_str = v.as_str().expect("unicode value is not a string");
let unicode_char = decode_json_unicode(uvalue_str);
@ -240,7 +252,8 @@ fn generate_agl_maps(out_dir: &Path, agl_path: &Path) {
})
.collect();
multi_arrays.push_str(&format!(r#"
multi_arrays.push_str(&format!(
r#"
static {}: &[char] = &[{}];
"#,
ident,
@ -250,7 +263,8 @@ static {}: &[char] = &[{}];
multi_map_builder.entry(name.as_str(), &format!("&{}", ident));
}
let rust_code = format!(r#"
let rust_code = format!(
r#"
// Auto-generated Adobe Glyph List (AGL) phf maps.
// Do not edit manually.
// Source: Adobe Glyph List 1.4 + AGLFN 1.7
@ -271,8 +285,7 @@ pub static AGL_MULTI: phf::Map<&'static str, &[char]> = {};
multi_map_builder.build()
);
fs::write(Path::new(out_dir).join("agl.rs"), rust_code)
.expect("Failed to write agl.rs");
fs::write(Path::new(out_dir).join("agl.rs"), rust_code).expect("Failed to write agl.rs");
}
/// Decode a JSON unicode escape string like "\\u0041" to "0041".
@ -302,14 +315,13 @@ fn decode_json_unicode(s: &str) -> String {
/// Each entry maps a glyph ID to a Unicode codepoint for a specific font
/// identified by its SHA-256 hash.
fn generate_font_fingerprints(out_dir: &Path, fingerprints_path: &Path) {
let json_content = fs::read_to_string(fingerprints_path)
.expect("Failed to read font-fingerprints.json");
let json_content =
fs::read_to_string(fingerprints_path).expect("Failed to read font-fingerprints.json");
let data: serde_json::Value = serde_json::from_str(&json_content)
.expect("Failed to parse font-fingerprints.json");
let data: serde_json::Value =
serde_json::from_str(&json_content).expect("Failed to parse font-fingerprints.json");
let fonts = data.as_array()
.expect("font-fingerprints must be an array");
let fonts = data.as_array().expect("font-fingerprints must be an array");
let mut entries_arrays = String::new();
let mut map_builder = phf_codegen::Map::new();
@ -319,7 +331,8 @@ fn generate_font_fingerprints(out_dir: &Path, fingerprints_path: &Path) {
let mut values = Vec::new();
for font_entry in fonts {
let sha256_hex = font_entry.get("sha256_hex")
let sha256_hex = font_entry
.get("sha256_hex")
.and_then(|v| v.as_str())
.expect("sha256_hex must be a string");
@ -330,14 +343,18 @@ fn generate_font_fingerprints(out_dir: &Path, fingerprints_path: &Path) {
// Validate SHA-256 hex (64 hex chars = 32 bytes)
if sha256_hex.len() != 64 {
panic!("SHA-256 hex must be 64 characters, got {}", sha256_hex.len());
panic!(
"SHA-256 hex must be 64 characters, got {}",
sha256_hex.len()
);
}
// Convert hex string to [u8; 32] bytes
let hash_bytes: [u8; 32] = hex_decode_to_array(sha256_hex);
// Get entries
let entries = font_entry.get("entries")
let entries = font_entry
.get("entries")
.and_then(|v| v.as_array())
.expect("entries must be an array");
@ -347,8 +364,14 @@ fn generate_font_fingerprints(out_dir: &Path, fingerprints_path: &Path) {
let mut entry_values = Vec::new();
for entry in entries {
let arr = entry.as_array().expect("entry must be an array");
let gid = arr.get(0).and_then(|v| v.as_u64()).expect("gid must be a number") as u16;
let codepoint = arr.get(1).and_then(|v| v.as_u64()).expect("codepoint must be a number") as u32;
let gid = arr
.get(0)
.and_then(|v| v.as_u64())
.expect("gid must be a number") as u16;
let codepoint = arr
.get(1)
.and_then(|v| v.as_u64())
.expect("codepoint must be a number") as u32;
// Validate codepoint is a valid Unicode scalar value
if !is_valid_unicode_scalar(codepoint) {
@ -358,7 +381,8 @@ fn generate_font_fingerprints(out_dir: &Path, fingerprints_path: &Path) {
entry_values.push(format!("({}, {})", gid, codepoint));
}
entries_arrays.push_str(&format!(r#"
entries_arrays.push_str(&format!(
r#"
static {}: &[(u16, u32)] = &[{}];
"#,
ident,
@ -366,9 +390,7 @@ static {}: &[(u16, u32)] = &[{}];
));
// Build the phf map key as a byte array literal
let key_bytes: Vec<String> = hash_bytes.iter()
.map(|b| format!("0x{:02x}", b))
.collect();
let key_bytes: Vec<String> = hash_bytes.iter().map(|b| format!("0x{:02x}", b)).collect();
let key = format!("[{}]", key_bytes.join(", "));
let value = format!("&{}", ident);
@ -382,7 +404,8 @@ static {}: &[(u16, u32)] = &[{}];
map_builder.entry(key.as_str(), value.as_str());
}
let rust_code = format!(r#"
let rust_code = format!(
r#"
// Auto-generated font fingerprint phf map.
// Do not edit manually.
// Source: build/font-fingerprints.json
@ -415,8 +438,7 @@ fn hex_decode_to_array(hex: &str) -> [u8; 32] {
let mut bytes = [0u8; 32];
for i in 0..32 {
let byte_str = &hex[i * 2..i * 2 + 2];
bytes[i] = u8::from_str_radix(byte_str, 16)
.expect("Invalid hex string");
bytes[i] = u8::from_str_radix(byte_str, 16).expect("Invalid hex string");
}
bytes
}
@ -450,7 +472,8 @@ fn generate_collection_cmap(out_dir: &Path, base_dir: &Path, json_name: &str, mo
// Check if the JSON file exists
if !json_path.exists() {
// Generate a stub implementation
let rust_code = format!(r#"
let rust_code = format!(
r#"
// Auto-generated {collection} CID to Unicode mapping.
//
// Source: {json_name}.json (not found - stub implementation)
@ -469,13 +492,12 @@ pub fn cid_to_unicode(cid: u32) -> Option<&'static [char]> {{
json_name = json_name,
);
fs::write(&out_path, rust_code)
.expect(&format!("Failed to write {}", out_path.display()));
fs::write(&out_path, rust_code).expect(&format!("Failed to write {}", out_path.display()));
return;
}
let json_content = fs::read_to_string(&json_path)
.expect(&format!("Failed to read {}", json_path.display()));
let json_content =
fs::read_to_string(&json_path).expect(&format!("Failed to read {}", json_path.display()));
let data: serde_json::Value = serde_json::from_str(&json_content)
.expect(&format!("Failed to parse {}", json_path.display()));
@ -486,7 +508,8 @@ pub fn cid_to_unicode(cid: u32) -> Option<&'static [char]> {{
if let Some(mappings) = data.as_object() {
for (cid_str, unicode_value) in mappings {
let cid: u32 = cid_str.parse()
let cid: u32 = cid_str
.parse()
.expect(&format!("Invalid CID key: {}", cid_str));
// Parse the Unicode value
@ -497,11 +520,13 @@ pub fn cid_to_unicode(cid: u32) -> Option<&'static [char]> {{
let array_ident = format!("CID_{}_{}", module_name.to_uppercase(), cid);
// Build the array
let char_literals: Vec<String> = chars.iter()
let char_literals: Vec<String> = chars
.iter()
.map(|c| format!("'\\u{{{:04X}}}'", *c as u32))
.collect();
arrays.push_str(&format!(r#"
arrays.push_str(&format!(
r#"
static {}: &[char] = &[{}];
"#,
array_ident,
@ -514,7 +539,8 @@ static {}: &[char] = &[{}];
}
}
let rust_code = format!(r#"
let rust_code = format!(
r#"
// Auto-generated {collection} CID to Unicode mapping.
//
// Source: {json_name}.json
@ -542,8 +568,7 @@ pub fn cid_to_unicode(cid: u32) -> Option<&'static [char]> {{
map = map_builder.build(),
);
fs::write(&out_path, rust_code)
.expect(&format!("Failed to write {}", out_path.display()));
fs::write(&out_path, rust_code).expect(&format!("Failed to write {}", out_path.display()));
}
/// Parse a Unicode value from JSON to a Vec<char>.

@ -1,8 +1,11 @@
use std::sync::Arc;
use indexmap::IndexMap;
use std::sync::Arc;
fn main() {
println!("IndexMap<Arc<str>, ()>: {}", std::mem::size_of::<IndexMap<Arc<str>, ()>>());
println!(
"IndexMap<Arc<str>, ()>: {}",
std::mem::size_of::<IndexMap<Arc<str>, ()>>()
);
println!("Vec<u8>: {}", std::mem::size_of::<Vec<u8>>());
println!("Vec<()>: {}", std::mem::size_of::<Vec<()>>());
println!("Arc<str>: {}", std::mem::size_of::<Arc<str>>());

@ -1,9 +1,9 @@
// Simple test to verify forward_scan_xref functionality
// This is a standalone test file to verify the forward scan implementation
use std::collections::HashMap;
use pdftract_core::parser::xref::{XrefEntry, XrefSection, forward_scan_xref};
use pdftract_core::parser::stream::MemorySource;
use pdftract_core::parser::xref::{forward_scan_xref, XrefEntry, XrefSection};
use std::collections::HashMap;
fn main() {
println!("Testing forward_scan_xref implementation...\n");
@ -44,7 +44,10 @@ fn main() {
let source = MemorySource::new(pdf_data.to_vec());
let result = forward_scan_xref(&source, false);
println!(" Found {} objects (including the one after truncated xref)", result.len());
println!(
" Found {} objects (including the one after truncated xref)",
result.len()
);
assert!(result.len() >= 4, "Expected at least 4 objects");
println!(" ✓ PASSED\n");
@ -57,8 +60,13 @@ fn main() {
println!(" Found {} objects (should be 0)", result.len());
assert_eq!(result.len(), 0, "Expected 0 objects for linearized file");
println!(" Has LINEARIZED_NO_FORWARD_SCAN diagnostic: {}",
result.diagnostics.iter().any(|d| matches!(d.code, pdftract_core::parser::xref::XrefDiagCode::LinearizedNoForwardScan)));
println!(
" Has LINEARIZED_NO_FORWARD_SCAN diagnostic: {}",
result.diagnostics.iter().any(|d| matches!(
d.code,
pdftract_core::parser::xref::XrefDiagCode::LinearizedNoForwardScan
))
);
println!(" ✓ PASSED\n");
// Test 4: Multi-revision - last occurrence wins
@ -88,9 +96,16 @@ fn main() {
let source = MemorySource::new(pdf_data.to_vec());
let result = forward_scan_xref(&source, false);
let has_repaired_diagnostic = result.diagnostics.iter()
.any(|d| matches!(d.code, pdftract_core::parser::xref::XrefDiagCode::XrefRepaired));
println!(" Has XREF_REPAIRED diagnostic: {}", has_repaired_diagnostic);
let has_repaired_diagnostic = result.diagnostics.iter().any(|d| {
matches!(
d.code,
pdftract_core::parser::xref::XrefDiagCode::XrefRepaired
)
});
println!(
" Has XREF_REPAIRED diagnostic: {}",
has_repaired_diagnostic
);
assert!(has_repaired_diagnostic, "Expected XREF_REPAIRED diagnostic");
println!(" ✓ PASSED\n");

@ -1,26 +1,32 @@
use lzw::{MsbReader, Decoder, DecoderEarlyChange};
use lzw::{Decoder, DecoderEarlyChange, MsbReader};
fn main() {
// Test basic encoding/decoding
let data = b"hello world!";
// Encode with early change
let mut encoder = lzw::EncoderEarlyChange::new(lzw::MsbWriter::new(), 8);
let encoded_early: Vec<u8> = encoder.encode_bytes(data).0;
println!("Encoded (early change): {:02x?}", encoded_early);
// Decode with early change
let mut decoder = DecoderEarlyChange::new(MsbReader::new(), 8);
let (consumed, decoded) = decoder.decode_bytes(&encoded_early).unwrap();
println!("Decoded (early change): {:?}", std::str::from_utf8(decoded).unwrap());
println!(
"Decoded (early change): {:?}",
std::str::from_utf8(decoded).unwrap()
);
// Encode with late change
let mut encoder2 = lzw::Encoder::new(lzw::MsbWriter::new(), 8);
let encoded_late: Vec<u8> = encoder2.encode_bytes(data).0;
println!("Encoded (late change): {:02x?}", encoded_late);
// Decode with late change
let mut decoder2 = Decoder::new(MsbReader::new(), 8);
let (consumed2, decoded2) = decoder2.decode_bytes(&encoded_late).unwrap();
println!("Decoded (late change): {:?}", std::str::from_utf8(decoded2).unwrap());
println!(
"Decoded (late change): {:?}",
std::str::from_utf8(decoded2).unwrap()
);
}

@ -1,5 +1,5 @@
use pdftract_core::parser::xref;
use pdftract_core::parser::stream::{MemorySource, PdfSource};
use pdftract_core::parser::xref;
use std::fs::File;
use std::io::Read;
@ -12,7 +12,10 @@ fn main() {
// Find startxref BEFORE moving buffer
let search_bytes = &buffer[buffer.len().saturating_sub(1024)..];
let pos = search_bytes.windows(9).rposition(|w| w == b"startxref").unwrap();
let pos = search_bytes
.windows(9)
.rposition(|w| w == b"startxref")
.unwrap();
let start = buffer.len().saturating_sub(1024) + pos + 9;
// Skip whitespace
@ -31,21 +34,24 @@ fn main() {
// Now create source
let source = MemorySource::new(buffer);
println!("startxref offset: {}", start_offset);
let xref_section = xref::load_xref_with_prev_chain(&source, start_offset);
println!("Has trailer: {}", xref_section.trailer.is_some());
if let Some(trailer) = &xref_section.trailer {
println!("Trailer keys: {:?}", trailer.keys().collect::<Vec<_>>());
println!("Root entry: {:?}", trailer.get("Root"));
println!("Size entry: {:?}", trailer.get("Size"));
}
println!("Diagnostics count: {}", xref_section.diagnostics.len());
for diag in &xref_section.diagnostics {
println!(" - {}: {} at byte_offset {:?}", diag.code, diag.message, diag.byte_offset);
println!(
" - {}: {} at byte_offset {:?}",
diag.code, diag.message, diag.byte_offset
);
}
}

@ -20,9 +20,9 @@
//! - "EncryptedPayload": The file is an encrypted payload
//! - "Unspecified": No specific relationship (default)
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::object::ObjRef;
use crate::parser::xref::XrefResolver;
use crate::diagnostics::{Diagnostic, DiagCode};
/// Result type for /AF parsing.
pub type Result<T> = std::result::Result<T, Vec<Diagnostic>>;
@ -119,7 +119,11 @@ pub fn walk_af_array(
None => {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructInvalidType,
format!("/AF[{}] is not a reference (type: {})", idx, entry_obj.type_name()),
format!(
"/AF[{}] is not a reference (type: {})",
idx,
entry_obj.type_name()
),
));
continue;
}
@ -179,19 +183,21 @@ fn extract_af_relationship(
None => {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructInvalidType,
format!("Filespec {} is not a dictionary (type: {})", filespec_ref, filespec_obj.type_name()),
format!(
"Filespec {} is not a dictionary (type: {})",
filespec_ref,
filespec_obj.type_name()
),
));
return Err(diagnostics);
}
};
// Extract /AFRelationship (optional)
let relationship = filespec_dict
.get("/AFRelationship")
.and_then(|obj| {
// /AFRelationship is typically a Name object
obj.as_name().map(|s| s.to_string())
});
let relationship = filespec_dict.get("/AFRelationship").and_then(|obj| {
// /AFRelationship is typically a Name object
obj.as_name().map(|s| s.to_string())
});
Ok(relationship)
}
@ -203,11 +209,7 @@ mod tests {
use indexmap::IndexMap;
/// Helper to create a test Filespec dictionary.
fn make_filespec(
resolver: &XrefResolver,
obj_ref: ObjRef,
relationship: Option<&str>,
) {
fn make_filespec(resolver: &XrefResolver, obj_ref: ObjRef, relationship: Option<&str>) {
let mut dict = IndexMap::new();
dict.insert(intern("/Type"), PdfObject::Name(intern("Filespec")));
dict.insert(intern("/F"), PdfObject::Name(intern("test.pdf")));
@ -326,7 +328,9 @@ mod tests {
assert!(result.is_err());
let diagnostics = result.unwrap_err();
assert!(diagnostics.iter().any(|d| d.message.contains("not an array")));
assert!(diagnostics
.iter()
.any(|d| d.message.contains("not an array")));
}
#[test]
@ -350,15 +354,14 @@ mod tests {
assert!(result.is_err());
let diagnostics = result.unwrap_err();
assert!(diagnostics.iter().any(|d| d.message.contains("not a reference")));
assert!(diagnostics
.iter()
.any(|d| d.message.contains("not a reference")));
}
#[test]
fn test_associated_file_entry_new() {
let entry = AssociatedFileEntry::new(
Some("Data".to_string()),
ObjRef::new(42, 0),
);
let entry = AssociatedFileEntry::new(Some("Data".to_string()), ObjRef::new(42, 0));
assert_eq!(entry.relationship, Some("Data".to_string()));
assert_eq!(entry.filespec_ref, ObjRef::new(42, 0));
@ -428,7 +431,10 @@ mod tests {
assert_eq!(entries[2].filespec_ref, fs3);
assert_eq!(entries[0].relationship, Some("Unspecified".to_string()));
assert_eq!(entries[1].relationship, Some("EncryptedPayload".to_string()));
assert_eq!(
entries[1].relationship,
Some("EncryptedPayload".to_string())
);
assert_eq!(entries[2].relationship, Some("Source".to_string()));
}
@ -465,10 +471,7 @@ mod tests {
assert_eq!(entries.len(), relationships.len());
for (idx, entry) in entries.iter().enumerate() {
assert_eq!(
entry.relationship.as_deref(),
Some(relationships[idx])
);
assert_eq!(entry.relationship.as_deref(), Some(relationships[idx]));
}
}
}

@ -9,4 +9,4 @@
pub mod associated_files;
// Re-export key types for convenience
pub use associated_files::{AssociatedFileEntry, walk_af_array};
pub use associated_files::{walk_af_array, AssociatedFileEntry};

@ -129,7 +129,9 @@ pub fn decode(data: &[u8]) -> io::Result<Vec<u8>> {
let mut result = Vec::with_capacity(data.len().min(MAX_DECOMPRESSED_SIZE));
{
let mut decoder = zstd::Decoder::new(data)?;
decoder.take(MAX_DECOMPRESSED_SIZE as u64).read_to_end(&mut result)?;
decoder
.take(MAX_DECOMPRESSED_SIZE as u64)
.read_to_end(&mut result)?;
}
// Check if we hit the bomb limit
@ -466,7 +468,10 @@ mod tests {
let mut result = Vec::with_capacity(SMALL_LIMIT);
{
let decoder = zstd::Decoder::new(&*compressed).unwrap();
decoder.take(SMALL_LIMIT as u64).read_to_end(&mut result).unwrap();
decoder
.take(SMALL_LIMIT as u64)
.read_to_end(&mut result)
.unwrap();
}
// Verify we truncated at the limit

@ -151,9 +151,7 @@ fn canonical_json_value(value: &Value) -> Value {
}
Value::Object(sorted.into_iter().collect())
}
Value::Array(arr) => {
Value::Array(arr.iter().map(canonical_json_value).collect())
}
Value::Array(arr) => Value::Array(arr.iter().map(canonical_json_value).collect()),
// Numbers: preserve integer representation, canonicalize floats
Value::Number(n) => {
if n.is_i64() || n.is_u64() {
@ -253,7 +251,10 @@ mod tests {
let json_str = canonical.to_string();
let ev_pos = json_str.find("extraction_version").unwrap();
let receipts_pos = json_str.find("receipts").unwrap();
assert!(ev_pos < receipts_pos, "Keys should be sorted lexicographically");
assert!(
ev_pos < receipts_pos,
"Keys should be sorted lexicographically"
);
}
#[test]
@ -335,8 +336,8 @@ mod tests {
let key2 = CacheKey::new("fp", &opts);
// Same key should hash the same
use std::hash::{Hash, Hasher};
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
let mut h1 = DefaultHasher::new();
key1.hash(&mut h1);
@ -361,8 +362,11 @@ mod tests {
assert!(key.opts_hash.chars().all(|c| c.is_ascii_hexdigit()));
// hex::encode produces lowercase hex (0-9, a-f), verify no uppercase letters
assert!(key.opts_hash.chars().all(|c| !c.is_ascii_uppercase()),
"Hash should be lowercase hex: {}", key.opts_hash);
assert!(
key.opts_hash.chars().all(|c| !c.is_ascii_uppercase()),
"Hash should be lowercase hex: {}",
key.opts_hash
);
}
#[test]
@ -376,8 +380,10 @@ mod tests {
let key1 = CacheKey::new("fp", &opts1);
let key2 = CacheKey::new("fp", &opts2);
assert_eq!(key1.opts_hash, key2.opts_hash,
"Same logical request should produce same key");
assert_eq!(
key1.opts_hash, key2.opts_hash,
"Same logical request should produce same key"
);
}
#[test]
@ -388,8 +394,10 @@ mod tests {
let key_off = CacheKey::new("fp", &opts_off);
let key_lite = CacheKey::new("fp", &opts_lite);
assert_ne!(key_off.opts_hash, key_lite.opts_hash,
"Different logical requests should produce different keys");
assert_ne!(
key_off.opts_hash, key_lite.opts_hash,
"Different logical requests should produce different keys"
);
}
// Acceptance criteria tests for Phase 6.9.2
@ -408,8 +416,10 @@ mod tests {
let key1 = CacheKey::new("fp", &opts1);
let key2 = CacheKey::new("fp", &opts2);
assert_eq!(key1.opts_hash, key2.opts_hash,
"Same effective values should produce same hash");
assert_eq!(
key1.opts_hash, key2.opts_hash,
"Same effective values should produce same hash"
);
}
#[test]
@ -421,8 +431,10 @@ mod tests {
let key_off = CacheKey::new("fp", &opts_off);
let key_lite = CacheKey::new("fp", &opts_lite);
assert_ne!(key_off.opts_hash, key_lite.opts_hash,
"Toggling receipts from off to lite should change hash");
assert_ne!(
key_off.opts_hash, key_lite.opts_hash,
"Toggling receipts from off to lite should change hash"
);
}
#[test]
@ -442,8 +454,10 @@ mod tests {
hex::encode(hash)
};
assert_ne!(key_v1, key_v2,
"Different pdftract version should produce different hash");
assert_ne!(
key_v1, key_v2,
"Different pdftract version should produce different hash"
);
}
#[test]
@ -463,8 +477,10 @@ mod tests {
let canon1 = canonical_json(&val1);
let canon2 = canonical_json(&val2);
assert_eq!(canon1, canon2,
"Different insertion orders should produce same canonical JSON");
assert_eq!(
canon1, canon2,
"Different insertion orders should produce same canonical JSON"
);
// Keys should be sorted
assert!(canon1.contains("\"a\":2"));
@ -489,8 +505,7 @@ mod tests {
let canon1 = canonical_json(&val1);
let canon2 = canonical_json(&val2);
assert_eq!(canon1, canon2,
"0.5 and 0.500 should serialize identically");
assert_eq!(canon1, canon2, "0.5 and 0.500 should serialize identically");
// Both should serialize to 0.5 (shortest representation)
assert!(canon1.contains("\"x\":0.5"));
@ -499,11 +514,7 @@ mod tests {
#[test]
fn test_acceptance_float_canonical_edge_cases() {
// Test various float representations
let test_cases = vec![
(1.0, "1.00"),
(0.1, "0.100"),
(1.5, "1.500"),
];
let test_cases = vec![(1.0, "1.00"), (0.1, "0.100"), (1.5, "1.500")];
for (val1, val2_str) in test_cases {
let mut map1 = Map::new();
@ -519,8 +530,11 @@ mod tests {
let canon1 = canonical_json(&val1_json);
let canon2 = canonical_json(&val2_json);
assert_eq!(canon1, canon2,
"{} and {} should serialize identically", val1, val2_str);
assert_eq!(
canon1, canon2,
"{} and {} should serialize identically",
val1, val2_str
);
}
}
@ -540,8 +554,10 @@ mod tests {
let opts3 = ExtractionOptions::with_receipts(ReceiptsMode::Lite);
let key3 = CacheKey::new("fp", &opts3);
assert_ne!(key1.opts_hash, key3.opts_hash,
"Invariant: same logical request → same key, different request → different key");
assert_ne!(
key1.opts_hash, key3.opts_hash,
"Invariant: same logical request → same key, different request → different key"
);
}
#[test]
@ -562,8 +578,7 @@ mod tests {
let canon1 = canonical_json(&Value::Object(outer1));
let canon2 = canonical_json(&Value::Object(outer2));
assert_eq!(canon1, canon2,
"Nested objects should have sorted keys");
assert_eq!(canon1, canon2, "Nested objects should have sorted keys");
}
#[test]

@ -3,8 +3,8 @@
//! This module implements the two-byte-prefix directory scheme that keeps
//! any single directory under 65K entries even at millions of cached entries.
use std::path::{Path, PathBuf};
use serde::{Deserialize, Serialize};
use std::path::{Path, PathBuf};
/// Current cache schema version.
///
@ -86,7 +86,9 @@ pub fn entry_path(
compressed_size: usize,
) -> PathBuf {
// Strip the "pdftract-v1:" prefix to get the raw hex fingerprint
let fp = fingerprint.strip_prefix(FINGERPRINT_PREFIX).unwrap_or(fingerprint);
let fp = fingerprint
.strip_prefix(FINGERPRINT_PREFIX)
.unwrap_or(fingerprint);
// Validate fingerprint is at least 4 chars (for the two-byte prefixes)
assert!(
@ -121,7 +123,9 @@ pub fn entry_path(
///
/// Path in the format `<cache_dir>/<fp[0:2]>/<fp[2:4]>/<full_fp>`
pub fn fingerprint_dir(cache_dir: &Path, fingerprint: &str) -> PathBuf {
let fp = fingerprint.strip_prefix(FINGERPRINT_PREFIX).unwrap_or(fingerprint);
let fp = fingerprint
.strip_prefix(FINGERPRINT_PREFIX)
.unwrap_or(fingerprint);
assert!(
fp.len() >= 4,
"Fingerprint must be at least 4 characters long, got: {}",
@ -225,7 +229,8 @@ pub fn load_index(cache_dir: &Path) -> Result<Option<CacheIndex>, anyhow::Error>
return Err(anyhow::anyhow!(
"Cache schema version mismatch: expected {}, got {}. \
Please clear the cache with 'pdftract cache clear' and re-populate.",
CURRENT_SCHEMA_VERSION, index.schema_version
CURRENT_SCHEMA_VERSION,
index.schema_version
));
}
@ -297,9 +302,11 @@ mod tests {
use super::*;
use tempfile::TempDir;
const TEST_FINGERPRINT: &str = "pdftract-v1:e7a1f3deadbeef00000000000000000000000000000000000000000000000000";
const TEST_FINGERPRINT: &str =
"pdftract-v1:e7a1f3deadbeef00000000000000000000000000000000000000000000000000";
const TEST_FINGERPRINT_SHORT: &str = "pdftract-v1:e7a1";
const TEST_OPTS_HASH: &str = "9b21c0ffee0000000000000000000000000000000000000000000000000000000";
const TEST_OPTS_HASH: &str =
"9b21c0ffee0000000000000000000000000000000000000000000000000000000";
#[test]
fn test_entry_path_basic() {
@ -333,10 +340,7 @@ mod tests {
assert_eq!(path2.parent(), Some(fp_dir.as_path()));
// But different filenames
assert_ne!(
path1.file_name(),
path2.file_name()
);
assert_ne!(path1.file_name(), path2.file_name());
}
#[test]
@ -354,12 +358,24 @@ mod tests {
// Check via components: skip root + cache, first prefix is e7
let mut components1 = path1.components().skip(2);
let mut components2 = path2.components().skip(2);
assert_eq!(components1.next(), Some(std::path::Component::Normal(std::ffi::OsStr::new("e7"))));
assert_eq!(components2.next(), Some(std::path::Component::Normal(std::ffi::OsStr::new("e7"))));
assert_eq!(
components1.next(),
Some(std::path::Component::Normal(std::ffi::OsStr::new("e7")))
);
assert_eq!(
components2.next(),
Some(std::path::Component::Normal(std::ffi::OsStr::new("e7")))
);
// But different second-level directories
assert_eq!(components1.next(), Some(std::path::Component::Normal(std::ffi::OsStr::new("a1"))));
assert_eq!(components2.next(), Some(std::path::Component::Normal(std::ffi::OsStr::new("b2"))));
assert_eq!(
components1.next(),
Some(std::path::Component::Normal(std::ffi::OsStr::new("a1")))
);
assert_eq!(
components2.next(),
Some(std::path::Component::Normal(std::ffi::OsStr::new("b2")))
);
}
#[test]
@ -367,7 +383,8 @@ mod tests {
let cache_dir = Path::new("/cache");
let fp_dir = fingerprint_dir(cache_dir, TEST_FINGERPRINT);
let expected = "/cache/e7/a1/e7a1f3deadbeef00000000000000000000000000000000000000000000000000";
let expected =
"/cache/e7/a1/e7a1f3deadbeef00000000000000000000000000000000000000000000000000";
assert_eq!(fp_dir, PathBuf::from(expected));
}
@ -378,14 +395,21 @@ mod tests {
// Should use the available chars: e7/a1/e7a1/...
let mut components = path.components().skip(2);
assert_eq!(components.next(), Some(std::path::Component::Normal(std::ffi::OsStr::new("e7"))));
assert_eq!(components.next(), Some(std::path::Component::Normal(std::ffi::OsStr::new("a1"))));
assert_eq!(
components.next(),
Some(std::path::Component::Normal(std::ffi::OsStr::new("e7")))
);
assert_eq!(
components.next(),
Some(std::path::Component::Normal(std::ffi::OsStr::new("a1")))
);
}
#[test]
fn test_parse_opts_hash_from_filename() {
// Valid filename
let filename = "e7a1f3deadbeef00000000000000000000000000000000000000000000000000-12387.json.zst";
let filename =
"e7a1f3deadbeef00000000000000000000000000000000000000000000000000-12387.json.zst";
let opts_hash = parse_opts_hash_from_filename(filename);
assert_eq!(
opts_hash,
@ -404,12 +428,14 @@ mod tests {
#[test]
fn test_parse_size_from_filename() {
let filename = "e7a1f3deadbeef00000000000000000000000000000000000000000000000000-12387.json.zst";
let filename =
"e7a1f3deadbeef00000000000000000000000000000000000000000000000000-12387.json.zst";
let size = parse_size_from_filename(filename);
assert_eq!(size, Some(12387));
// Different size
let filename2 = "e7a1f3deadbeef00000000000000000000000000000000000000000000000000-999.json.zst";
let filename2 =
"e7a1f3deadbeef00000000000000000000000000000000000000000000000000-999.json.zst";
let size2 = parse_size_from_filename(filename2);
assert_eq!(size2, Some(999));
@ -525,7 +551,11 @@ mod tests {
// Convert to string and check length
let path_str = path.to_str().unwrap();
// POSIX max path length is typically 4096
assert!(path_str.len() < 4096, "Path length {} exceeds 4096", path_str.len());
assert!(
path_str.len() < 4096,
"Path length {} exceeds 4096",
path_str.len()
);
// Our paths should be much shorter in practice
// Typical case: /cache + 2 + 2 + 64 + 64 + ~20 = ~154 bytes
@ -554,8 +584,14 @@ mod tests {
// Should still work: /cache/e7/a1/e7a1f3...
let mut components = path.components().skip(2);
assert_eq!(components.next(), Some(std::path::Component::Normal(std::ffi::OsStr::new("e7"))));
assert_eq!(components.next(), Some(std::path::Component::Normal(std::ffi::OsStr::new("a1"))));
assert_eq!(
components.next(),
Some(std::path::Component::Normal(std::ffi::OsStr::new("e7")))
);
assert_eq!(
components.next(),
Some(std::path::Component::Normal(std::ffi::OsStr::new("a1")))
);
}
#[test]

@ -4,7 +4,9 @@
//! file for touch-time tracking. Eviction is triggered on cache writes when
//! the total compressed size exceeds the configured limit (default 1 GiB).
use crate::cache::layout::{entry_path, parse_opts_hash_from_filename, parse_size_from_filename, sentinel_path};
use crate::cache::layout::{
entry_path, parse_opts_hash_from_filename, parse_size_from_filename, sentinel_path,
};
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::Write;
@ -138,7 +140,9 @@ impl Lru {
.unwrap_or(0);
// Strip the prefix to match filesystem layout
let fp_normalized = fingerprint.strip_prefix(FINGERPRINT_PREFIX).unwrap_or(fingerprint);
let fp_normalized = fingerprint
.strip_prefix(FINGERPRINT_PREFIX)
.unwrap_or(fingerprint);
// Build the touch record: "<timestamp> <fingerprint>/<opts_hash>\n"
let record = format!("{} {}/{}\n", timestamp, fp_normalized, opts_hash);
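
Reviewer note: the touch record built in this hunk has the shape `<timestamp> <fingerprint>/<opts_hash>\n`, with the fingerprint stored without its `pdftract-v1:` prefix. A minimal round-trip sketch of that format follows. The function names `format_touch_record` and `parse_touch_record` are hypothetical and not part of this commit, and the constant value is assumed from the test fingerprints further down.

```rust
/// Assumed value: the tests in this diff use fingerprints that start with
/// "pdftract-v1:"; the real constant lives in the cache layout module.
const FINGERPRINT_PREFIX: &str = "pdftract-v1:";

/// Hypothetical helper: build "<timestamp> <fingerprint>/<opts_hash>\n",
/// stripping the prefix so the record matches the on-disk directory name.
fn format_touch_record(timestamp: u64, fingerprint: &str, opts_hash: &str) -> String {
    let fp = fingerprint
        .strip_prefix(FINGERPRINT_PREFIX)
        .unwrap_or(fingerprint);
    format!("{} {}/{}\n", timestamp, fp, opts_hash)
}

/// Hypothetical helper: parse one sentinel line back into its parts.
/// Returns None for any line that does not match the expected shape.
fn parse_touch_record(line: &str) -> Option<(u64, &str, &str)> {
    let (ts, key) = line.trim_end().split_once(' ')?;
    let (fp, opts) = key.split_once('/')?;
    Some((ts.parse().ok()?, fp, opts))
}
```

Returning `None` on malformed lines lets a reader skip a partially written record, which can occur when several processes append to the sentinel concurrently.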
@ -220,29 +224,31 @@ impl Lru {
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name().to_string_lossy().chars().all(|c| c.is_ascii_hexdigit())
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
let prefix1_dir = prefix1_entry.path();
// Walk the second-level prefix directories
for prefix2_entry in prefix1_dir.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
for prefix2_entry in prefix1_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix2_dir = prefix2_entry.path();
// Walk the fingerprint directories
for fp_entry in prefix2_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
}) {
for fp_entry in prefix2_dir
.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| e.path().is_dir())
{
let fp_dir = fp_entry.path();
// Walk the entry files
@ -276,10 +282,8 @@ impl Lru {
// Check if sentinel exists and exceeds rotation threshold
if let Ok(metadata) = sentinel_file.metadata() {
if metadata.len() > SENTINEL_ROTATION_SIZE {
let old_path = sentinel_file.with_extension(&format!(
"touched{}",
SENTINEL_OLD_SUFFIX
));
let old_path =
sentinel_file.with_extension(&format!("touched{}", SENTINEL_OLD_SUFFIX));
// Move current to .old (replace existing .old)
let _ = std::fs::remove_file(&old_path); // Ignore error if doesn't exist
@ -314,27 +318,22 @@ impl Lru {
.filter_map(|e| e.ok())
.filter(|e| {
let name = e.file_name().to_string_lossy().to_string();
e.path().is_dir()
&& name.len() == 2
&& name.chars().all(|c| c.is_ascii_hexdigit())
e.path().is_dir() && name.len() == 2 && name.chars().all(|c| c.is_ascii_hexdigit())
})
{
let prefix1_dir = prefix1_entry.path();
for prefix2_entry in prefix1_dir.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| {
let name = e.file_name().to_string_lossy().to_string();
e.path().is_dir()
&& name.len() == 2
&& name.chars().all(|c| c.is_ascii_hexdigit())
})
{
for prefix2_entry in prefix1_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
let name = e.file_name().to_string_lossy().to_string();
e.path().is_dir() && name.len() == 2 && name.chars().all(|c| c.is_ascii_hexdigit())
}) {
let prefix2_dir = prefix2_entry.path();
for fp_entry in prefix2_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
}) {
for fp_entry in prefix2_dir
.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| e.path().is_dir())
{
let fp_dir = fp_entry.path();
// Extract fingerprint from path (last component)
@ -347,7 +346,10 @@ impl Lru {
for entry in fp_dir.read_dir()?.filter_map(|e| e.ok()) {
let path = entry.path();
if path.is_file() {
let filename_opt = path.file_name().and_then(|n| n.to_str()).map(|s| s.to_string());
let filename_opt = path
.file_name()
.and_then(|n| n.to_str())
.map(|s| s.to_string());
if let Some(filename) = filename_opt {
if let (Some(opts_hash), Some(size)) = (
parse_opts_hash_from_filename(&filename),
@ -441,10 +443,7 @@ impl Lru {
}
// Read the old sentinel file (.old) if it exists
let old_sentinel = sentinel_file.with_extension(&format!(
"touched{}",
SENTINEL_OLD_SUFFIX
));
let old_sentinel = sentinel_file.with_extension(&format!("touched{}", SENTINEL_OLD_SUFFIX));
if let Ok(contents) = std::fs::read_to_string(&old_sentinel) {
for line in contents.lines().rev() {
let parts: Vec<&str> = line.splitn(2, ' ').collect();
@ -499,27 +498,29 @@ impl Lru {
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name().to_string_lossy().chars().all(|c| c.is_ascii_hexdigit())
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
let prefix1_dir = prefix1_entry.path();
for prefix2_entry in prefix1_dir.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
for prefix2_entry in prefix1_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix2_dir = prefix2_entry.path();
for fp_entry in prefix2_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
}) {
for fp_entry in prefix2_dir
.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| e.path().is_dir())
{
let fp_dir = fp_entry.path();
// Check if the fingerprint directory is empty
@ -563,7 +564,7 @@ impl Lru {
Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
// Sentinel doesn't exist yet (no entries touched), nothing to truncate
return Ok(());
},
}
Err(e) => return Err(e),
};
let lines: Vec<&str> = contents.lines().collect();
@ -588,10 +589,13 @@ mod tests {
use std::fs;
use tempfile::TempDir;
const TEST_FINGERPRINT: &str = "pdftract-v1:e7a1f3deadbeef00000000000000000000000000000000000000000000000000";
const TEST_FINGERPRINT_2: &str = "pdftract-v1:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb";
const TEST_FINGERPRINT: &str =
"pdftract-v1:e7a1f3deadbeef00000000000000000000000000000000000000000000000000";
const TEST_FINGERPRINT_2: &str =
"pdftract-v1:bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb";
const TEST_OPTS_HASH: &str = "9b21c0ffee000000000000000000000000000000000000000000000000000000"; // 64 chars
const TEST_OPTS_HASH_2: &str = "aaaaaaaa00000000000000000000000000000000000000000000000000000000"; // 64 chars
const TEST_OPTS_HASH_2: &str =
"aaaaaaaa00000000000000000000000000000000000000000000000000000000"; // 64 chars
/// Create a test cache entry file.
fn create_test_entry(cache_dir: &Path, fp: &str, opts: &str, size: usize) -> PathBuf {
@ -626,7 +630,9 @@ mod tests {
let contents = fs::read_to_string(&sentinel_file).unwrap();
// Sentinel stores fingerprint without prefix
let fp_normalized = TEST_FINGERPRINT.strip_prefix(FINGERPRINT_PREFIX).unwrap_or(TEST_FINGERPRINT);
let fp_normalized = TEST_FINGERPRINT
.strip_prefix(FINGERPRINT_PREFIX)
.unwrap_or(TEST_FINGERPRINT);
assert!(contents.contains(&format!("{}/{}", fp_normalized, TEST_OPTS_HASH)));
}
@ -655,7 +661,9 @@ mod tests {
assert!(now.saturating_sub(timestamp) < 10);
// Second part should be "fp/opts_hash" (fp without prefix)
let fp_normalized = TEST_FINGERPRINT.strip_prefix(FINGERPRINT_PREFIX).unwrap_or(TEST_FINGERPRINT);
let fp_normalized = TEST_FINGERPRINT
.strip_prefix(FINGERPRINT_PREFIX)
.unwrap_or(TEST_FINGERPRINT);
assert_eq!(parts[1], &format!("{}/{}", fp_normalized, TEST_OPTS_HASH));
}
@ -725,7 +733,10 @@ mod tests {
// Verify touch was written
let sentinel_file = sentinel_path(cache_dir);
let sentinel_contents = fs::read_to_string(&sentinel_file).unwrap();
assert!(sentinel_contents.contains(TEST_OPTS_HASH), "Sentinel should contain opts_hash");
assert!(
sentinel_contents.contains(TEST_OPTS_HASH),
"Sentinel should contain opts_hash"
);
// Trigger eviction
lru.maybe_evict().unwrap();
@ -798,7 +809,11 @@ mod tests {
}
// Should have at least 95 parseable records (allowing for some edge cases)
assert!(parseable_count >= 95, "Expected at least 95 parseable records, got {}", parseable_count);
assert!(
parseable_count >= 95,
"Expected at least 95 parseable records, got {}",
parseable_count
);
}
#[test]
@ -823,7 +838,16 @@ mod tests {
.open(&sentinel_file)
.unwrap();
for _ in 0..5 {
writeln!(file, "{} {}", SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs(), large_data).unwrap();
writeln!(
file,
"{} {}",
SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap()
.as_secs(),
large_data
)
.unwrap();
}
}
@ -835,10 +859,7 @@ mod tests {
lru.touch(TEST_FINGERPRINT_2, TEST_OPTS_HASH_2).unwrap();
// Old sentinel should exist
let old_sentinel = sentinel_file.with_extension(&format!(
"touched{}",
SENTINEL_OLD_SUFFIX
));
let old_sentinel = sentinel_file.with_extension(&format!("touched{}", SENTINEL_OLD_SUFFIX));
assert!(old_sentinel.exists());
// New sentinel should be smaller
@ -891,15 +912,31 @@ mod tests {
lru.touch(TEST_FINGERPRINT_2, TEST_OPTS_HASH).unwrap(); // newest
// Build LRU order (use fingerprints without prefix to match filesystem layout)
let fp1 = TEST_FINGERPRINT.strip_prefix(FINGERPRINT_PREFIX).unwrap_or(TEST_FINGERPRINT);
let fp2 = TEST_FINGERPRINT_2.strip_prefix(FINGERPRINT_PREFIX).unwrap_or(TEST_FINGERPRINT_2);
let fp1 = TEST_FINGERPRINT
.strip_prefix(FINGERPRINT_PREFIX)
.unwrap_or(TEST_FINGERPRINT);
let fp2 = TEST_FINGERPRINT_2
.strip_prefix(FINGERPRINT_PREFIX)
.unwrap_or(TEST_FINGERPRINT_2);
let entries = vec![
(fp1.to_string(), TEST_OPTS_HASH.to_string(), 1000,
entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH, 1000)),
(fp1.to_string(), TEST_OPTS_HASH_2.to_string(), 2000,
entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH_2, 2000)),
(fp2.to_string(), TEST_OPTS_HASH.to_string(), 3000,
entry_path(cache_dir, TEST_FINGERPRINT_2, TEST_OPTS_HASH, 3000)),
(
fp1.to_string(),
TEST_OPTS_HASH.to_string(),
1000,
entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH, 1000),
),
(
fp1.to_string(),
TEST_OPTS_HASH_2.to_string(),
2000,
entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH_2, 2000),
),
(
fp2.to_string(),
TEST_OPTS_HASH.to_string(),
3000,
entry_path(cache_dir, TEST_FINGERPRINT_2, TEST_OPTS_HASH, 3000),
),
];
let lru_order = lru.build_lru_order(&entries).unwrap();
@ -1007,14 +1044,16 @@ mod tests {
// Helper to generate valid 64-char hex opts hashes with a counter
// Replace the last 4 chars of the base hash with hex counter
let gen_opts = |i: u32| -> String {
format!("{}{:04x}", &TEST_OPTS_HASH[..60], i)
};
let gen_opts = |i: u32| -> String { format!("{}{:04x}", &TEST_OPTS_HASH[..60], i) };
// Helper to generate valid 64-char hex fingerprints with a counter
// Replace the last 4 chars of the base fingerprint with hex counter
let gen_fp = |i: u32| -> String {
format!("{}{:04x}", &TEST_FINGERPRINT[FINGERPRINT_PREFIX.len()..60], i)
format!(
"{}{:04x}",
&TEST_FINGERPRINT[FINGERPRINT_PREFIX.len()..60],
i
)
};
// Create 1000 entries totaling 100 MB (over limit)
@ -1083,7 +1122,9 @@ mod tests {
// Helper function to get fingerprint dir (copied from layout module)
fn fingerprint_dir(cache_dir: &Path, fingerprint: &str) -> PathBuf {
const FINGERPRINT_PREFIX: &str = "pdftract-v1:";
let fp = fingerprint.strip_prefix(FINGERPRINT_PREFIX).unwrap_or(fingerprint);
let fp = fingerprint
.strip_prefix(FINGERPRINT_PREFIX)
.unwrap_or(fingerprint);
let prefix1 = &fp[0..2.min(fp.len())];
let prefix2 = &fp[2..4.min(fp.len())];
cache_dir.join(prefix1).join(prefix2).join(fp)
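
Reviewer note: the same "two hex characters and a directory" filter is repeated in every cache walk touched by this diff. One way to factor it out is sketched below. `is_hex_prefix` and `hex_prefix_dirs` are hypothetical names, not functions in this commit, and the sort is an addition for deterministic output.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True when `name` looks like one level of the sharded cache layout:
/// exactly two ASCII hex digits (for example "e7" or "a1").
fn is_hex_prefix(name: &str) -> bool {
    name.len() == 2 && name.chars().all(|c| c.is_ascii_hexdigit())
}

/// List the sub-directories of `dir` whose names are two hex digits.
/// Unreadable entries are skipped, mirroring `filter_map(|e| e.ok())`.
fn hex_prefix_dirs(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)?.filter_map(|e| e.ok()) {
        let path = entry.path();
        if path.is_dir() && is_hex_prefix(&entry.file_name().to_string_lossy()) {
            out.push(path);
        }
    }
    out.sort(); // Stable order; read_dir makes no ordering promise.
    Ok(out)
}
```

With a helper like this, each nested walk becomes `for prefix1_dir in hex_prefix_dirs(cache_dir)?`, which would also sidestep the differing layouts rustfmt produces for the inline closures.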


@ -22,16 +22,18 @@
//! - [`compression`] — Zstandard compression/decompression for cache entries
//! - [`metadata`] — Cache index.json and metadata handling (TODO: 6.9.3)
pub mod compression;
pub mod key;
pub mod layout;
pub mod compression;
pub mod multi_process;
pub mod lru;
pub mod multi_process;
pub use key::CacheKey;
pub use layout::{entry_path, CacheIndex, CURRENT_SCHEMA_VERSION, increment_hit_counter, increment_miss_counter};
pub use multi_process::{Reader, Writer, cleanup_stale_temp_files};
pub use layout::{
entry_path, increment_hit_counter, increment_miss_counter, CacheIndex, CURRENT_SCHEMA_VERSION,
};
pub use lru::Lru;
pub use multi_process::{cleanup_stale_temp_files, Reader, Writer};
use crate::extract::ExtractionResult;
use crate::options::ExtractionOptions;
@ -44,7 +46,10 @@ use std::time::{SystemTime, UNIX_EPOCH};
#[derive(Debug)]
pub enum CacheLookupResult {
/// Cache hit: entry found and deserialized successfully
Hit { result: ExtractionResult, age_seconds: u64 },
Hit {
result: ExtractionResult,
age_seconds: u64,
},
/// Cache miss: entry not found or corrupt (will be overwritten)
Miss,
/// Cache skipped: cache not configured or disabled
@ -126,7 +131,10 @@ pub fn extract_with_cache(
Ok(result) => {
// Cache hit - increment counter and touch the entry
let _ = increment_hit_counter(cache_dir);
let lru = Lru::new(cache_dir, cache_size_bytes.unwrap_or(lru::DEFAULT_CACHE_SIZE_BYTES));
let lru = Lru::new(
cache_dir,
cache_size_bytes.unwrap_or(lru::DEFAULT_CACHE_SIZE_BYTES),
);
let _ = lru.touch(&fingerprint, &key.opts_hash);
return Ok((result, "hit".to_string(), Some(age_seconds)));
}
@ -154,7 +162,8 @@ pub fn extract_with_cache(
match compression::encode(&json_data) {
Ok(compressed) => {
let writer = Writer::new(cache_dir);
let _ = writer.write(&fingerprint, &key.opts_hash, compressed.len(), &compressed);
let _ =
writer.write(&fingerprint, &key.opts_hash, compressed.len(), &compressed);
// Update index entry count and total bytes
if let Ok(mut index) = layout::load_index(cache_dir) {
@ -165,7 +174,10 @@ pub fn extract_with_cache(
}
// Trigger LRU eviction if needed
let lru = Lru::new(cache_dir, cache_size_bytes.unwrap_or(lru::DEFAULT_CACHE_SIZE_BYTES));
let lru = Lru::new(
cache_dir,
cache_size_bytes.unwrap_or(lru::DEFAULT_CACHE_SIZE_BYTES),
);
let _ = lru.maybe_evict();
}
Err(_) => {


@ -373,14 +373,14 @@ pub fn cleanup_stale_temp_files(cache_dir: &Path) -> io::Result<()> {
let _cleaned = 0;
// Walk the two-byte prefix directories
for prefix1_entry in fs::read_dir(cache_dir)?
.filter_map(|e| e.ok())
.filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name().to_string_lossy().chars().all(|c| c.is_ascii_hexdigit())
})
{
for prefix1_entry in fs::read_dir(cache_dir)?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
&& e.file_name().to_string_lossy().len() == 2
&& e.file_name()
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
}) {
let prefix1_dir = prefix1_entry.path();
// Walk the second-level prefix directories
@ -391,14 +391,15 @@ pub fn cleanup_stale_temp_files(cache_dir: &Path) -> io::Result<()> {
.to_string_lossy()
.chars()
.all(|c| c.is_ascii_hexdigit())
})
{
}) {
let prefix2_dir = prefix2_entry.path();
// Walk the fingerprint directories
for fp_entry in prefix2_dir.read_dir()?.filter_map(|e| e.ok()).filter(|e| {
e.path().is_dir()
}) {
for fp_entry in prefix2_dir
.read_dir()?
.filter_map(|e| e.ok())
.filter(|e| e.path().is_dir())
{
let fp_dir = fp_entry.path();
// Walk the entry files
@ -413,7 +414,8 @@ pub fn cleanup_stale_temp_files(cache_dir: &Path) -> io::Result<()> {
if let Ok(metadata) = path.metadata() {
if let Ok(modified) = metadata.modified() {
if let Ok(duration) = modified.duration_since(UNIX_EPOCH) {
let age_seconds = now.saturating_sub(duration.as_secs());
let age_seconds =
now.saturating_sub(duration.as_secs());
if age_seconds > TEMP_FILE_MAX_AGE_SECONDS {
// Delete stale temp file
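
Reviewer note: the staleness check in this hunk compares a file's modification time with the current time in whole seconds and saturates at zero if the file appears to come from the future. A standalone sketch of that predicate follows. `is_stale` is a hypothetical name, and the maximum age is passed in because the value of `TEMP_FILE_MAX_AGE_SECONDS` is not visible in this diff.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical predicate: a temp file is stale when its age in seconds
/// strictly exceeds `max_age_secs`. A modification time before the Unix
/// epoch cannot be expressed as a duration and is treated as not stale.
fn is_stale(modified: SystemTime, now_secs: u64, max_age_secs: u64) -> bool {
    match modified.duration_since(UNIX_EPOCH) {
        // saturating_sub keeps a clock-skewed "future" file at age 0.
        Ok(d) => now_secs.saturating_sub(d.as_secs()) > max_age_secs,
        Err(_) => false,
    }
}
```

Keeping the comparison in a pure function makes the threshold behaviour testable without touching the filesystem or sleeping.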
@ -441,7 +443,8 @@ mod tests {
use std::time::Duration;
use tempfile::TempDir;
const TEST_FINGERPRINT: &str = "pdftract-v1:e7a1f3deadbeef00000000000000000000000000000000000000000000000000";
const TEST_FINGERPRINT: &str =
"pdftract-v1:e7a1f3deadbeef00000000000000000000000000000000000000000000000000";
const TEST_OPTS_HASH: &str = "9b21c0ffee000000000000000000000000000000000000000000000000000000";
const TEST_DATA: &[u8] = b"test cache entry data";
@ -458,12 +461,19 @@ mod tests {
let compressed = compress_data(TEST_DATA);
writer
.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len(), &compressed)
.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
&compressed,
)
.unwrap();
// Verify the entry exists
let reader = Reader::new(cache_dir);
let result = reader.read(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len()).unwrap();
let result = reader
.read(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len())
.unwrap();
assert_eq!(result, TEST_DATA);
}
@ -493,7 +503,12 @@ mod tests {
// Write entry
writer
.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len(), &compressed)
.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
&compressed,
)
.unwrap();
// Now it exists
@ -509,12 +524,22 @@ mod tests {
let compressed = compress_data(TEST_DATA);
// Parent directories don't exist yet
let entry = entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len());
let entry = entry_path(
cache_dir,
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
);
assert!(!entry.exists());
// Write should create parent directories
writer
.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len(), &compressed)
.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
&compressed,
)
.unwrap();
assert!(entry.exists());
@ -535,19 +560,32 @@ mod tests {
let handle1 = thread::spawn(move || {
let writer = Writer::new(&cache_dir1);
writer.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed_size, &compressed1)
writer.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed_size,
&compressed1,
)
});
let handle2 = thread::spawn(move || {
let writer = Writer::new(&cache_dir2);
writer.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed_size, &compressed2)
writer.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed_size,
&compressed2,
)
});
// Both should succeed (no deadlock)
let result1 = handle1.join().unwrap();
let result2 = handle2.join().unwrap();
assert!(result1.is_ok() || result2.is_ok(), "At least one writer should succeed");
assert!(
result1.is_ok() || result2.is_ok(),
"At least one writer should succeed"
);
// The final entry should be valid (one of the two)
let reader = Reader::new(&cache_dir);
@ -594,9 +632,9 @@ mod tests {
// Need to find the actual compressed size
let entry_path_buf = entry_path(&cache_dir, &fp, &opts, 0);
let entry_dir = entry_path_buf.parent().unwrap();
let _found = fs::read_dir(entry_dir).unwrap().any(|e| {
e.ok().filter(|f| f.path().is_file()).is_some()
});
let _found = fs::read_dir(entry_dir)
.unwrap()
.any(|e| e.ok().filter(|f| f.path().is_file()).is_some());
assert!(_found, "Entry {} should exist", i);
}
@ -612,10 +650,20 @@ mod tests {
// Write a valid entry
writer
.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len(), &compressed)
.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
&compressed,
)
.unwrap();
let entry = entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len());
let entry = entry_path(
cache_dir,
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
);
// Corrupt the entry by truncating it
{
@ -647,7 +695,12 @@ mod tests {
let compressed = compress_data(TEST_DATA);
// Create a temp file manually
let entry = entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len());
let entry = entry_path(
cache_dir,
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
);
let temp_path = writer.temp_path(&entry);
// Create parent directory first
@ -678,7 +731,12 @@ mod tests {
let compressed = compress_data(TEST_DATA);
// Create a recent temp file
let entry = entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len());
let entry = entry_path(
cache_dir,
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
);
let temp_path = writer.temp_path(&entry);
// Create parent directory first
@ -723,7 +781,12 @@ mod tests {
let writer = Writer::new(cache_dir);
let compressed = compress_data(TEST_DATA);
let entry = entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len());
let entry = entry_path(
cache_dir,
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
);
// Generate multiple temp paths
let path1 = writer.temp_path(&entry);
@ -754,7 +817,12 @@ mod tests {
// This should work normally
writer
.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len(), &compressed)
.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
&compressed,
)
.unwrap();
// Verify the entry exists
@ -838,7 +906,8 @@ mod tests {
thread::spawn(move || {
for iter in 0..NUM_ITERATIONS {
for (key_idx, (fp, opts)) in keys.iter().enumerate() {
let data = format!("process {} iteration {} key {}", proc_id, iter, key_idx);
let data =
format!("process {} iteration {} key {}", proc_id, iter, key_idx);
let compressed = compress_data(data.as_bytes());
let size = compressed.len();
@ -871,9 +940,9 @@ mod tests {
let entry_path_buf = entry_path(&cache_dir, fp, opts, 0);
let fp_dir = entry_path_buf.parent().unwrap();
if fp_dir.exists() {
let _found = fs::read_dir(fp_dir).unwrap().any(|e| {
e.ok().filter(|f| f.path().is_file()).is_some()
});
let _found = fs::read_dir(fp_dir)
.unwrap()
.any(|e| e.ok().filter(|f| f.path().is_file()).is_some());
// At least one entry should exist for this key
// (may have multiple versions due to concurrent writes)
}
@ -923,12 +992,22 @@ mod tests {
let handle1 = thread::spawn(move || {
let writer = Writer::new(&cache_dir1);
writer.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed_size, &compressed1)
writer.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed_size,
&compressed1,
)
});
let handle2 = thread::spawn(move || {
let writer = Writer::new(&cache_dir2);
writer.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed_size, &compressed2)
writer.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed_size,
&compressed2,
)
});
// Both should succeed without deadlock
@ -941,7 +1020,10 @@ mod tests {
// Final entry should be valid
let reader = Reader::new(&cache_dir);
let result = reader.read(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed_size);
assert!(result.is_ok(), "Entry should be readable after concurrent writes");
assert!(
result.is_ok(),
"Entry should be readable after concurrent writes"
);
}
#[test]
@ -960,7 +1042,12 @@ mod tests {
let compressed = compressed.clone();
thread::spawn(move || {
let writer = Writer::new(&cache_dir);
writer.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed_size, &compressed)
writer.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed_size,
&compressed,
)
})
})
.collect();
@ -1006,11 +1093,21 @@ mod tests {
let compressed = compress_data(TEST_DATA);
writer
.write(TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len(), &compressed)
.write(
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
&compressed,
)
.unwrap();
// Corrupt the entry
let entry = entry_path(cache_dir, TEST_FINGERPRINT, TEST_OPTS_HASH, compressed.len());
let entry = entry_path(
cache_dir,
TEST_FINGERPRINT,
TEST_OPTS_HASH,
compressed.len(),
);
fs::write(&entry, b"corrupted data").unwrap();
// Read should detect corruption, delete entry, and return error


@ -25,8 +25,8 @@
//! 4. After all signals run: tally votes weighted by strength; pick highest-weight class
//! 5. If no signal voted, default to Vector with confidence 0.5
use std::collections::BTreeSet;
use serde::{Deserialize, Serialize};
use std::collections::BTreeSet;
/// Page context containing all metrics needed for classification.
///
@ -360,7 +360,8 @@ impl PageClassifier {
}
// Weight each class by sum of strengths
let mut class_weights: std::collections::HashMap<PageClass, f32> = std::collections::HashMap::new();
let mut class_weights: std::collections::HashMap<PageClass, f32> =
std::collections::HashMap::new();
let mut total_weight = 0.0;
for vote in &votes {
@ -960,7 +961,10 @@ mod tests {
set2.insert(2);
// Iteration order should be the same
assert_eq!(set1.iter().collect::<Vec<_>>(), set2.iter().collect::<Vec<_>>());
assert_eq!(
set1.iter().collect::<Vec<_>>(),
set2.iter().collect::<Vec<_>>()
);
}
#[test]
@ -1022,9 +1026,12 @@ mod tests {
// Verify all scanned cells are from rows 2-7 only
for flat in scanned_cells {
let cell = CellIndex::from_flat(*flat);
assert!(cell.row >= 2 && cell.row <= 7,
assert!(
cell.row >= 2 && cell.row <= 7,
"scanned cell at flat {} should be in rows 2-7, got row {}",
flat, cell.row);
flat,
cell.row
);
}
}
@ -1432,7 +1439,10 @@ mod tests {
assert_eq!(result1.class, result2.class);
assert_eq!(result1.confidence, result2.confidence);
assert_eq!(result1.hybrid_cells.is_some(), result2.hybrid_cells.is_some());
assert_eq!(
result1.hybrid_cells.is_some(),
result2.hybrid_cells.is_some()
);
}
#[test]
@ -1440,9 +1450,9 @@ mod tests {
// Verify all confidence values are in [0.0, 1.0]
let test_cases = vec![
// (text_ops, raw_chars, valid_chars, image_cov, density)
(0, 0, 0, 0.0, 0.0), // blank
(0, 0, 0, 0.95, 0.0), // scanned
(100, 1000, 100, 0.1, 0.1), // low validity
(0, 0, 0, 0.0, 0.0), // blank
(0, 0, 0, 0.95, 0.0), // scanned
(100, 1000, 100, 0.1, 0.1), // low validity
(500, 3000, 2900, 0.0, 0.9), // high validity vector
(200, 1500, 1400, 0.7, 0.5), // ambiguous
];
@ -1459,7 +1469,12 @@ mod tests {
assert!(
result.confidence >= 0.0 && result.confidence <= 1.0,
"confidence {} out of range for case ({}, {}, {}, {}, {})",
result.confidence, text_ops, raw, valid, img_cov, density
result.confidence,
text_ops,
raw,
valid,
img_cov,
density
);
}
}
@ -1585,9 +1600,17 @@ mod tests {
grid_cells: Some(std::array::from_fn(|i| {
let row = i / 8;
if row < 2 {
CellData { text_op_count: 15, image_coverage: 0.05, char_validity: 0.95 }
CellData {
text_op_count: 15,
image_coverage: 0.05,
char_validity: 0.95,
}
} else {
CellData { text_op_count: 0, image_coverage: 0.90, char_validity: 0.0 }
CellData {
text_op_count: 0,
image_coverage: 0.90,
char_validity: 0.0,
}
}
})),
},
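
Reviewer note: the module docs at the top of this file's diff describe the final step as "tally votes weighted by strength; pick highest-weight class", with a default of Vector at confidence 0.5 when no signal voted. A minimal sketch of that step follows. The `Vote` struct, the reduced `PageClass` enum, and the confidence formula (winning weight divided by total weight) are assumptions for illustration; the real types and formula in the classifier may differ.

```rust
use std::collections::BTreeMap;

/// Illustrative subset of page classes; only `Vector` as the default is
/// confirmed by the module docs.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum PageClass {
    Vector,
    Scanned,
    Blank,
}

/// Hypothetical vote record: one signal's class and its strength.
struct Vote {
    class: PageClass,
    strength: f32,
}

/// Sum strengths per class and return the heaviest class together with
/// an assumed confidence of winning_weight / total_weight.
fn tally(votes: &[Vote]) -> (PageClass, f32) {
    if votes.is_empty() {
        return (PageClass::Vector, 0.5); // Documented default.
    }
    // BTreeMap gives a deterministic iteration order, so ties always
    // resolve to the same class.
    let mut weights: BTreeMap<PageClass, f32> = BTreeMap::new();
    let mut total = 0.0_f32;
    for v in votes {
        *weights.entry(v.class).or_insert(0.0) += v.strength;
        total += v.strength;
    }
    let mut best = (PageClass::Vector, f32::MIN);
    for (&class, &w) in &weights {
        if w > best.1 {
            best = (class, w);
        }
    }
    let confidence = if total > 0.0 {
        (best.1 / total).clamp(0.0, 1.0)
    } else {
        0.5
    };
    (best.0, confidence)
}
```

The sketch uses a `BTreeMap` rather than the `HashMap` seen in the hunk above only to make tie-breaking reproducible; that is a design choice of the example, not a claim about the commit.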


@ -673,8 +673,14 @@ mod tests {
// Verify both modes complete successfully
// The actual 10% speedup comes from skipping ToUnicode lookup
// which is implemented in the process_string function
assert!(normal_duration.as_nanos() > 0, "Normal mode should complete");
assert!(hint_duration.as_nanos() > 0, "PositionHint mode should complete");
assert!(
normal_duration.as_nanos() > 0,
"Normal mode should complete"
);
assert!(
hint_duration.as_nanos() > 0,
"PositionHint mode should complete"
);
// In practice, PositionHint is faster because it skips ToUnicode lookup.
// This test verifies the code paths work correctly; for actual


@ -9,14 +9,16 @@
//! `PageIter` which yields pages lazily without materializing the entire page tree.
//! Use `PdfExtractor::pages()` to get an iterator that extracts each page on-demand.
use crate::fingerprint::{CatalogFlags, ContentStreamData, FingerprintInput, PageFingerprintData, compute_fingerprint};
use crate::fingerprint::{
compute_fingerprint, CatalogFlags, ContentStreamData, FingerprintInput, PageFingerprintData,
};
use crate::parser::catalog::{parse_catalog, Catalog};
use crate::parser::pages::{flatten_page_tree, PageDict, LazyPageIter};
use crate::parser::pages::{flatten_page_tree, LazyPageIter, PageDict};
use crate::parser::stream::{FileSource, PdfSource};
use crate::parser::xref::{XrefResolver, load_xref_with_prev_chain, XrefSection};
use crate::parser::xref::{load_xref_with_prev_chain, XrefResolver, XrefSection};
use crate::receipts::verifier::SpanData;
use anyhow::{Context, Result, anyhow};
use serde::{Serialize, Deserialize};
use anyhow::{anyhow, Context, Result};
use serde::{Deserialize, Serialize};
use std::path::Path;
/// Parse a PDF file and return the document components needed for verification.
@ -35,14 +37,19 @@ use std::path::Path;
/// # Returns
///
/// A tuple of (fingerprint, catalog, pages, resolver)
pub fn parse_pdf_file(pdf_path: &std::path::Path) -> Result<(String, Catalog, Vec<crate::parser::pages::PageDict>, XrefResolver)> {
pub fn parse_pdf_file(
pdf_path: &std::path::Path,
) -> Result<(
String,
Catalog,
Vec<crate::parser::pages::PageDict>,
XrefResolver,
)> {
// Open the PDF file
let source = FileSource::open(pdf_path)
.context("Failed to open PDF file")?;
let source = FileSource::open(pdf_path).context("Failed to open PDF file")?;
// Find the startxref offset
let startxref_offset = find_startxref(&source)
.context("Failed to find startxref offset")?;
let startxref_offset = find_startxref(&source).context("Failed to find startxref offset")?;
// Load the xref table
let xref_section = load_xref_with_prev_chain(&source, startxref_offset);
@ -51,29 +58,30 @@ pub fn parse_pdf_file(pdf_path: &std::path::Path) -> Result<(String, Catalog, Ve
let resolver = XrefResolver::from_section(xref_section.clone());
// Get the root reference from trailer
let root_ref = xref_section.trailer
let root_ref = xref_section
.trailer
.as_ref()
.and_then(|trailer| trailer.get("Root"))
.and_then(|obj| obj.as_ref())
.ok_or_else(|| anyhow!("No /Root reference in trailer"))?;
// Parse the catalog
let catalog = parse_catalog(&resolver, root_ref)
.map_err(|diagnostics| {
let msg = diagnostics.first()
.map(|d| d.message.as_ref())
.unwrap_or("unknown error");
anyhow!("Failed to parse catalog: {}", msg)
})?;
let catalog = parse_catalog(&resolver, root_ref).map_err(|diagnostics| {
let msg = diagnostics
.first()
.map(|d| d.message.as_ref())
.unwrap_or("unknown error");
anyhow!("Failed to parse catalog: {}", msg)
})?;
// Flatten the page tree
let pages = flatten_page_tree(&resolver, catalog.pages_ref)
.map_err(|diagnostics| {
let msg = diagnostics.first()
.map(|d| d.message.as_ref())
.unwrap_or("unknown error");
anyhow!("Failed to flatten page tree: {}", msg)
})?;
let pages = flatten_page_tree(&resolver, catalog.pages_ref).map_err(|diagnostics| {
let msg = diagnostics
.first()
.map(|d| d.message.as_ref())
.unwrap_or("unknown error");
anyhow!("Failed to flatten page tree: {}", msg)
})?;
// Build fingerprint input
let fingerprint_input = build_fingerprint_input(&catalog, &pages, &xref_section);
@ -92,11 +100,13 @@ fn find_startxref(source: &dyn PdfSource) -> Result<u64> {
let scan_start = len.saturating_sub(1024);
let scan_end = len;
let tail_data = source.read_at(scan_start as u64, scan_end - scan_start)
let tail_data = source
.read_at(scan_start as u64, scan_end - scan_start)
.context("Failed to read PDF tail")?;
// Find "startxref" in the tail data
let startxref_pos = tail_data.windows(9)
let startxref_pos = tail_data
.windows(9)
.rposition(|w| w == b"startxref")
.ok_or_else(|| anyhow!("startxref not found in PDF"))?;
@ -105,21 +115,25 @@ fn find_startxref(source: &dyn PdfSource) -> Result<u64> {
let offset_data = &tail_data[startxref_pos + 9..];
// Skip leading whitespace (space, \r, \n, \t)
let offset_start = offset_data.iter()
let offset_start = offset_data
.iter()
.position(|&b| !matches!(b, b' ' | b'\r' | b'\n' | b'\t'))
.unwrap_or(offset_data.len());
let offset_data_trimmed = &offset_data[offset_start..];
// Find the newline after the offset
let newline_pos = offset_data_trimmed.iter()
let newline_pos = offset_data_trimmed
.iter()
.position(|&b| b == b'\n' || b == b'\r')
.unwrap_or(offset_data_trimmed.len());
let offset_str = std::str::from_utf8(&offset_data_trimmed[..newline_pos])
.context("startxref offset is not valid UTF-8")?;
let offset: u64 = offset_str.trim().parse()
let offset: u64 = offset_str
.trim()
.parse()
.context("startxref offset is not a valid number")?;
Ok(offset)
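
Reviewer note: the `find_startxref` logic reformatted above scans the tail of the file for the last `startxref` keyword, skips whitespace, and parses the decimal offset that follows. The same steps on a plain byte slice can be sketched as follows. `parse_startxref` is a hypothetical name, and it returns `Option` instead of the `anyhow::Result` used in the commit so that the example needs no external crates.

```rust
/// Find the byte offset that follows the last "startxref" keyword in
/// `tail`. Returns None if the keyword or a valid number is missing.
fn parse_startxref(tail: &[u8]) -> Option<u64> {
    const KEYWORD: &[u8] = b"startxref";

    // Search from the end: incrementally updated PDFs can contain several
    // startxref sections, and the last one is the one to follow.
    let pos = tail.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let rest = &tail[pos + KEYWORD.len()..];

    // Skip leading whitespace (space, \r, \n, \t) before the digits.
    let start = rest
        .iter()
        .position(|&b| !matches!(b, b' ' | b'\r' | b'\n' | b'\t'))?;
    let rest = &rest[start..];

    // The offset runs up to the next line break or the end of the data.
    let end = rest
        .iter()
        .position(|&b| b == b'\n' || b == b'\r')
        .unwrap_or(rest.len());

    std::str::from_utf8(&rest[..end]).ok()?.trim().parse().ok()
}
```

Working on a slice keeps the parsing testable without a `PdfSource`; the caller would pass in the final kilobyte of the file, as the hunk above does with `read_at`.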
@ -133,24 +147,31 @@ fn build_fingerprint_input(
) -> FingerprintInput {
let page_count = pages.len() as u32;
let fingerprint_pages = pages.iter().map(|page| {
PageFingerprintData {
content_streams: page.contents.iter()
.map(|&obj_ref| ContentStreamData::Indirect(obj_ref))
.collect(),
resources: None, // TODO: convert ResourceDict to PdfDict
media_box: page.media_box,
crop_box: page.crop_box,
rotate: page.rotate,
}
}).collect();
let fingerprint_pages = pages
.iter()
.map(|page| {
PageFingerprintData {
content_streams: page
.contents
.iter()
.map(|&obj_ref| ContentStreamData::Indirect(obj_ref))
.collect(),
resources: None, // TODO: convert ResourceDict to PdfDict
media_box: page.media_box,
crop_box: page.crop_box,
rotate: page.rotate,
}
})
.collect();
// Build catalog flags
let catalog_flags = CatalogFlags {
is_encrypted: false, // TODO: detect encryption
contains_javascript: catalog.open_action.is_some() || catalog.aa.is_some(),
contains_xfa: false, // TODO: detect XFA
ocg_present: catalog.oc_properties.as_ref()
ocg_present: catalog
.oc_properties
.as_ref()
.map(|props| props.present)
.unwrap_or(false),
};
@ -186,8 +207,11 @@ pub fn extract_spans_from_page(
// Check page index bounds
if page_index >= pages.len() {
return Err(anyhow!("Page index {} out of bounds (document has {} pages)",
page_index, pages.len()));
return Err(anyhow!(
"Page index {} out of bounds (document has {} pages)",
page_index,
pages.len()
));
}
let page = &pages[page_index];
@ -260,12 +284,11 @@ impl PdfExtractor {
let path = pdf_path.as_ref();
// Open the PDF file
let source = FileSource::open(path)
.context("Failed to open PDF file")?;
let source = FileSource::open(path).context("Failed to open PDF file")?;
// Find the startxref offset
let startxref_offset = find_startxref(&source)
.context("Failed to find startxref offset")?;
let startxref_offset =
find_startxref(&source).context("Failed to find startxref offset")?;
// Load the xref table
let xref_section = load_xref_with_prev_chain(&source, startxref_offset);
@ -274,20 +297,21 @@ impl PdfExtractor {
let resolver = XrefResolver::from_section(xref_section.clone());
// Get the root reference from trailer
let root_ref = xref_section.trailer
let root_ref = xref_section
.trailer
.as_ref()
.and_then(|trailer| trailer.get("Root"))
.and_then(|obj| obj.as_ref())
.ok_or_else(|| anyhow!("No /Root reference in trailer"))?;
// Parse the catalog
let catalog = parse_catalog(&resolver, root_ref)
.map_err(|diagnostics| {
let msg = diagnostics.first()
.map(|d| d.message.as_ref())
.unwrap_or("unknown error");
anyhow!("Failed to parse catalog: {}", msg)
})?;
let catalog = parse_catalog(&resolver, root_ref).map_err(|diagnostics| {
let msg = diagnostics
.first()
.map(|d| d.message.as_ref())
.unwrap_or("unknown error");
anyhow!("Failed to parse catalog: {}", msg)
})?;
// Build fingerprint input (without full page tree for lazy extraction)
let fingerprint = compute_fingerprint_lazy(&catalog, &xref_section);
@ -406,12 +430,17 @@ impl PdfExtractor {
/// This method extracts one page without materializing the entire document.
/// Content streams are decoded and the result is returned.
pub fn extract_page(&self, page_index: usize) -> Result<PageExtraction> {
let pages = self.pages.as_ref()
let pages = self
.pages
.as_ref()
.ok_or_else(|| anyhow!("Pages not materialized. Call materialize_pages() first."))?;
if page_index >= pages.len() {
return Err(anyhow!("Page index {} out of bounds (document has {} pages)",
page_index, pages.len()));
return Err(anyhow!(
"Page index {} out of bounds (document has {} pages)",
page_index,
pages.len()
));
}
let page = &pages[page_index];
@ -489,7 +518,8 @@ impl<'a> Iterator for PageIter<'a> {
match LazyPageIter::new(&self.extractor.resolver, self.extractor.catalog.pages_ref) {
Ok(iter) => self.lazy_iter = Some(iter),
Err(diagnostics) => {
let msg = diagnostics.first()
let msg = diagnostics
.first()
.map(|d| d.message.as_ref())
.unwrap_or("unknown error");
return Some(Err(anyhow!("Failed to create lazy page iterator: {}", msg)));
@ -518,11 +548,16 @@ impl<'a> Iterator for PageIter<'a> {
Some(result)
}
Some(Err(diagnostics)) => {
let msg = diagnostics.first()
let msg = diagnostics
.first()
.map(|d| d.message.as_ref())
.unwrap_or("unknown error");
self.index += 1;
Some(Err(anyhow!("Error extracting page {}: {}", self.index - 1, msg)))
Some(Err(anyhow!(
"Error extracting page {}: {}",
self.index - 1,
msg
)))
}
None => None,
}
@ -547,7 +582,9 @@ pub(crate) fn compute_fingerprint_lazy(catalog: &Catalog, _xref_section: &XrefSe
is_encrypted: false,
contains_javascript: catalog.open_action.is_some() || catalog.aa.is_some(),
contains_xfa: false,
ocg_present: catalog.oc_properties.as_ref()
ocg_present: catalog
.oc_properties
.as_ref()
.map(|props| props.present)
.unwrap_or(false),
},
@ -559,8 +596,8 @@ pub(crate) fn compute_fingerprint_lazy(catalog: &Catalog, _xref_section: &XrefSe
#[cfg(test)]
mod tests {
use super::*;
use std::io::Write;
use std::fs::File;
use std::io::Write;
/// Create a minimal valid PDF for testing.
fn create_minimal_pdf(path: &std::path::Path) -> Result<()> {

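The `startxref` hunk at the top of this file takes the bytes after the keyword, cuts them at the first CR or LF, and parses the rest as a `u64`. Below is a minimal standalone sketch of that step. The function name `parse_startxref_offset` and the `String` error type are illustrative assumptions; the code in the diff uses `anyhow` contexts instead.

```rust
/// Parse the decimal byte offset that follows a `startxref` keyword.
/// `tail` is assumed to start at (or just before) the first digit.
/// Hypothetical helper: the name and error type are not from the commit.
fn parse_startxref_offset(tail: &[u8]) -> Result<u64, String> {
    // The offset ends at the first CR or LF; otherwise use the whole slice.
    let end = tail
        .iter()
        .position(|&b| b == b'\n' || b == b'\r')
        .unwrap_or(tail.len());

    let text = std::str::from_utf8(&tail[..end])
        .map_err(|e| format!("startxref offset is not valid UTF-8: {e}"))?;

    // Surrounding spaces are tolerated, matching the `.trim()` in the diff.
    text.trim()
        .parse::<u64>()
        .map_err(|e| format!("startxref offset is not a valid number: {e}"))
}
```

Calling it on `b"1234\r\n%%EOF\n"` yields `Ok(1234)`, while a non-numeric line yields an `Err`.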
@ -21,8 +21,8 @@
//! images are already binary at scan resolution; rendering at 300 DPI throws away
//! no data but wastes ~9x the CPU.
use crate::options::ExtractionOptions;
use crate::classify::PageContext;
use crate::options::ExtractionOptions;
/// PDF 1.x filter name for image streams.
///
@ -206,10 +206,7 @@ fn compute_median_font_size(font_sizes: &[f32]) -> f32 {
}
// Clamp font sizes to reasonable bounds to prevent outliers
let mut clamped: Vec<f32> = font_sizes
.iter()
.map(|&s| s.clamp(4.0, 72.0))
.collect();
let mut clamped: Vec<f32> = font_sizes.iter().map(|&s| s.clamp(4.0, 72.0)).collect();
// Use nth_element for O(n) median selection
let len = clamped.len();
@ -238,8 +235,14 @@ mod tests {
#[test]
fn test_pdf1_filter_from_name() {
assert_eq!(Pdf1Filter::from_name("JBIG2Decode"), Pdf1Filter::Jbig2Decode);
assert_eq!(Pdf1Filter::from_name("/JBIG2Decode"), Pdf1Filter::Jbig2Decode);
assert_eq!(
Pdf1Filter::from_name("JBIG2Decode"),
Pdf1Filter::Jbig2Decode
);
assert_eq!(
Pdf1Filter::from_name("/JBIG2Decode"),
Pdf1Filter::Jbig2Decode
);
assert_eq!(Pdf1Filter::from_name("DCTDecode"), Pdf1Filter::DctDecode);
assert_eq!(Pdf1Filter::from_name("DCT"), Pdf1Filter::DctDecode);
assert_eq!(Pdf1Filter::from_name("Fl"), Pdf1Filter::FlateDecode);
@ -404,8 +407,8 @@ mod tests {
// With 30 footnotes vs 20 body text, median should be in fine-print range
let mut font_sizes: Vec<f32> = (0..30).map(|_| 6.0).collect(); // footnotes
font_sizes.extend((0..20).map(|_| 10.0)); // body text
// Sorted: 30x 6.0, then 20x 10.0 -> median is at index 25 (0-indexed)
// That's the 26th element, which is 6.0
// Sorted: 30x 6.0, then 20x 10.0 -> median is at index 25 (0-indexed)
// That's the 26th element, which is 6.0
let dpi = select_dpi(&page, &filters, Some(&font_sizes), &options);
assert_eq!(dpi, 400);
}

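The median-selection comment in this file mentions `nth_element`; the closest Rust standard-library equivalent is `select_nth_unstable_by`. Below is a minimal sketch of a clamped median under that assumption. The name `median_font_size` and the `Option` return for empty input are illustrative, since the diff does not show the real function's empty-input handling.

```rust
/// Median of the font sizes after clamping each to [4.0, 72.0].
/// Hypothetical stand-in for `compute_median_font_size`; returning `None`
/// on empty input is an assumption made for this sketch.
fn median_font_size(font_sizes: &[f32]) -> Option<f32> {
    if font_sizes.is_empty() {
        return None;
    }

    // Clamp to the same bounds as the diff so outliers cannot dominate.
    let mut clamped: Vec<f32> = font_sizes.iter().map(|&s| s.clamp(4.0, 72.0)).collect();

    // Partition around the middle index without fully sorting the vector.
    let mid = clamped.len() / 2;
    let (_, median, _) = clamped.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
    Some(*median)
}
```

With 30 sizes of 6.0 and 20 of 10.0, the middle index is 25, which falls in the run of 6.0 values, consistent with the comment in the test hunk above.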
@ -15,7 +15,7 @@
//! - **Resource dicts**: Dictionary keys are sorted lexicographically for
//! deterministic serialization regardless of insertion order
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::lexer::{Lexer, Token};
use std::collections::BTreeMap;
use std::sync::Arc;
@ -355,10 +355,19 @@ pub fn hash_resource_dict_canonical(resources: Option<&PdfDict>) -> [u8; 32] {
if let Some(resources) = resources {
// Namespaces to iterate in lexical order
let namespaces = ["/Font", "/XObject", "/ExtGState", "/ColorSpace", "/Pattern", "/Shading", "/Properties"];
let mut sorted_namespaces: Vec<_> = namespaces.iter().filter_map(|&ns| {
resources.get(ns).and_then(|v| v.as_dict()).map(|d| (ns, d))
}).collect();
let namespaces = [
"/Font",
"/XObject",
"/ExtGState",
"/ColorSpace",
"/Pattern",
"/Shading",
"/Properties",
];
let mut sorted_namespaces: Vec<_> = namespaces
.iter()
.filter_map(|&ns| resources.get(ns).and_then(|v| v.as_dict()).map(|d| (ns, d)))
.collect();
// Sort namespaces lexicographically (they're already mostly sorted, but ensure)
sorted_namespaces.sort_by_key(|&(ns, _)| ns);
@ -416,7 +425,7 @@ mod tests {
// Test edge cases from plan
assert_eq!(canonicalize_f64(0.00005, &mut diags), 0); // 0.5 rounds to even (0)
// Note: 0.00015 * 10000 = 1.4999... due to float representation, so rounds to 1
// Note: 0.00015 * 10000 = 1.4999... due to float representation, so rounds to 1
assert_eq!(canonicalize_f64(0.00015, &mut diags), 1); // 1.4999... rounds to 1
// Test negative banker's rounding
@ -579,7 +588,10 @@ mod tests {
let hash1 = hash_resource_dict_canonical(Some(&resources1));
let hash2 = hash_resource_dict_canonical(Some(&resources2));
assert_eq!(hash1, hash2, "Resource dict hash should be independent of insertion order");
assert_eq!(
hash1, hash2,
"Resource dict hash should be independent of insertion order"
);
}
#[test]

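The tests in this file exercise banker's rounding (round half to even) at four decimal places, with non-finite inputs mapped to 0. Below is a minimal sketch of that rule. The names `round_half_even` and `to_fixed_4dp` are illustrative assumptions; the real `canonicalize_f64` also takes a diagnostics argument that is omitted here.

```rust
/// Round to the nearest integer, sending exact .5 ties to the even neighbour.
/// Non-finite input maps to 0, mirroring the NaN/Inf rule described in the diff.
fn round_half_even(x: f64) -> i64 {
    if !x.is_finite() {
        return 0;
    }
    let floor = x.floor();
    let frac = x - floor; // always in [0, 1)
    let base = floor as i64;
    if frac > 0.5 {
        base + 1
    } else if frac < 0.5 {
        base
    } else if base % 2 == 0 {
        base // tie, lower neighbour is even
    } else {
        base + 1 // tie, upper neighbour is even
    }
}

/// Scale by 10^4 and round, giving a fixed-point value with 4 decimal places.
/// Hypothetical name; not the function from the commit.
fn to_fixed_4dp(value: f64) -> i64 {
    round_half_even(value * 10_000.0)
}
```

As the comment in the hunk above notes, decimal inputs such as 0.00015 are not exactly representable, so the scaled value may land just below a tie and round down rather than to even.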
@ -103,10 +103,18 @@ impl CatalogFlags {
/// Encode the flags into a single byte.
fn encode(&self) -> u8 {
let mut byte = 0u8;
if self.is_encrypted { byte |= 1 << 0; }
if self.contains_javascript { byte |= 1 << 1; }
if self.contains_xfa { byte |= 1 << 2; }
if self.ocg_present { byte |= 1 << 3; }
if self.is_encrypted {
byte |= 1 << 0;
}
if self.contains_javascript {
byte |= 1 << 1;
}
if self.contains_xfa {
byte |= 1 << 2;
}
if self.ocg_present {
byte |= 1 << 3;
}
byte
}
}
@ -193,9 +201,7 @@ fn hash_content_streams(streams: &[ContentStreamData], resolver: &XrefResolver)
_ => Vec::new(),
}
}
ContentStreamData::Direct(bytes) => {
normalize_content_bytes(bytes)
}
ContentStreamData::Direct(bytes) => normalize_content_bytes(bytes),
};
hasher.update(&bytes);
}
@ -409,24 +415,22 @@ fn hash_extgstate(gs_obj: &PdfObject) -> [u8; 32] {
/// - Rotate as 4-byte BE i32
///
/// NaN/Inf values are canonicalized to 0 and emit STRUCT_INVALID_GEOMETRY diagnostics.
fn hash_page_geometry(
media_box: &[f64; 4],
crop_box: Option<&[f64; 4]>,
rotate: i32,
) -> [u8; 32] {
fn hash_page_geometry(media_box: &[f64; 4], crop_box: Option<&[f64; 4]>, rotate: i32) -> [u8; 32] {
let mut hasher = Sha256::new();
let mut diagnostics: Option<Vec<Diagnostic>> = None;
// MediaBox: 4 coordinates, 8 bytes each = 32 bytes
for coord in media_box {
let canonical = crate::fingerprint::canonicalize::canonicalize_f64(*coord, &mut diagnostics);
let canonical =
crate::fingerprint::canonicalize::canonicalize_f64(*coord, &mut diagnostics);
hasher.update(&canonical.to_be_bytes());
}
// CropBox: if present, same format
if let Some(crop) = crop_box {
for coord in crop {
let canonical = crate::fingerprint::canonicalize::canonicalize_f64(*coord, &mut diagnostics);
let canonical =
crate::fingerprint::canonicalize::canonicalize_f64(*coord, &mut diagnostics);
hasher.update(&canonical.to_be_bytes());
}
}
@ -491,11 +495,7 @@ fn hash_structure_tree(struct_ref: ObjRef, resolver: &XrefResolver) -> [u8; 32]
}
/// Recursively hash structure tree elements.
fn hash_structure_elements(
dict: &PdfDict,
hasher: &mut Sha256,
resolver: &XrefResolver,
) {
fn hash_structure_elements(dict: &PdfDict, hasher: &mut Sha256, resolver: &XrefResolver) {
// Extract and hash relevant keys: /S, /Lang, /Alt, /ActualText
let keys_to_hash = ["S", "Lang", "Alt", "ActualText"];
@ -533,7 +533,13 @@ fn hash_structure_elements(
fn serialize_pdf_object_canonical(obj: &PdfObject) -> Vec<u8> {
match obj {
PdfObject::Null => b"null".to_vec(),
PdfObject::Bool(b) => if *b { b"true".to_vec() } else { b"false".to_vec() },
PdfObject::Bool(b) => {
if *b {
b"true".to_vec()
} else {
b"false".to_vec()
}
}
PdfObject::Integer(i) => i.to_string().into_bytes(),
PdfObject::Real(r) => {
// Serialize with consistent precision
@ -578,9 +584,7 @@ fn serialize_pdf_object_canonical(obj: &PdfObject) -> Vec<u8> {
result.extend_from_slice(b" stream");
result
}
PdfObject::Indirect(i) => {
format!("{} {} obj", i.id.object, i.id.generation).into_bytes()
}
PdfObject::Indirect(i) => format!("{} {} obj", i.id.object, i.id.generation).into_bytes(),
}
}
@ -665,7 +669,7 @@ mod tests {
fn test_round_to_fixed_4dp_critical_cases() {
// Test edge cases from plan
assert_eq!(round_to_fixed_4dp(0.00005), 0); // 0.5 rounds to even (0)
// Note: 0.00015 * 10000 = 1.4999... due to float representation, so rounds to 1
// Note: 0.00015 * 10000 = 1.4999... due to float representation, so rounds to 1
assert_eq!(round_to_fixed_4dp(0.00015), 1); // 1.4999... rounds to 1
// Test negative banker's rounding
@ -678,24 +682,42 @@ mod tests {
assert_eq!(serialize_pdf_object_canonical(&PdfObject::Null), b"null");
// Boolean
assert_eq!(serialize_pdf_object_canonical(&PdfObject::Bool(true)), b"true");
assert_eq!(serialize_pdf_object_canonical(&PdfObject::Bool(false)), b"false");
assert_eq!(
serialize_pdf_object_canonical(&PdfObject::Bool(true)),
b"true"
);
assert_eq!(
serialize_pdf_object_canonical(&PdfObject::Bool(false)),
b"false"
);
// Integer
assert_eq!(serialize_pdf_object_canonical(&PdfObject::Integer(42)), b"42");
assert_eq!(
serialize_pdf_object_canonical(&PdfObject::Integer(42)),
b"42"
);
// Real
let real_bytes = serialize_pdf_object_canonical(&PdfObject::Real(3.14159));
assert!(real_bytes.starts_with(b"3.14159"));
// String
assert_eq!(serialize_pdf_object_canonical(&PdfObject::String(Box::new(vec![b'H', b'i']))), b"(Hi)");
assert_eq!(
serialize_pdf_object_canonical(&PdfObject::String(Box::new(vec![b'H', b'i']))),
b"(Hi)"
);
// Escaped string
assert_eq!(serialize_pdf_object_canonical(&PdfObject::String(Box::new(vec![b'(', b')']))), b"(\\(\\))");
assert_eq!(
serialize_pdf_object_canonical(&PdfObject::String(Box::new(vec![b'(', b')']))),
b"(\\(\\))"
);
// Name
assert_eq!(serialize_pdf_object_canonical(&PdfObject::Name(Arc::from("Type"))), b"/Type");
assert_eq!(
serialize_pdf_object_canonical(&PdfObject::Name(Arc::from("Type"))),
b"/Type"
);
// Reference
let ref_obj = PdfObject::Ref(ObjRef::new(42, 0));
@ -830,7 +852,10 @@ mod tests {
let fp1 = compute_fingerprint(&input1, &resolver);
let fp2 = compute_fingerprint(&input2, &resolver);
assert_ne!(fp1, fp2, "Different page counts should produce different fingerprints");
assert_ne!(
fp1, fp2,
"Different page counts should produce different fingerprints"
);
}
#[test]
@ -868,7 +893,10 @@ mod tests {
let fp1 = compute_fingerprint(&input1, &resolver);
let fp2 = compute_fingerprint(&input2, &resolver);
assert_ne!(fp1, fp2, "Different geometry should produce different fingerprints");
assert_ne!(
fp1, fp2,
"Different geometry should produce different fingerprints"
);
}
#[test]
@ -909,7 +937,10 @@ mod tests {
let fp1 = compute_fingerprint(&input1, &resolver);
let fp2 = compute_fingerprint(&input2, &resolver);
assert_ne!(fp1, fp2, "Different catalog flags should produce different fingerprints");
assert_ne!(
fp1, fp2,
"Different catalog flags should produce different fingerprints"
);
}
#[test]
@ -941,7 +972,11 @@ mod tests {
let fingerprint = compute_fingerprint(&input, &resolver);
let regex = Regex::new(r"^pdftract-v1:[0-9a-f]{64}$").unwrap();
assert!(regex.is_match(&fingerprint), "Fingerprint '{}' must match INV-13 format", fingerprint);
assert!(
regex.is_match(&fingerprint),
"Fingerprint '{}' must match INV-13 format",
fingerprint
);
}
#[test]
@ -955,20 +990,26 @@ mod tests {
let resolver = XrefResolver::new();
let input = FingerprintInput {
page_count,
pages: (0..page_count).map(|_| PageFingerprintData {
content_streams: vec![],
resources: None,
media_box: [0.0, 0.0, 612.0, 792.0],
crop_box: None,
rotate: 0,
}).collect(),
pages: (0..page_count)
.map(|_| PageFingerprintData {
content_streams: vec![],
resources: None,
media_box: [0.0, 0.0, 612.0, 792.0],
crop_box: None,
rotate: 0,
})
.collect(),
struct_tree_root_ref: None,
is_tagged: false,
catalog_flags: CatalogFlags::default(),
};
let fingerprint = compute_fingerprint(&input, &resolver);
assert!(regex.is_match(&fingerprint), "Fingerprint '{}' must match INV-13 format", fingerprint);
assert!(
regex.is_match(&fingerprint),
"Fingerprint '{}' must match INV-13 format",
fingerprint
);
}
}
@ -1016,7 +1057,10 @@ mod tests {
let hash1 = hash_resource_dict(Some(&resources1), &resolver);
let hash2 = hash_resource_dict(Some(&resources2), &resolver);
assert_eq!(hash1, hash2, "Resource dict hash should be independent of insertion order");
assert_eq!(
hash1, hash2,
"Resource dict hash should be independent of insertion order"
);
}
#[test]
@ -1029,13 +1073,15 @@ mod tests {
let resolver = XrefResolver::new();
let input = FingerprintInput {
page_count,
pages: (0..page_count).map(|_| PageFingerprintData {
content_streams: vec![],
resources: None,
media_box: [0.0, 0.0, 612.0, 792.0],
crop_box: None,
rotate: 0,
}).collect(),
pages: (0..page_count)
.map(|_| PageFingerprintData {
content_streams: vec![],
resources: None,
media_box: [0.0, 0.0, 612.0, 792.0],
crop_box: None,
rotate: 0,
})
.collect(),
struct_tree_root_ref: None,
is_tagged: false,
catalog_flags: CatalogFlags::default(),
@ -1046,6 +1092,10 @@ mod tests {
let duration = start.elapsed();
// Performance requirement: < 100 ms for 100-page PDF
assert!(duration.as_millis() < 100, "Fingerprint computation for 100-page PDF took {} ms, should be < 100 ms", duration.as_millis());
assert!(
duration.as_millis() < 100,
"Fingerprint computation for 100-page PDF took {} ms, should be < 100 ms",
duration.as_millis()
);
}
}

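`CatalogFlags::encode` in this file packs four booleans into the low bits of one byte. Below is a minimal sketch of the same bit layout, reusing the field names from the diff. The `decode` function is a hypothetical addition for illustration and does not appear in the commit.

```rust
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct CatalogFlags {
    is_encrypted: bool,
    contains_javascript: bool,
    contains_xfa: bool,
    ocg_present: bool,
}

impl CatalogFlags {
    /// Pack the flags into bits 0..=3, in the same order as the diff.
    fn encode(&self) -> u8 {
        (self.is_encrypted as u8)
            | ((self.contains_javascript as u8) << 1)
            | ((self.contains_xfa as u8) << 2)
            | ((self.ocg_present as u8) << 3)
    }

    /// Inverse of `encode`. Hypothetical helper, not part of the commit.
    fn decode(byte: u8) -> Self {
        Self {
            is_encrypted: byte & (1 << 0) != 0,
            contains_javascript: byte & (1 << 1) != 0,
            contains_xfa: byte & (1 << 2) != 0,
            ocg_present: byte & (1 << 3) != 0,
        }
    }
}
```

Because each flag owns a distinct bit, any two flag sets that differ in at least one field encode to different bytes. The "different catalog flags" fingerprint test above relies on that difference reaching the hash.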
@ -106,14 +106,18 @@ fn parse_algorithmic(name: &str) -> Option<char> {
if let Some(rest) = name.strip_prefix("uni") {
// uniXXXX - exactly 4 hex digits
if rest.len() == 4 && rest.chars().all(|c| c.is_ascii_hexdigit()) {
return u32::from_str_radix(rest, 16).ok().and_then(|c| char::from_u32(c));
return u32::from_str_radix(rest, 16)
.ok()
.and_then(|c| char::from_u32(c));
}
}
if let Some(rest) = name.strip_prefix('u') {
// uXXXXXX - up to 6 hex digits
if rest.len() <= 6 && rest.chars().all(|c| c.is_ascii_hexdigit()) {
return u32::from_str_radix(rest, 16).ok().and_then(|c| char::from_u32(c));
return u32::from_str_radix(rest, 16)
.ok()
.and_then(|c| char::from_u32(c));
}
}

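The hunk above resolves algorithmic glyph names of the form `uniXXXX` and `uXXXXXX` to code points. Below is a standalone sketch of that logic that can be run in isolation. The name `parse_uni_name` and the final `None` fallthrough are assumptions, since the diff shows only part of `parse_algorithmic`.

```rust
/// Resolve `uniXXXX` (exactly 4 hex digits) or `uXXXXXX` (up to 6 hex digits)
/// to a `char`. Hypothetical standalone copy of the logic in the hunk above.
fn parse_uni_name(name: &str) -> Option<char> {
    if let Some(rest) = name.strip_prefix("uni") {
        if rest.len() == 4 && rest.chars().all(|c| c.is_ascii_hexdigit()) {
            // char::from_u32 rejects surrogates and values above U+10FFFF.
            return u32::from_str_radix(rest, 16).ok().and_then(char::from_u32);
        }
    }

    if let Some(rest) = name.strip_prefix('u') {
        if rest.len() <= 6 && rest.chars().all(|c| c.is_ascii_hexdigit()) {
            // An empty `rest` fails to parse and therefore yields None.
            return u32::from_str_radix(rest, 16).ok().and_then(char::from_u32);
        }
    }

    None // assumed fallthrough; the real function may try other rules
}
```

In this sketch `uni0041` maps to `A` and `u1F600` to the emoji at U+1F600. `uniD800` yields `None` because a lone surrogate is not a valid `char`.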
@ -275,7 +275,7 @@ mod tests {
fn test_malformed_no_panic() {
// Test various malformed inputs that should not panic
let malformed_inputs: Vec<&[u8]> = vec![
&[0xFF], // Invalid lead byte in Shift-JIS
&[0xFF], // Invalid lead byte in Shift-JIS
&[0x80, 0x80], // Invalid sequence in GB18030
&[0xFE, 0xFF], // Invalid in Big5
&[0xFF, 0xFF], // Invalid in EUC-KR

@ -19,7 +19,7 @@
use std::collections::HashMap;
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::lexer::Lexer;
use crate::parser::lexer::Token;
@ -49,7 +49,9 @@ impl std::fmt::Display for CMapError {
CMapError::UnexpectedToken(msg) => write!(f, "unexpected token: {}", msg),
CMapError::InvalidHexString(msg) => write!(f, "invalid hex string: {}", msg),
CMapError::InvalidRange => write!(f, "invalid range: lo > hi"),
CMapError::ArrayLengthMismatch => write!(f, "bfrange array length does not match range"),
CMapError::ArrayLengthMismatch => {
write!(f, "bfrange array length does not match range")
}
CMapError::MissingKeyword(kw) => write!(f, "missing expected keyword: {}", kw),
CMapError::EmptyCMap => write!(f, "CMap contains no mappings"),
}
@ -686,7 +688,9 @@ mod tests {
assert_eq!(map.len(), 1);
assert!(!diags.is_empty());
assert!(diags.iter().any(|d| d.message.as_ref().contains("odd number of bytes")));
assert!(diags
.iter()
.any(|d| d.message.as_ref().contains("odd number of bytes")));
}
#[test]

@ -6,7 +6,7 @@
use std::sync::Arc;
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::font::FontKind;
use crate::parser::object::types::{PdfDict, PdfObject};
use crate::parser::stream::{decode_stream, ExtractionOptions};
@ -132,9 +132,7 @@ impl OpenTypeMetrics {
.cmap
.map(|cmap| {
// Try to find a valid Unicode subtable
cmap.subtables
.into_iter()
.any(|st| st.is_unicode())
cmap.subtables.into_iter().any(|st| st.is_unicode())
})
.unwrap_or(false);
@ -159,9 +157,7 @@ impl FontMetrics for OpenTypeMetrics {
let face_ref = self.face.as_face_ref();
// Use Face's built-in glyph_index which handles cmap lookup
face_ref
.glyph_index(ch)
.map(|id| id.0)
face_ref.glyph_index(ch).map(|id| id.0)
}
fn advance(&self, glyph_id: u16) -> Option<u16> {
@ -214,12 +210,11 @@ impl Type1Metrics {
pub fn from_descriptor(descriptor: &PdfDict, font_dict: &PdfDict) -> FontResult<Self> {
// Extract /Widths array from font dict
let widths = match font_dict.get("/Widths") {
Some(PdfObject::Array(arr)) => {
arr.iter()
.filter_map(|obj| obj.as_int())
.map(|i| i as u16)
.collect()
}
Some(PdfObject::Array(arr)) => arr
.iter()
.filter_map(|obj| obj.as_int())
.map(|i| i as u16)
.collect(),
_ => return Err(FontError::InvalidFontData("missing /Widths array".into())),
};
@ -445,18 +440,16 @@ impl EmbeddedFont {
}
}
}
FontKind::Type1 => {
match Type1Metrics::from_descriptor(descriptor, font_dict) {
Ok(t1_metrics) => Arc::new(t1_metrics),
Err(e) => {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::FontParseFailed,
format!("Type1 font load failed: {}", e),
));
Arc::new(Type1Metrics::empty())
}
FontKind::Type1 => match Type1Metrics::from_descriptor(descriptor, font_dict) {
Ok(t1_metrics) => Arc::new(t1_metrics),
Err(e) => {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::FontParseFailed,
format!("Type1 font load failed: {}", e),
));
Arc::new(Type1Metrics::empty())
}
}
},
_ => Arc::new(EmptyFontMetrics),
};
@ -543,12 +536,15 @@ mod tests {
fn test_type1_metrics_from_descriptor() {
// Create a FontDescriptor-like dict
let mut descriptor = PdfDict::new();
descriptor.insert(intern("/FontBBox"), PdfObject::Array(Box::new(vec![
PdfObject::Integer(-100),
PdfObject::Integer(-200),
PdfObject::Integer(1000),
PdfObject::Integer(900),
])));
descriptor.insert(
intern("/FontBBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Integer(-100),
PdfObject::Integer(-200),
PdfObject::Integer(1000),
PdfObject::Integer(900),
])),
);
// Create a font dict with /Widths
let mut font_dict = PdfDict::new();
@ -560,7 +556,10 @@ mod tests {
PdfObject::Integer(700),
])),
);
font_dict.insert(intern("/Encoding"), PdfObject::Name(intern("/WinAnsiEncoding")));
font_dict.insert(
intern("/Encoding"),
PdfObject::Name(intern("/WinAnsiEncoding")),
);
let metrics = Type1Metrics::from_descriptor(&descriptor, &font_dict).unwrap();
@ -625,12 +624,15 @@ mod tests {
fn test_embedded_font_load_from_dict() {
// Create a minimal font dict with FontDescriptor
let mut descriptor = PdfDict::new();
descriptor.insert(intern("/FontBBox"), PdfObject::Array(Box::new(vec![
PdfObject::Integer(-100),
PdfObject::Integer(-200),
PdfObject::Integer(1000),
PdfObject::Integer(900),
])));
descriptor.insert(
intern("/FontBBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Integer(-100),
PdfObject::Integer(-200),
PdfObject::Integer(1000),
PdfObject::Integer(900),
])),
);
// For this test, we'll use a Type1-style descriptor without a stream
// to test the fallback path
@ -679,7 +681,7 @@ mod tests {
// Uncommon characters might not be in the base font
// (This depends on the specific fixture)
let result = metrics.glyph_id_for('\u{1F600}'); // Emoji
// May or may not be present, but shouldn't panic
// May or may not be present, but shouldn't panic
let _ = result;
}
@ -700,16 +702,32 @@ mod tests {
// Test common Latin characters
for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789".chars() {
let gid = metrics.glyph_id_for(ch);
assert!(gid.is_some(), "Character '{}' should be mapped in Latin font", ch);
assert!(
gid.is_some(),
"Character '{}' should be mapped in Latin font",
ch
);
// Verify advance width exists for mapped glyphs
let advance = metrics.advance(gid.unwrap());
assert!(advance.is_some(), "Advance should exist for glyph ID {}", gid.unwrap());
assert!(advance.unwrap() > 0, "Advance should be positive for glyph ID {}", gid.unwrap());
assert!(
advance.is_some(),
"Advance should exist for glyph ID {}",
gid.unwrap()
);
assert!(
advance.unwrap() > 0,
"Advance should be positive for glyph ID {}",
gid.unwrap()
);
// Verify bbox exists
let bbox = metrics.bbox(gid.unwrap());
assert!(bbox.is_some(), "Bbox should exist for glyph ID {}", gid.unwrap());
assert!(
bbox.is_some(),
"Bbox should exist for glyph ID {}",
gid.unwrap()
);
}
}
@ -733,7 +751,10 @@ mod tests {
// Verify that advance widths are in font units (less than UPEM for typical glyphs)
let gid_a = metrics.glyph_id_for('A').unwrap();
let advance_a = metrics.advance(gid_a).unwrap();
assert!(advance_a <= upem, "Advance should be in font units (≤ UPEM)");
assert!(
advance_a <= upem,
"Advance should be in font units (≤ UPEM)"
);
}
#[test]
@ -750,7 +771,10 @@ mod tests {
// The error should be InvalidFontData
match result {
Err(FontError::InvalidFontData(msg)) => {
assert!(msg.contains("ttf-parser error"), "Error should mention ttf-parser");
assert!(
msg.contains("ttf-parser error"),
"Error should mention ttf-parser"
);
}
_ => panic!("Expected InvalidFontData error"),
}
@ -782,12 +806,15 @@ mod tests {
// Acceptance criteria: Type1 font program: gracefully wrap with limited
// capability; do not crash on missing CharStrings parser.
let mut descriptor = PdfDict::new();
descriptor.insert(intern("/FontBBox"), PdfObject::Array(Box::new(vec![
PdfObject::Integer(-100),
PdfObject::Integer(-200),
PdfObject::Integer(1000),
PdfObject::Integer(900),
])));
descriptor.insert(
intern("/FontBBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Integer(-100),
PdfObject::Integer(-200),
PdfObject::Integer(1000),
PdfObject::Integer(900),
])),
);
let mut font_dict = PdfDict::new();
font_dict.insert(intern("/Subtype"), PdfObject::Name(intern("/Type1")));
@ -832,19 +859,25 @@ mod tests {
let metrics = OpenTypeMetrics::from_data(font_data, 0).unwrap();
// DejaVuSans has a Unicode cmap
assert!(metrics.has_valid_cmap(), "DejaVuSans should have valid Unicode cmap");
assert!(
metrics.has_valid_cmap(),
"DejaVuSans should have valid Unicode cmap"
);
}
#[test]
fn test_embedded_font_returns_diagnostics() {
// Verify that EmbeddedFont collects and returns diagnostics
let mut descriptor = PdfDict::new();
descriptor.insert(intern("/FontBBox"), PdfObject::Array(Box::new(vec![
PdfObject::Integer(0),
PdfObject::Integer(0),
PdfObject::Integer(1000),
PdfObject::Integer(1000),
])));
descriptor.insert(
intern("/FontBBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Integer(0),
PdfObject::Integer(0),
PdfObject::Integer(1000),
PdfObject::Integer(1000),
])),
);
let mut font_dict = PdfDict::new();
font_dict.insert(intern("/Subtype"), PdfObject::Name(intern("/Type1")));

@ -14,7 +14,7 @@
use std::sync::Arc;
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::object::types::{PdfObject, PdfDict};
use crate::parser::object::types::{PdfDict, PdfObject};
include!(concat!(env!("OUT_DIR"), "/named_encodings.rs"));
@ -135,7 +135,9 @@ pub struct DifferencesOverlay {
impl DifferencesOverlay {
/// Create an empty overlay.
pub fn new() -> Self {
Self { entries: Vec::new() }
Self {
entries: Vec::new(),
}
}
/// Parse a /Differences array into an overlay.
@ -344,7 +346,8 @@ impl FontEncoding {
}
// Fall back to base encoding
self.base.and_then(|enc| enc.glyph_name(code).map(|s| Arc::from(s)))
self.base
.and_then(|enc| enc.glyph_name(code).map(|s| Arc::from(s)))
}
/// Check if this encoding has a differences overlay.
@ -388,15 +391,36 @@ mod tests {
#[test]
fn test_from_name() {
assert_eq!(NamedEncoding::from_name("WinAnsiEncoding"), Some(NamedEncoding::WinAnsi));
assert_eq!(NamedEncoding::from_name("MacRomanEncoding"), Some(NamedEncoding::MacRoman));
assert_eq!(NamedEncoding::from_name("MacExpertEncoding"), Some(NamedEncoding::MacExpert));
assert_eq!(NamedEncoding::from_name("StandardEncoding"), Some(NamedEncoding::Standard));
assert_eq!(NamedEncoding::from_name("SymbolEncoding"), Some(NamedEncoding::Symbol));
assert_eq!(NamedEncoding::from_name("ZapfDingbatsEncoding"), Some(NamedEncoding::ZapfDingbats));
assert_eq!(
NamedEncoding::from_name("WinAnsiEncoding"),
Some(NamedEncoding::WinAnsi)
);
assert_eq!(
NamedEncoding::from_name("MacRomanEncoding"),
Some(NamedEncoding::MacRoman)
);
assert_eq!(
NamedEncoding::from_name("MacExpertEncoding"),
Some(NamedEncoding::MacExpert)
);
assert_eq!(
NamedEncoding::from_name("StandardEncoding"),
Some(NamedEncoding::Standard)
);
assert_eq!(
NamedEncoding::from_name("SymbolEncoding"),
Some(NamedEncoding::Symbol)
);
assert_eq!(
NamedEncoding::from_name("ZapfDingbatsEncoding"),
Some(NamedEncoding::ZapfDingbats)
);
// Test with leading slash
assert_eq!(NamedEncoding::from_name("/WinAnsiEncoding"), Some(NamedEncoding::WinAnsi));
assert_eq!(
NamedEncoding::from_name("/WinAnsiEncoding"),
Some(NamedEncoding::WinAnsi)
);
// Test unknown encoding
assert_eq!(NamedEncoding::from_name("UnknownEncoding"), None);
@ -513,7 +537,10 @@ mod tests {
assert_eq!(overlay.get(255), Some(Arc::from("a")));
assert_eq!(diagnostics.len(), 1);
assert_eq!(diagnostics[0].code, DiagCode::FontEncodingDifferenceOutOfRange);
assert_eq!(
diagnostics[0].code,
DiagCode::FontEncodingDifferenceOutOfRange
);
}
#[test]
@ -529,7 +556,10 @@ mod tests {
assert_eq!(overlay.get(0), Some(Arc::from("a")));
assert_eq!(diagnostics.len(), 1);
assert_eq!(diagnostics[0].code, DiagCode::FontEncodingDifferenceOutOfRange);
assert_eq!(
diagnostics[0].code,
DiagCode::FontEncodingDifferenceOutOfRange
);
}
#[test]
@ -602,7 +632,9 @@ mod tests {
fn test_font_encoding_unknown_glyph_name() {
// Differences can contain arbitrary glyph names not in AGL
let mut differences = DifferencesOverlay::new();
differences.entries.push((0x20, Arc::from("ArbitraryCustomGlyph")));
differences
.entries
.push((0x20, Arc::from("ArbitraryCustomGlyph")));
let enc = FontEncoding {
base: None,
@ -610,7 +642,10 @@ mod tests {
};
// Should return the custom name, not None
assert_eq!(enc.glyph_name_for(0x20), Some(Arc::from("ArbitraryCustomGlyph")));
assert_eq!(
enc.glyph_name_for(0x20),
Some(Arc::from("ArbitraryCustomGlyph"))
);
}
#[test]

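`FontEncoding::glyph_name_for` in this file consults the `/Differences` overlay first and falls back to the base encoding only when the overlay has no entry. Below is a minimal sketch of that lookup order. `OverlayEncoding` and its `HashMap` base table are hypothetical stand-ins for the real `FontEncoding` and `NamedEncoding` types. Taking the first overlay entry for a code is also an assumption.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for `FontEncoding`: a base table plus an overlay.
struct OverlayEncoding {
    base: Option<HashMap<u8, &'static str>>,
    differences: Vec<(u8, Arc<str>)>,
}

impl OverlayEncoding {
    fn glyph_name_for(&self, code: u8) -> Option<Arc<str>> {
        // The overlay wins, even for names the base encoding does not know.
        if let Some((_, name)) = self.differences.iter().find(|(c, _)| *c == code) {
            return Some(Arc::clone(name));
        }

        // Fall back to the base encoding, if there is one.
        self.base
            .as_ref()
            .and_then(|table| table.get(&code).map(|s| Arc::from(*s)))
    }
}
```

A linear scan keeps the sketch simple and is adequate for the at most 256 codes of a simple font.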
@ -56,9 +56,7 @@ impl FontFingerprint {
let mut hasher = Sha256::new();
hasher.update(font_program_bytes);
let hash = hasher.finalize();
Self {
hash: hash.into(),
}
Self { hash: hash.into() }
}
/// Get the underlying hash bytes.
@ -90,10 +88,7 @@ impl FontFingerprint {
///
/// The hash is computed on the first call and cached in an Arc for subsequent
/// calls. Do NOT call this function repeatedly for the same font without caching.
pub fn lookup_font_fingerprint(
font_program_bytes: &[u8],
gid: u16,
) -> Option<char> {
pub fn lookup_font_fingerprint(font_program_bytes: &[u8], gid: u16) -> Option<char> {
// Compute the fingerprint
let fingerprint = FontFingerprint::compute(font_program_bytes);
@ -101,7 +96,8 @@ pub fn lookup_font_fingerprint(
let entries = FONT_FINGERPRINTS.get(fingerprint.as_bytes())?;
// Find the glyph ID in the entries
let codepoint = entries.iter()
let codepoint = entries
.iter()
.find(|(entry_gid, _)| *entry_gid == gid)
.map(|(_, cp)| *cp)?;
@ -146,7 +142,8 @@ impl CachedFingerprint {
}
let entries = FONT_FINGERPRINTS.get(self.fingerprint.as_bytes())?;
let codepoint = entries.iter()
let codepoint = entries
.iter()
.find(|(entry_gid, _)| *entry_gid == gid)
.map(|(_, cp)| *cp)?;
@ -216,7 +213,10 @@ mod tests {
let cached1 = CachedFingerprint::from_font_program(data);
let cached2 = CachedFingerprint::from_font_program(data);
assert_eq!(cached1.fingerprint().as_bytes(), cached2.fingerprint().as_bytes());
assert_eq!(
cached1.fingerprint().as_bytes(),
cached2.fingerprint().as_bytes()
);
assert_eq!(cached1.is_known(), cached2.is_known());
}

@ -40,7 +40,11 @@ pub enum CharacterCollection {
impl PredefinedCMap {
/// Create a new predefined CMap.
const fn new(name: &'static str, is_vertical: bool, collection: Option<CharacterCollection>) -> Self {
const fn new(
name: &'static str,
is_vertical: bool,
collection: Option<CharacterCollection>,
) -> Self {
Self {
name,
is_vertical,
@ -172,20 +176,52 @@ pub fn from_name(name: &str) -> Option<PredefinedCMap> {
"Identity-V" => Some(PredefinedCMap::new("Identity-V", true, None)),
// Adobe-Japan1 (Japanese)
"UniJIS-UTF16-H" => Some(PredefinedCMap::new("UniJIS-UTF16-H", false, Some(CharacterCollection::Japan1))),
"UniJIS-UTF16-V" => Some(PredefinedCMap::new("UniJIS-UTF16-V", true, Some(CharacterCollection::Japan1))),
"UniJIS-UTF16-H" => Some(PredefinedCMap::new(
"UniJIS-UTF16-H",
false,
Some(CharacterCollection::Japan1),
)),
"UniJIS-UTF16-V" => Some(PredefinedCMap::new(
"UniJIS-UTF16-V",
true,
Some(CharacterCollection::Japan1),
)),
// Adobe-GB1 (Simplified Chinese)
"UniGB-UTF16-H" => Some(PredefinedCMap::new("UniGB-UTF16-H", false, Some(CharacterCollection::GB1))),
"UniGB-UTF16-V" => Some(PredefinedCMap::new("UniGB-UTF16-V", true, Some(CharacterCollection::GB1))),
"UniGB-UTF16-H" => Some(PredefinedCMap::new(
"UniGB-UTF16-H",
false,
Some(CharacterCollection::GB1),
)),
"UniGB-UTF16-V" => Some(PredefinedCMap::new(
"UniGB-UTF16-V",
true,
Some(CharacterCollection::GB1),
)),
// Adobe-CNS1 (Traditional Chinese)
"UniCNS-UTF16-H" => Some(PredefinedCMap::new("UniCNS-UTF16-H", false, Some(CharacterCollection::CNS1))),
"UniCNS-UTF16-V" => Some(PredefinedCMap::new("UniCNS-UTF16-V", true, Some(CharacterCollection::CNS1))),
"UniCNS-UTF16-H" => Some(PredefinedCMap::new(
"UniCNS-UTF16-H",
false,
Some(CharacterCollection::CNS1),
)),
"UniCNS-UTF16-V" => Some(PredefinedCMap::new(
"UniCNS-UTF16-V",
true,
Some(CharacterCollection::CNS1),
)),
// Adobe-Korea1 (Korean)
"UniKS-UTF16-H" => Some(PredefinedCMap::new("UniKS-UTF16-H", false, Some(CharacterCollection::Korea1))),
"UniKS-UTF16-V" => Some(PredefinedCMap::new("UniKS-UTF16-V", true, Some(CharacterCollection::Korea1))),
"UniKS-UTF16-H" => Some(PredefinedCMap::new(
"UniKS-UTF16-H",
false,
Some(CharacterCollection::Korea1),
)),
"UniKS-UTF16-V" => Some(PredefinedCMap::new(
"UniKS-UTF16-V",
true,
Some(CharacterCollection::Korea1),
)),
_ => None,
}
@ -318,11 +354,16 @@ mod tests {
fn test_all_predefined_names() {
// Verify all 10 predefined CMap names resolve
let names = [
"Identity-H", "Identity-V",
"UniJIS-UTF16-H", "UniJIS-UTF16-V",
"UniGB-UTF16-H", "UniGB-UTF16-V",
"UniCNS-UTF16-H", "UniCNS-UTF16-V",
"UniKS-UTF16-H", "UniKS-UTF16-V",
"Identity-H",
"Identity-V",
"UniJIS-UTF16-H",
"UniJIS-UTF16-V",
"UniGB-UTF16-H",
"UniGB-UTF16-V",
"UniCNS-UTF16-H",
"UniCNS-UTF16-V",
"UniKS-UTF16-H",
"UniKS-UTF16-V",
];
for name in names {


@ -7,7 +7,7 @@
use std::collections::BTreeMap;
use std::sync::Arc;
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::font::embedded::{EmbeddedFont, OpenTypeMetrics};
use crate::font::FontKind;
use crate::parser::object::types::{PdfDict, PdfObject};
@ -230,7 +230,13 @@ impl Type0Font {
// Load CIDToGIDMap for CIDFontType2
let cid_to_gid_map = if subtype == FontKind::CIDFontType2 {
Some(Self::load_cid_to_gid_map(cidfont_dict, source, opts, doc_counter, &mut diagnostics)?)
Some(Self::load_cid_to_gid_map(
cidfont_dict,
source,
opts,
doc_counter,
&mut diagnostics,
)?)
} else {
None
};
@ -432,8 +438,12 @@ impl Type0Font {
font_dict.insert(
crate::parser::object::types::intern("/Subtype"),
match subtype {
FontKind::CIDFontType0 => PdfObject::Name(crate::parser::object::types::intern("/CIDFontType0")),
FontKind::CIDFontType2 => PdfObject::Name(crate::parser::object::types::intern("/CIDFontType2")),
FontKind::CIDFontType0 => {
PdfObject::Name(crate::parser::object::types::intern("/CIDFontType0"))
}
FontKind::CIDFontType2 => {
PdfObject::Name(crate::parser::object::types::intern("/CIDFontType2"))
}
_ => return Err(Type0Error::UnsupportedSubtype(format!("{:?}", subtype))),
},
);
@ -716,9 +726,7 @@ mod tests {
font_dict.insert(intern("/BaseFont"), PdfObject::Name(intern("Type0Font")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let source = MemorySource::new(vec![]);
@ -745,9 +753,7 @@ mod tests {
font_dict.insert(intern("/Subtype"), PdfObject::Name(intern("/Type0")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let source = MemorySource::new(vec![]);
@ -781,9 +787,7 @@ mod tests {
font_dict.insert(intern("/Subtype"), PdfObject::Name(intern("/Type0")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let source = MemorySource::new(vec![]);
@ -809,9 +813,7 @@ mod tests {
font_dict.insert(intern("/Subtype"), PdfObject::Name(intern("/Type0")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let source = MemorySource::new(vec![]);
@ -880,9 +882,7 @@ mod tests {
font_dict.insert(intern("/Subtype"), PdfObject::Name(intern("/Type0")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let source = MemorySource::new(vec![]);
@ -917,9 +917,7 @@ mod tests {
font_dict.insert(intern("/Subtype"), PdfObject::Name(intern("/Type0")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let source = MemorySource::new(vec![]);
@ -947,9 +945,7 @@ mod tests {
font_dict.insert(intern("/Subtype"), PdfObject::Name(intern("/Type0")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let source = MemorySource::new(vec![]);
@ -996,9 +992,7 @@ mod tests {
font_dict.insert(intern("/BaseFont"), PdfObject::Name(intern("Type0Font")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let opts = ExtractionOptions::default();
@ -1057,9 +1051,7 @@ mod tests {
font_dict.insert(intern("/BaseFont"), PdfObject::Name(intern("Type0Font")));
font_dict.insert(
intern("/DescendantFonts"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(
cidfont_dict,
))])),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(cidfont_dict))])),
);
let opts = ExtractionOptions::default();
@ -1073,7 +1065,9 @@ mod tests {
// Check that the CIDTOGIDMAP_TRUNCATED diagnostic was emitted
let diagnostics = font.diagnostics();
assert!(diagnostics.iter().any(|d| d.code == DiagCode::FontCidtogidmapTruncated));
assert!(diagnostics
.iter()
.any(|d| d.code == DiagCode::FontCidtogidmapTruncated));
// Verify the array has 2 elements (5 bytes / 2 = 2 GIDs, trailing byte discarded)
if let Some(CIDToGIDMap::Array(arr)) = &font.descendant.cid_to_gid_map {
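
The truncation case tested above (5 bytes yielding 2 GIDs, with the trailing byte discarded) can be sketched as follows. This is a minimal illustration, not the crate's `load_cid_to_gid_map`: the function name `decode_cid_to_gid` and its return shape are hypothetical, and it assumes the stream is a flat sequence of big-endian 2-byte glyph IDs indexed by CID.

```rust
/// Hypothetical helper: decode a raw CIDToGIDMap byte stream into glyph IDs.
///
/// Assumes each CID maps to one big-endian u16 GID. Returns the decoded GIDs
/// and a flag that is true when an odd trailing byte had to be discarded
/// (the situation the FontCidtogidmapTruncated diagnostic reports).
fn decode_cid_to_gid(bytes: &[u8]) -> (Vec<u16>, bool) {
    let pairs = bytes.chunks_exact(2);
    // `remainder()` holds the bytes that did not fill a complete pair.
    let truncated = !pairs.remainder().is_empty();
    let gids = pairs
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    (gids, truncated)
}
```

A caller would presumably push the diagnostic when the flag is set and keep the decoded prefix, which matches the behaviour the test asserts.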


@ -14,7 +14,7 @@
//! x' = a*x + c*y + e
//! y' = b*x + d*y + f
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
/// Maximum depth of graphics state stack (prevents stack overflow).
const MAX_GSTATE_DEPTH: usize = 32;
@ -73,8 +73,12 @@ impl Matrix3x3 {
/// Check if this is the identity matrix.
#[inline]
pub fn is_identity(&self) -> bool {
self.a == 1.0 && self.b == 0.0 && self.c == 0.0 &&
self.d == 1.0 && self.e == 0.0 && self.f == 0.0
self.a == 1.0
&& self.b == 0.0
&& self.c == 0.0
&& self.d == 1.0
&& self.e == 0.0
&& self.f == 0.0
}
/// Multiply this matrix by another (this * other).
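
The module header gives the point transform as `x' = a*x + c*y + e` and `y' = b*x + d*y + f`. A small sketch of that formula is shown below. The `Affine` type and its `apply` method are hypothetical stand-ins written for illustration; they are not the crate's `Matrix3x3` API.

```rust
/// Hypothetical stand-in for a PDF-style affine matrix [a b c d e f].
#[derive(Debug, Clone, Copy, PartialEq)]
struct Affine {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
    e: f64,
    f: f64,
}

impl Affine {
    /// The identity transform leaves every point unchanged.
    const IDENTITY: Affine = Affine { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Apply x' = a*x + c*y + e and y' = b*x + d*y + f.
    fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (
            self.a * x + self.c * y + self.e,
            self.b * x + self.d * y + self.f,
        )
    }
}
```

With a scale of (2, 3) and a translation of (10, 20), the point (1, 1) maps to (12, 23).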


@ -22,7 +22,7 @@
//!
//! IoU = area(A ∩ B) / area(A ∪ B)
use crate::classify::{CellIndex, PageClassification, PageClass};
use crate::classify::{CellIndex, PageClass, PageClassification};
use image::{GrayImage, ImageBuffer, Luma};
use std::collections::BTreeSet;
@ -42,13 +42,15 @@ pub struct Span {
pub text: String,
}
/// Source of a span - either vector extraction or OCR.
/// Source of a span - either vector extraction, OCR, or assisted OCR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpanSource {
/// Text extracted from content stream (Phase 3).
Vector,
/// Text extracted via OCR (Phase 5).
Ocr,
/// Text extracted via assisted OCR with position validation (Phase 5.5).
OcrAssisted,
}
impl Span {
@ -72,6 +74,11 @@ impl Span {
Self::new(bbox, confidence, SpanSource::Ocr, text)
}
/// Create a span with assisted OCR source (position-validated).
pub fn ocr_assisted(bbox: [f64; 4], confidence: f32, text: String) -> Self {
Self::new(bbox, confidence, SpanSource::OcrAssisted, text)
}
/// Get the width of the span's bbox.
#[inline]
pub fn width(&self) -> f64 {
@ -191,11 +198,15 @@ pub fn merge_vector_and_ocr_spans(vector_spans: &[Span], ocr_spans: &[Span]) ->
// Primary sort: Y (top to bottom = descending Y in PDF coordinates)
// Note: In PDF coordinates, Y=0 is at the bottom, so higher Y means higher on page
b_center_y.partial_cmp(&a_center_y).unwrap_or(std::cmp::Ordering::Equal)
b_center_y
.partial_cmp(&a_center_y)
.unwrap_or(std::cmp::Ordering::Equal)
.then_with(|| {
let a_center_x = (a.bbox[0] + a.bbox[2]) / 2.0;
let b_center_x = (b.bbox[0] + b.bbox[2]) / 2.0;
a_center_x.partial_cmp(&b_center_x).unwrap_or(std::cmp::Ordering::Equal)
a_center_x
.partial_cmp(&b_center_x)
.unwrap_or(std::cmp::Ordering::Equal)
})
});
@ -279,11 +290,10 @@ pub fn get_hybrid_cells(classification: &PageClassification) -> Vec<CellIndex> {
}
match &classification.hybrid_cells {
Some(cells) => {
cells.iter()
.map(|&flat| CellIndex::from_flat(flat))
.collect()
}
Some(cells) => cells
.iter()
.map(|&flat| CellIndex::from_flat(flat))
.collect(),
None => Vec::new(),
}
}
@ -323,7 +333,8 @@ pub fn compute_cell_crops(
let cell_width = page_width / 8.0;
let cell_height = page_height / 8.0;
cells.iter()
cells
.iter()
.map(|cell| {
// Cell coordinates in PDF space
// col 0 = left, row 0 = top
@ -357,7 +368,12 @@ pub trait OcrCallback: Send + Sync {
/// # Returns
///
/// A vector of OCR spans found in this cell, or an error if OCR fails.
fn ocr_cell(&self, cell_image: &GrayImage, cell: CellIndex, dpi: u32) -> Result<Vec<Span>, String>;
fn ocr_cell(
&self,
cell_image: &GrayImage,
cell: CellIndex,
dpi: u32,
) -> Result<Vec<Span>, String>;
}
/// Mock OCR callback for testing that tracks call counts.
@ -369,8 +385,14 @@ struct MockOcrCallback {
#[cfg(test)]
impl OcrCallback for MockOcrCallback {
fn ocr_cell(&self, _cell_image: &GrayImage, _cell: CellIndex, _dpi: u32) -> Result<Vec<Span>, String> {
self.call_count.fetch_add(1, std::sync::atomic::Ordering::SeqCst);
fn ocr_cell(
&self,
_cell_image: &GrayImage,
_cell: CellIndex,
_dpi: u32,
) -> Result<Vec<Span>, String> {
self.call_count
.fetch_add(1, std::sync::atomic::Ordering::SeqCst);
Ok(self.output_spans.clone())
}
}
@ -441,13 +463,7 @@ pub fn process_hybrid_page(
// For each hybrid cell: crop and run OCR
for cell in hybrid_cells {
// Crop the cell from the rendered page
let cell_image = crop_cell_from_page(
page_image,
page_width_pt,
page_height_pt,
cell,
dpi,
);
let cell_image = crop_cell_from_page(page_image, page_width_pt, page_height_pt, cell, dpi);
// Run OCR on this cell
match ocr_callback.ocr_cell(&cell_image, cell, dpi) {
@ -510,7 +526,12 @@ mod tests {
#[test]
fn test_span_new() {
let span = Span::new([10.0, 20.0, 50.0, 40.0], 0.9, SpanSource::Vector, "test".to_string());
let span = Span::new(
[10.0, 20.0, 50.0, 40.0],
0.9,
SpanSource::Vector,
"test".to_string(),
);
assert_eq!(span.bbox, [10.0, 20.0, 50.0, 40.0]);
assert_eq!(span.confidence, 0.9);
assert_eq!(span.source, SpanSource::Vector);
@ -541,12 +562,12 @@ mod tests {
#[test]
fn test_merge_no_overlap() {
let vector = vec![
Span::vector([0.0, 0.0, 10.0, 10.0], 0.9, "vector".to_string()),
];
let ocr = vec![
Span::ocr([20.0, 20.0, 30.0, 30.0], 0.8, "ocr".to_string()),
];
let vector = vec![Span::vector(
[0.0, 0.0, 10.0, 10.0],
0.9,
"vector".to_string(),
)];
let ocr = vec![Span::ocr([20.0, 20.0, 30.0, 30.0], 0.8, "ocr".to_string())];
let result = merge_vector_and_ocr_spans(&vector, &ocr);
assert_eq!(result.len(), 2);
@ -555,9 +576,11 @@ mod tests {
#[test]
fn test_merge_iou_06_vector_kept() {
// IoU = 0.6 > 0.5, vector confidence >= 0.5 -> vector kept, OCR dropped
let vector = vec![
Span::vector([0.0, 0.0, 100.0, 100.0], 0.9, "vector text".to_string()),
];
let vector = vec![Span::vector(
[0.0, 0.0, 100.0, 100.0],
0.9,
"vector text".to_string(),
)];
let ocr = vec![
// OCR overlaps by 60%: intersection 60x100 = 6000, union (10000 + 6000 - 6000) = 10000
// bbox [40, 0, 100, 100] overlaps [0, 0, 100, 100] by 60x100
@ -573,9 +596,11 @@ mod tests {
#[test]
fn test_merge_iou_03_both_kept() {
// IoU = 0.3 < 0.5 -> both kept
let vector = vec![
Span::vector([0.0, 0.0, 100.0, 100.0], 0.9, "vector".to_string()),
];
let vector = vec![Span::vector(
[0.0, 0.0, 100.0, 100.0],
0.9,
"vector".to_string(),
)];
let ocr = vec![
// OCR overlaps by 30%: [70, 0, 100, 100] overlaps [0, 0, 100, 100] by 30x100
Span::ocr([70.0, 0.0, 100.0, 100.0], 0.7, "ocr".to_string()),
@ -591,16 +616,20 @@ mod tests {
#[test]
fn test_merge_iou_06_low_vector_confidence_ocr_kept() {
// IoU = 0.6 > 0.5, but vector confidence < 0.5 -> OCR kept
let vector = vec![
Span::vector([0.0, 0.0, 100.0, 100.0], 0.2, "bad vector".to_string()),
];
let ocr = vec![
Span::ocr([40.0, 0.0, 100.0, 100.0], 0.7, "ocr text".to_string()),
];
let vector = vec![Span::vector(
[0.0, 0.0, 100.0, 100.0],
0.2,
"bad vector".to_string(),
)];
let ocr = vec![Span::ocr(
[40.0, 0.0, 100.0, 100.0],
0.7,
"ocr text".to_string(),
)];
let result = merge_vector_and_ocr_spans(&vector, &ocr);
assert_eq!(result.len(), 2); // Both kept because vector confidence is low
// Verify both are present
// Verify both are present
assert!(result.iter().any(|s| s.source == SpanSource::Vector));
assert!(result.iter().any(|s| s.source == SpanSource::Ocr));
}
@ -621,10 +650,7 @@ mod tests {
#[test]
fn test_get_hybrid_cells_non_hybrid() {
let classification = PageClassification::new(
crate::classify::PageClass::Vector,
0.9,
);
let classification = PageClassification::new(crate::classify::PageClass::Vector, 0.9);
assert!(get_hybrid_cells(&classification).is_empty());
}
@ -648,7 +674,7 @@ mod tests {
#[test]
fn test_compute_cell_crops() {
let mut cells = BTreeSet::new();
cells.insert(0); // row 0, col 0 (top-left)
cells.insert(0); // row 0, col 0 (top-left)
cells.insert(63); // row 7, col 7 (bottom-right)
let classification = PageClassification::hybrid(0.75, cells);
@ -691,7 +717,7 @@ mod tests {
// Cell should be 1/8 of page dimensions
assert_eq!(cell.width(), 100); // 800 / 8
assert_eq!(cell.height(), 75); // 600 / 8
assert_eq!(cell.height(), 75); // 600 / 8
}
#[test]
@ -712,9 +738,11 @@ mod tests {
#[test]
fn test_merge_multiple_ocr_spans() {
let vector = vec![
Span::vector([0.0, 0.0, 100.0, 100.0], 0.9, "vector".to_string()),
];
let vector = vec![Span::vector(
[0.0, 0.0, 100.0, 100.0],
0.9,
"vector".to_string(),
)];
let ocr = vec![
Span::ocr([200.0, 0.0, 300.0, 100.0], 0.8, "ocr1".to_string()),
Span::ocr([400.0, 0.0, 500.0, 100.0], 0.8, "ocr2".to_string()),
@ -756,7 +784,11 @@ mod tests {
// Create mock OCR callback that tracks call count
let call_count = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));
let mock_spans = vec![
Span::ocr([50.0, 100.0, 200.0, 120.0], 0.8, "Scanned Text 1".to_string()),
Span::ocr(
[50.0, 100.0, 200.0, 120.0],
0.8,
"Scanned Text 1".to_string(),
),
Span::ocr([50.0, 50.0, 200.0, 70.0], 0.8, "Scanned Text 2".to_string()),
];
let mock_ocr = MockOcrCallback {
@ -780,8 +812,11 @@ mod tests {
// Verify OCR was called exactly 48 times (6 rows * 8 cols)
// NOT 64 times (full page)
assert_eq!(call_count.load(std::sync::atomic::Ordering::SeqCst), 48,
"OCR should run only on scanned cells (48), not entire page (64)");
assert_eq!(
call_count.load(std::sync::atomic::Ordering::SeqCst),
48,
"OCR should run only on scanned cells (48), not entire page (64)"
);
// Verify result contains both vector and OCR spans
assert!(result.iter().any(|s| s.source == SpanSource::Vector));
@ -806,9 +841,11 @@ mod tests {
let classification = PageClassification::hybrid(0.75, cells);
// Create vector spans that overlap with OCR region
let vector_spans = vec![
Span::vector([50.0, 50.0, 150.0, 70.0], 0.9, "Vector Text".to_string()),
];
let vector_spans = vec![Span::vector(
[50.0, 50.0, 150.0, 70.0],
0.9,
"Vector Text".to_string(),
)];
// Create mock OCR that produces overlapping text (IoU > 0.5)
// OCR bbox [40, 40, 160, 80] overlaps vector bbox [50, 50, 150, 70]
@ -820,9 +857,11 @@ mod tests {
// Intersection = [50, 50, 150, 70] = 100 * 20 = 2000
// Union = (110*30) + (100*20) - 2000 = 3300 + 2000 - 2000 = 3300
// IoU = 2000 / 3300 = 0.606 > 0.5
let mock_spans = vec![
Span::ocr([45.0, 45.0, 155.0, 75.0], 0.7, "OCR Text".to_string()),
];
let mock_spans = vec![Span::ocr(
[45.0, 45.0, 155.0, 75.0],
0.7,
"OCR Text".to_string(),
)];
let call_count = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));
let mock_ocr = MockOcrCallback {
call_count,
@ -845,7 +884,11 @@ mod tests {
// With IoU > 0.5 and vector confidence >= 0.5, vector should win
// Result should have only 1 span (the vector span)
assert_eq!(result.len(), 1, "Should have only 1 span after merge (vector wins)");
assert_eq!(
result.len(),
1,
"Should have only 1 span after merge (vector wins)"
);
assert_eq!(result[0].source, SpanSource::Vector);
assert_eq!(result[0].text, "Vector Text");
}
@ -860,14 +903,18 @@ mod tests {
let classification = PageClassification::hybrid(0.75, cells);
// Vector span with low confidence
let vector_spans = vec![
Span::vector([50.0, 50.0, 150.0, 70.0], 0.2, "Bad Vector".to_string()),
];
let vector_spans = vec![Span::vector(
[50.0, 50.0, 150.0, 70.0],
0.2,
"Bad Vector".to_string(),
)];
// OCR span with high confidence, overlapping vector
let mock_spans = vec![
Span::ocr([45.0, 45.0, 155.0, 75.0], 0.7, "Good OCR".to_string()),
];
let mock_spans = vec![Span::ocr(
[45.0, 45.0, 155.0, 75.0],
0.7,
"Good OCR".to_string(),
)];
let call_count = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));
let mock_ocr = MockOcrCallback {
call_count,
@ -888,7 +935,11 @@ mod tests {
// With IoU > 0.5 but vector confidence < 0.5, OCR should be kept
// Result should have 2 spans (both vector and OCR kept)
assert_eq!(result.len(), 2, "Both vector and OCR should be kept when vector confidence is low");
assert_eq!(
result.len(),
2,
"Both vector and OCR should be kept when vector confidence is low"
);
assert!(result.iter().any(|s| s.source == SpanSource::Vector));
assert!(result.iter().any(|s| s.source == SpanSource::Ocr));
}
@ -898,9 +949,11 @@ mod tests {
// Test that non-hybrid classifications return only vector spans
let classification = PageClassification::new(PageClass::Vector, 0.9);
let vector_spans = vec![
Span::vector([50.0, 50.0, 150.0, 70.0], 0.9, "Vector Only".to_string()),
];
let vector_spans = vec![Span::vector(
[50.0, 50.0, 150.0, 70.0],
0.9,
"Vector Only".to_string(),
)];
let call_count = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));
let mock_ocr = MockOcrCallback {
@ -934,9 +987,11 @@ mod tests {
// Test hybrid classification with empty hybrid_cells
let classification = PageClassification::hybrid(0.75, BTreeSet::new());
let vector_spans = vec![
Span::vector([50.0, 50.0, 150.0, 70.0], 0.9, "Vector".to_string()),
];
let vector_spans = vec![Span::vector(
[50.0, 50.0, 150.0, 70.0],
0.9,
"Vector".to_string(),
)];
let call_count = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));
let mock_ocr = MockOcrCallback {
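
The merge tests in this file rely on intersection-over-union between span bounding boxes. The sketch below shows one way to compute it, assuming bboxes are laid out as `[x0, y0, x1, y1]` with `x0 <= x1` and `y0 <= y1`. The function name `iou` is hypothetical and this is not necessarily how `merge_vector_and_ocr_spans` computes it internally.

```rust
/// Hypothetical helper: intersection-over-union of two axis-aligned boxes.
/// Each box is assumed to be [x0, y0, x1, y1] with x0 <= x1 and y0 <= y1.
fn iou(a: [f64; 4], b: [f64; 4]) -> f64 {
    // Overlap extent on each axis, clamped to zero for disjoint boxes.
    let overlap_w = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let overlap_h = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let intersection = overlap_w * overlap_h;

    let area = |r: [f64; 4]| (r[2] - r[0]) * (r[3] - r[1]);
    let union = area(a) + area(b) - intersection;

    // Degenerate (zero-area) inputs yield 0.0 rather than NaN.
    if union > 0.0 {
        intersection / union
    } else {
        0.0
    }
}
```

For the fixtures used in the tests, `[0, 0, 100, 100]` against `[40, 0, 100, 100]` gives 6000 / 10000 = 0.6, and `[50, 50, 150, 70]` against `[45, 45, 155, 75]` gives 2000 / 3300, roughly 0.606.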


@ -84,9 +84,9 @@ impl PageContext {
/// Create a new page context with default values.
pub fn new() -> Self {
Self {
page_body_median: 12.0, // Typical body text is ~12pt
line_height: 14.0, // Typical line spacing is ~1.2x font size
num_columns: 1, // Default single-column layout
page_body_median: 12.0, // Typical body text is ~12pt
line_height: 14.0, // Typical line spacing is ~1.2x font size
num_columns: 1, // Default single-column layout
}
}
@ -180,7 +180,11 @@ pub fn classify_page_captions(blocks: &mut [Block], ctx: &PageContext) {
// Update previous block for next iteration
// Note: we use a reference to the block before any modification
prev_block = if i < blocks.len() { Some(&blocks[i]) } else { None };
prev_block = if i < blocks.len() {
Some(&blocks[i])
} else {
None
};
}
}
@ -206,7 +210,13 @@ mod tests {
fn test_caption_immediately_below_figure() {
// Figure at y=[100, 200], caption at y=[90, 100] (1 line below)
let figure = make_figure([50.0, 100.0, 150.0, 200.0], 0);
let caption = make_block("paragraph", "Figure 1: A chart", 9.0, [50.0, 90.0, 150.0, 100.0], 0);
let caption = make_block(
"paragraph",
"Figure 1: A chart",
9.0,
[50.0, 90.0, 150.0, 100.0],
0,
);
let ctx = PageContext::with_values(12.0, 10.0, 1);
@ -217,7 +227,13 @@ mod tests {
fn test_caption_too_far_below_figure() {
// Figure at y=[100, 200], caption at y=[70, 80] (3 lines below = 30pt)
let figure = make_figure([50.0, 100.0, 150.0, 200.0], 0);
let caption = make_block("paragraph", "Figure 1: A chart", 9.0, [50.0, 70.0, 150.0, 80.0], 0);
let caption = make_block(
"paragraph",
"Figure 1: A chart",
9.0,
[50.0, 70.0, 150.0, 80.0],
0,
);
let ctx = PageContext::with_values(12.0, 10.0, 1);
@ -228,7 +244,13 @@ mod tests {
fn test_caption_font_not_smaller() {
// Caption with same font size as body text
let figure = make_figure([50.0, 100.0, 150.0, 200.0], 0);
let not_caption = make_block("paragraph", "Figure 1: A chart", 12.0, [50.0, 90.0, 150.0, 100.0], 0);
let not_caption = make_block(
"paragraph",
"Figure 1: A chart",
12.0,
[50.0, 90.0, 150.0, 100.0],
0,
);
let ctx = PageContext::with_values(12.0, 10.0, 1);
@ -239,7 +261,13 @@ mod tests {
fn test_caption_different_column() {
// Figure in column 0, caption in column 1 (two-column layout)
let figure = make_figure([50.0, 100.0, 150.0, 200.0], 0);
let caption = make_block("paragraph", "Figure 1: A chart", 9.0, [200.0, 90.0, 300.0, 100.0], 1);
let caption = make_block(
"paragraph",
"Figure 1: A chart",
9.0,
[200.0, 90.0, 300.0, 100.0],
1,
);
let ctx = PageContext::with_values(12.0, 10.0, 2);
@ -258,7 +286,13 @@ mod tests {
#[test]
fn test_caption_above_figure() {
// Caption positioned above the figure (not detected in v0.1.0)
let caption = make_block("paragraph", "Figure 1: A chart", 9.0, [50.0, 200.0, 150.0, 210.0], 0);
let caption = make_block(
"paragraph",
"Figure 1: A chart",
9.0,
[50.0, 200.0, 150.0, 210.0],
0,
);
let figure = make_figure([50.0, 100.0, 150.0, 200.0], 0);
let ctx = PageContext::with_values(12.0, 10.0, 1);
@ -269,9 +303,21 @@ mod tests {
#[test]
fn test_page_classification() {
let mut blocks = vec![
make_figure([50.0, 100.0, 150.0, 200.0], 0), // Figure
make_block("paragraph", "Figure 1: A chart", 9.0, [50.0, 90.0, 150.0, 100.0], 0), // Caption
make_block("paragraph", "Next paragraph", 12.0, [50.0, 70.0, 150.0, 80.0], 0), // Regular text
make_figure([50.0, 100.0, 150.0, 200.0], 0), // Figure
make_block(
"paragraph",
"Figure 1: A chart",
9.0,
[50.0, 90.0, 150.0, 100.0],
0,
), // Caption
make_block(
"paragraph",
"Next paragraph",
12.0,
[50.0, 70.0, 150.0, 80.0],
0,
), // Regular text
];
let ctx = PageContext::with_values(12.0, 10.0, 1);
@ -280,7 +326,7 @@ mod tests {
assert_eq!(blocks[0].kind, "figure");
assert_eq!(blocks[1].kind, "caption");
assert_eq!(blocks[2].kind, "paragraph"); // Unchanged
assert_eq!(blocks[2].kind, "paragraph"); // Unchanged
}
#[test]


@ -254,10 +254,7 @@ mod tests {
#[test]
fn test_union_bboxes_nested() {
// Small box inside larger box
let bboxes = vec![
[0.0, 0.0, 100.0, 100.0],
[25.0, 25.0, 75.0, 75.0],
];
let bboxes = vec![[0.0, 0.0, 100.0, 100.0], [25.0, 25.0, 75.0, 75.0]];
let result = union_bboxes(&bboxes);
// Union should be the larger box
assert_eq!(result, Some([0.0, 0.0, 100.0, 100.0]));
@ -266,10 +263,7 @@ mod tests {
#[test]
fn test_union_bboxes_disjoint() {
// Two disjoint boxes
let bboxes = vec![
[0.0, 0.0, 50.0, 50.0],
[100.0, 100.0, 150.0, 150.0],
];
let bboxes = vec![[0.0, 0.0, 50.0, 50.0], [100.0, 100.0, 150.0, 150.0]];
let result = union_bboxes(&bboxes);
assert_eq!(result, Some([0.0, 0.0, 150.0, 150.0]));
}


@ -12,6 +12,6 @@ pub mod caption;
pub mod line;
pub mod readability;
pub use caption::{Block, PageContext, classify_caption, classify_page_captions};
pub use line::{Line, LineDirection, compute_baseline, union_bboxes, HasBBox};
pub use caption::{classify_caption, classify_page_captions, Block, PageContext};
pub use line::{compute_baseline, union_bboxes, HasBBox, Line, LineDirection};
pub use readability::{aggregate_page_readability, ScoredSpan};


@ -234,10 +234,7 @@ mod tests {
#[test]
fn test_empty_strings() {
let spans = vec![
TestSpan::new("", 0.5),
TestSpan::new("", 0.8),
];
let spans = vec![TestSpan::new("", 0.5), TestSpan::new("", 0.8)];
// All empty -> total_chars = 0 -> return 0.0
assert_eq!(aggregate_page_readability(&spans), 0.0);
}
@ -282,10 +279,7 @@ mod tests {
#[test]
fn test_all_zero_scores() {
let spans = vec![
TestSpan::new("a", 0.0),
TestSpan::new("b", 0.0),
];
let spans = vec![TestSpan::new("a", 0.0), TestSpan::new("b", 0.0)];
assert_eq!(aggregate_page_readability(&spans), 0.0);
}
@ -304,7 +298,10 @@ mod tests {
TestSpan::new("b".repeat(10), 0.5),
];
assert_eq!(aggregate_page_readability(&spans1), aggregate_page_readability(&spans2));
assert_eq!(
aggregate_page_readability(&spans1),
aggregate_page_readability(&spans2)
);
}
#[test]
@ -328,8 +325,8 @@ mod tests {
fn test_zero_width_joiner() {
// Test zero-width joiner and combining marks
let spans = vec![
TestSpan::new("café", 0.9), // 4 chars: c a f é
TestSpan::new("नमस्ते", 0.8), // 6 chars (Hindi namaste)
TestSpan::new("café", 0.9), // 4 chars: c a f é
TestSpan::new("नमस्ते", 0.8), // 6 chars (Hindi namaste)
];
// Total = 10 chars, half = 5
// Cumsum after first = 4, not > 5


@ -46,8 +46,10 @@ use std::sync::OnceLock;
fn anchor_regex() -> &'static Regex {
static REGEX: OnceLock<Regex> = OnceLock::new();
REGEX.get_or_init(|| {
Regex::new(r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->")
.expect("invalid ANCHOR_REGEX")
Regex::new(
r"<!--\s*pdftract:\s*page=(\d+)\s+block=(\d+)\s+bbox=\[([\d.,]+)\]\s+kind=(\w+)\s*-->",
)
.expect("invalid ANCHOR_REGEX")
})
}
@ -71,7 +73,12 @@ pub struct Anchor {
impl Anchor {
/// Create a new anchor from components.
pub fn new(page: usize, block: usize, bbox: [f32; 4], kind: String) -> Self {
Self { page, block, bbox, kind }
Self {
page,
block,
bbox,
kind,
}
}
/// Format this anchor as an HTML comment.
@ -90,7 +97,13 @@ impl Anchor {
pub fn to_comment(&self) -> String {
format!(
"<!-- pdftract: page={} block={} bbox=[{:.1},{:.1},{:.1},{:.1}] kind={} -->",
self.page, self.block, self.bbox[0], self.bbox[1], self.bbox[2], self.bbox[3], self.kind
self.page,
self.block,
self.bbox[0],
self.bbox[1],
self.bbox[2],
self.bbox[3],
self.kind
)
}
}
@ -194,7 +207,12 @@ fn parse_bbox(s: &str) -> Option<[f32; 4]> {
/// # Returns
///
/// A markdown string with optional anchor.
pub fn block_to_markdown(block: &BlockJson, page_index: usize, block_index: usize, include_anchor: bool) -> String {
pub fn block_to_markdown(
block: &BlockJson,
page_index: usize,
block_index: usize,
include_anchor: bool,
) -> String {
let mut result = String::new();
// Add anchor comment if requested
@ -202,7 +220,12 @@ pub fn block_to_markdown(block: &BlockJson, page_index: usize, block_index: usiz
let anchor = Anchor::new(
page_index,
block_index,
[block.bbox[0] as f32, block.bbox[1] as f32, block.bbox[2] as f32, block.bbox[3] as f32],
[
block.bbox[0] as f32,
block.bbox[1] as f32,
block.bbox[2] as f32,
block.bbox[3] as f32,
],
block.kind.clone(),
);
result.push_str(&anchor.to_comment());
@ -251,7 +274,12 @@ pub fn block_to_markdown(block: &BlockJson, page_index: usize, block_index: usiz
/// # Returns
///
/// A markdown string with all blocks from the page.
pub fn page_to_markdown(blocks: &[BlockJson], page_index: usize, include_anchor: bool, include_page_break: bool) -> String {
pub fn page_to_markdown(
blocks: &[BlockJson],
page_index: usize,
include_anchor: bool,
include_page_break: bool,
) -> String {
let mut result = String::new();
for (block_index, block) in blocks.iter().enumerate() {
@ -288,15 +316,26 @@ mod tests {
fn test_anchor_to_comment() {
let anchor = Anchor::new(3, 12, [72.0, 640.5, 540.0, 672.0], "heading".to_string());
let comment = anchor.to_comment();
assert_eq!(comment, "<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading -->");
assert_eq!(
comment,
"<!-- pdftract: page=3 block=12 bbox=[72.0,640.5,540.0,672.0] kind=heading -->"
);
}
#[test]
fn test_anchor_to_comment_round_bbox() {
let anchor = Anchor::new(0, 0, [72.123, 640.567, 540.999, 672.111], "paragraph".to_string());
let anchor = Anchor::new(
0,
0,
[72.123, 640.567, 540.999, 672.111],
"paragraph".to_string(),
);
let comment = anchor.to_comment();
// Should be rounded to 1 decimal place
assert_eq!(comment, "<!-- pdftract: page=0 block=0 bbox=[72.1,640.6,541.0,672.1] kind=paragraph -->");
assert_eq!(
comment,
"<!-- pdftract: page=0 block=0 bbox=[72.1,640.6,541.0,672.1] kind=paragraph -->"
);
}
#[test]
@ -342,16 +381,23 @@ Some text."#;
#[test]
fn test_parse_anchors_whitespace_tolerant() {
let md = r#"<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->"#;
let md =
r#"<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->"#;
let anchors = parse_anchors(md);
assert_eq!(anchors.len(), 1);
}
#[test]
fn test_parse_bbox() {
assert_eq!(parse_bbox("72.0,640.5,540.0,672.0"), Some([72.0, 640.5, 540.0, 672.0]));
assert_eq!(
parse_bbox("72.0,640.5,540.0,672.0"),
Some([72.0, 640.5, 540.0, 672.0])
);
assert_eq!(parse_bbox("0,0,100,100"), Some([0.0, 0.0, 100.0, 100.0]));
assert_eq!(parse_bbox("72.0, 640.5, 540.0, 672.0"), Some([72.0, 640.5, 540.0, 672.0])); // with spaces
assert_eq!(
parse_bbox("72.0, 640.5, 540.0, 672.0"),
Some([72.0, 640.5, 540.0, 672.0])
); // with spaces
assert_eq!(parse_bbox("invalid"), None);
assert_eq!(parse_bbox("1,2,3"), None); // too few values
assert_eq!(parse_bbox("1,2,3,4,5"), None); // too many values
@ -369,7 +415,9 @@ Some text."#;
};
let md = block_to_markdown(&block, 0, 0, true);
assert!(md.contains("<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->"));
assert!(md.contains(
"<!-- pdftract: page=0 block=0 bbox=[72.0,640.5,540.0,672.0] kind=heading -->"
));
assert!(md.contains("## Chapter 1"));
}
@ -438,16 +486,14 @@ Some text."#;
#[test]
fn test_roundtrip_extract_and_parse() {
let blocks = vec![
BlockJson {
kind: "heading".to_string(),
text: "Chapter 1".to_string(),
bbox: [72.0, 640.5, 540.0, 672.0],
level: Some(2),
table_index: None,
receipt: None,
},
];
let blocks = vec![BlockJson {
kind: "heading".to_string(),
text: "Chapter 1".to_string(),
bbox: [72.0, 640.5, 540.0, 672.0],
level: Some(2),
table_index: None,
receipt: None,
}];
let md = page_to_markdown(&blocks, 3, true, false);
let anchors = parse_anchors(&md);


@ -204,7 +204,10 @@ fn resolve_tessdata_dir() -> Option<PathBuf> {
///
/// - `detect_available_languages` for pack detection logic
/// - Phase 5.4 in the plan for OCR language pack handling
pub fn validate_ocr_languages(requested_langs: &[String], diagnostics: &mut Vec<crate::diagnostics::Diagnostic>) -> String {
pub fn validate_ocr_languages(
requested_langs: &[String],
diagnostics: &mut Vec<crate::diagnostics::Diagnostic>,
) -> String {
let available = detect_available_languages();
// Track which requested languages are available
@ -217,12 +220,10 @@ pub fn validate_ocr_languages(requested_langs: &[String], diagnostics: &mut Vec<
} else {
missing_langs.push(lang);
// Emit diagnostic for missing language
diagnostics.push(
crate::diagnostics::Diagnostic::with_dynamic_no_offset(
crate::diagnostics::DiagCode::OcrLanguageUnavailable,
format!("Requested OCR language pack '{}' is not installed", lang),
)
);
diagnostics.push(crate::diagnostics::Diagnostic::with_dynamic_no_offset(
crate::diagnostics::DiagCode::OcrLanguageUnavailable,
format!("Requested OCR language pack '{}' is not installed", lang),
));
}
}
@ -242,12 +243,10 @@ pub fn validate_ocr_languages(requested_langs: &[String], diagnostics: &mut Vec<
return "eng".to_string();
} else {
// No languages available at all - this will cause Tesseract init to fail
diagnostics.push(
crate::diagnostics::Diagnostic::with_dynamic_no_offset(
crate::diagnostics::DiagCode::OcrLanguageUnavailable,
"No OCR language packs available (including fallback 'eng')".to_string(),
)
);
diagnostics.push(crate::diagnostics::Diagnostic::with_dynamic_no_offset(
crate::diagnostics::DiagCode::OcrLanguageUnavailable,
"No OCR language packs available (including fallback 'eng')".to_string(),
));
return "eng".to_string(); // Still return eng; Tesseract will fail with clear error
}
}
@ -418,7 +417,8 @@ impl TessState {
.map_err(|e| format!("Invalid language string: {}", e))?;
let init_result = if let Some(ref path) = tessdata_path {
let path_str = path.to_str()
let path_str = path
.to_str()
.ok_or_else(|| format!("Tessdata path contains invalid UTF-8: {:?}", path))?;
let path_cstr = CString::new(path_str)
.map_err(|e| format!("Invalid tessdata path string: {}", e))?;
@ -432,9 +432,7 @@ impl TessState {
format!(
"Failed to initialize Tesseract (language='{}', tessdata_path={:?}): {}. \
Ensure language data files are installed (see `pdftract doctor tesseract-langs`).",
opts.language,
tessdata_path,
e
opts.language, tessdata_path, e
)
})?;
@ -523,15 +521,16 @@ pub fn borrow_or_init(opts: &TessOpts) -> std::cell::RefMut<'static, Option<Tess
match state_ref.as_ref() {
// No cached instance - initialize
None => {
*state_ref = Some(TessState::new(opts.clone())
.expect("Tesseract initialization failed"));
*state_ref =
Some(TessState::new(opts.clone()).expect("Tesseract initialization failed"));
}
// Cached instance exists - check if opts match
Some(cached) => {
if cached.opts() != opts {
// Opts changed - reinitialize
*state_ref = Some(TessState::new(opts.clone())
.expect("Tesseract reinitialization failed"));
*state_ref = Some(
TessState::new(opts.clone()).expect("Tesseract reinitialization failed"),
);
}
// else: opts match, reuse cached instance
}
@ -653,7 +652,11 @@ mod tests {
let _state = borrow_or_init(&opts);
}
assert_eq!(init_count(), 1, "Should have exactly 1 init (first call only)");
assert_eq!(
init_count(),
1,
"Should have exactly 1 init (first call only)"
);
});
if init_result.is_err() {
@ -724,7 +727,10 @@ mod tests {
count
);
println!("Multithreaded test: {} inits for 100 pages across rayon workers", count);
println!(
"Multithreaded test: {} inits for 100 pages across rayon workers",
count
);
});
if init_result.is_err() {
@ -1028,7 +1034,12 @@ impl HocrWord {
// Step 5: Add cell origin if this is from a hybrid cell OCR
let (pdf_x0, pdf_y0, pdf_x1, pdf_y1) = if let Some([cell_x, cell_y]) = cell_origin {
(pdf_x0 + cell_x, pdf_y0 + cell_y, pdf_x1 + cell_x, pdf_y1 + cell_y)
(
pdf_x0 + cell_x,
pdf_y0 + cell_y,
pdf_x1 + cell_x,
pdf_y1 + cell_y,
)
} else {
(pdf_x0, pdf_y0, pdf_x1, pdf_y1)
};
@ -1220,10 +1231,7 @@ fn is_ocrx_word(element: &quick_xml::events::BytesStart) -> bool {
}
/// Get an attribute value from an element.
fn get_attribute<'a>(
element: &'a quick_xml::events::BytesStart<'a>,
name: &str,
) -> Option<String> {
fn get_attribute<'a>(element: &'a quick_xml::events::BytesStart<'a>, name: &str) -> Option<String> {
element
.attributes()
.filter_map(|a| a.ok())
@ -1250,13 +1258,17 @@ fn parse_title_attribute(title: &str) -> Result<([u32; 4], u8), String> {
// Parse bbox coordinates: "bbox x0 y0 x1 y1"
let coords: Vec<&str> = parts.collect();
if coords.len() >= 4 {
let x0 = coords[0].parse::<u32>()
let x0 = coords[0]
.parse::<u32>()
.map_err(|_| format!("Invalid bbox x0: {}", coords[0]))?;
let y0 = coords[1].parse::<u32>()
let y0 = coords[1]
.parse::<u32>()
.map_err(|_| format!("Invalid bbox y0: {}", coords[1]))?;
let x1 = coords[2].parse::<u32>()
let x1 = coords[2]
.parse::<u32>()
.map_err(|_| format!("Invalid bbox x1: {}", coords[2]))?;
let y1 = coords[3].parse::<u32>()
let y1 = coords[3]
.parse::<u32>()
.map_err(|_| format!("Invalid bbox y1: {}", coords[3]))?;
bbox = Some([x0, y0, x1, y1]);
@ -1265,7 +1277,8 @@ fn parse_title_attribute(title: &str) -> Result<([u32; 4], u8), String> {
Some("x_wconf") => {
// Parse confidence: "x_wconf NNN"
if let Some(conf_str) = parts.next() {
let conf = conf_str.parse::<u8>()
let conf = conf_str
.parse::<u8>()
.map_err(|_| format!("Invalid x_wconf: {}", conf_str))?;
confidence = Some(conf);
}
@ -1540,7 +1553,12 @@ mod hocr_tests {
let y = (i / 600) * 30;
hocr.push_str(&format!(
"<span class='ocrx_word' title='bbox {} {} {} {}; x_wconf {}'>word{}</span>",
x, y, x + 50, y + 20, 85 + (i % 15), i
x,
y,
x + 50,
y + 20,
85 + (i % 15),
i
));
}
hocr.push_str("</body></html>");
@ -1553,8 +1571,11 @@ mod hocr_tests {
assert_eq!(words.len(), 1000);
// Should be fast: the assertion below allows up to 50ms for 1000 words
assert!(elapsed < std::time::Duration::from_millis(50),
"HOCR parsing took {:?}, expected < 50ms", elapsed);
assert!(
elapsed < std::time::Duration::from_millis(50),
"HOCR parsing took {:?}, expected < 50ms",
elapsed
);
}
#[test]
@ -1609,7 +1630,10 @@ mod hocr_tests {
if let Ok(quick_xml::events::Event::Start(e)) = reader.read_event_into(&mut buf) {
assert_eq!(get_attribute(&e, "class"), Some("ocrx_word".to_string()));
assert_eq!(get_attribute(&e, "id"), Some("test".to_string()));
assert_eq!(get_attribute(&e, "title"), Some("bbox 0 0 50 20".to_string()));
assert_eq!(
get_attribute(&e, "title"),
Some("bbox 0 0 50 20".to_string())
);
assert_eq!(get_attribute(&e, "missing"), None);
}
}
@ -1632,15 +1656,31 @@ mod hocr_tests {
let bbox = word.to_pdf_bbox(300, 792.0, None, None);
// Check X coordinates (unchanged by Y-flip)
assert!((bbox[0] - 0.0).abs() < 0.1, "x0 should be ~0.0, got {}", bbox[0]);
assert!((bbox[2] - 21.6).abs() < 0.1, "x1 should be ~21.6, got {}", bbox[2]);
assert!(
(bbox[0] - 0.0).abs() < 0.1,
"x0 should be ~0.0, got {}",
bbox[0]
);
assert!(
(bbox[2] - 21.6).abs() < 0.1,
"x1 should be ~21.6, got {}",
bbox[2]
);
// Check Y coordinates (flipped)
// Scale at 300 DPI is 72/300 = 0.24 pt per px
// y1_pt = 20 * 0.24 = 4.8, so pdf_y0 = 792 - 4.8 = 787.2
// y0_pt = 0, so pdf_y1 = 792 - 0 = 792
assert!((bbox[1] - 787.2).abs() < 0.1, "y0 should be ~787.2, got {}", bbox[1]);
assert!((bbox[3] - 792.0).abs() < 0.1, "y1 should be ~792.0, got {}", bbox[3]);
assert!(
(bbox[1] - 787.2).abs() < 0.1,
"y0 should be ~787.2, got {}",
bbox[1]
);
assert!(
(bbox[3] - 792.0).abs() < 0.1,
"y1 should be ~792.0, got {}",
bbox[3]
);
}
#[test]
@ -1688,9 +1728,15 @@ mod hocr_tests {
let bbox = word.to_pdf_bbox(300, 792.0, None, None);
// After padding subtraction, x0 and y0 should be at 0 (page origin)
assert!((bbox[0] - 0.0).abs() < 0.1, "x0 should be ~0.0 after padding subtraction");
assert!(
(bbox[0] - 0.0).abs() < 0.1,
"x0 should be ~0.0 after padding subtraction"
);
// y0 should be near page height (top of page after Y-flip)
assert!(bbox[1] > 780.0, "y0 should be near top of page after Y-flip");
assert!(
bbox[1] > 780.0,
"y0 should be near top of page after Y-flip"
);
}
#[test]
@ -1705,17 +1751,29 @@ mod hocr_tests {
// At 300 DPI: 100px * 72/300 = 24pt
let bbox_300 = word.to_pdf_bbox(300, 792.0, None, None);
let width_300 = bbox_300[2] - bbox_300[0];
assert!((width_300 - 24.0).abs() < 0.1, "Width at 300 DPI should be ~24pt, got {}", width_300);
assert!(
(width_300 - 24.0).abs() < 0.1,
"Width at 300 DPI should be ~24pt, got {}",
width_300
);
// At 200 DPI: 100px * 72/200 = 36pt
let bbox_200 = word.to_pdf_bbox(200, 792.0, None, None);
let width_200 = bbox_200[2] - bbox_200[0];
assert!((width_200 - 36.0).abs() < 0.1, "Width at 200 DPI should be ~36pt, got {}", width_200);
assert!(
(width_200 - 36.0).abs() < 0.1,
"Width at 200 DPI should be ~36pt, got {}",
width_200
);
// At 400 DPI: 100px * 72/400 = 18pt
let bbox_400 = word.to_pdf_bbox(400, 792.0, None, None);
let width_400 = bbox_400[2] - bbox_400[0];
assert!((width_400 - 18.0).abs() < 0.1, "Width at 400 DPI should be ~18pt, got {}", width_400);
assert!(
(width_400 - 18.0).abs() < 0.1,
"Width at 400 DPI should be ~18pt, got {}",
width_400
);
}
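The DPI arithmetic these tests rely on (72 pt per inch divided by the render DPI, followed by a Y-flip against the page height) can be sketched in isolation. This is a minimal illustration, not the real `to_pdf_bbox`: the function name `px_bbox_to_pdf` is hypothetical, and padding, rotation, and cell-origin handling are deliberately left out.

```rust
/// Hypothetical helper: convert a pixel bbox [x0, y0, x1, y1] (origin top-left,
/// y grows downward) to PDF points (origin bottom-left, y grows upward).
/// Assumption: only DPI scaling and the Y-flip are applied.
fn px_bbox_to_pdf(bbox_px: [u32; 4], dpi: u32, page_height_pt: f64) -> [f64; 4] {
    // 72 PDF points per inch, `dpi` pixels per inch
    let scale = 72.0 / f64::from(dpi);
    let [x0, y0, x1, y1] = bbox_px.map(|v| f64::from(v) * scale);
    // The pixel bottom edge (y1) becomes the PDF lower edge after the flip
    [x0, page_height_pt - y1, x1, page_height_pt - y0]
}
```

Under these assumptions a 100 px wide word is 24 pt wide at 300 DPI, 36 pt at 200 DPI, and 18 pt at 400 DPI, which matches the widths asserted above.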
#[test]
@ -1736,11 +1794,15 @@ mod hocr_tests {
let bbox = word.to_pdf_bbox(300, 99.0, None, Some(cell_origin));
// X should be offset by cell origin
assert!((bbox[0] - (229.5 + 10.0 * 72.0 / 300.0)).abs() < 1.0,
"x0 should include cell origin offset");
assert!(
(bbox[0] - (229.5 + 10.0 * 72.0 / 300.0)).abs() < 1.0,
"x0 should include cell origin offset"
);
// Y should be offset by cell origin (note: cell height is 99pt)
assert!((bbox[1] - (594.0 + 10.0 * 72.0 / 300.0)).abs() < 1.0,
"y0 should include cell origin offset");
assert!(
(bbox[1] - (594.0 + 10.0 * 72.0 / 300.0)).abs() < 1.0,
"y0 should include cell origin offset"
);
}
#[test]
@ -1776,8 +1838,10 @@ mod hocr_tests {
// After 90-degree rotation, the bbox should be transformed
// The exact values depend on the rotation implementation
// Just verify that the rotation changes the coordinates
assert!(bbox_rot_90[0] != bbox_no_rot[0] || bbox_rot_90[1] != bbox_no_rot[1],
"Rotation should change coordinates");
assert!(
bbox_rot_90[0] != bbox_no_rot[0] || bbox_rot_90[1] != bbox_no_rot[1],
"Rotation should change coordinates"
);
}
#[test]
@ -1825,8 +1889,14 @@ mod hocr_tests {
let bbox_invalid = word.to_pdf_bbox(300, 792.0, Some(45), None); // 45° is not supported
// Invalid rotation should return unchanged bbox
assert!((bbox_invalid[0] - bbox_no_rot[0]).abs() < 0.01, "Invalid rotation should not change x0");
assert!((bbox_invalid[1] - bbox_no_rot[1]).abs() < 0.01, "Invalid rotation should not change y0");
assert!(
(bbox_invalid[0] - bbox_no_rot[0]).abs() < 0.01,
"Invalid rotation should not change x0"
);
assert!(
(bbox_invalid[1] - bbox_no_rot[1]).abs() < 0.01,
"Invalid rotation should not change y0"
);
}
#[test]
@ -1851,8 +1921,16 @@ mod hocr_tests {
// At 300 DPI: 40px = 9.6pt, 20px = 4.8pt
// Allow some tolerance for floating-point errors
assert!((width - 9.6).abs() < 0.2, "Width should be ~9.6pt at {}° rotation", rot);
assert!((height - 4.8).abs() < 0.2, "Height should be ~4.8pt at {}° rotation", rot);
assert!(
(width - 9.6).abs() < 0.2,
"Width should be ~9.6pt at {}° rotation",
rot
);
assert!(
(height - 4.8).abs() < 0.2,
"Height should be ~4.8pt at {}° rotation",
rot
);
}
}
}
@ -1952,11 +2030,7 @@ pub fn run_tesseract(
.into_iter()
.map(|word| {
let pdf_bbox = word.to_pdf_bbox(dpi, page_height_pt, None, None);
crate::hybrid::Span::ocr(
pdf_bbox,
word.confidence(),
word.text,
)
crate::hybrid::Span::ocr(pdf_bbox, word.confidence(), word.text)
})
.collect();
@ -2016,11 +2090,7 @@ pub fn run_tesseract_on_cell(
.into_iter()
.map(|word| {
let pdf_bbox = word.to_pdf_bbox(dpi, cell_height_pt, None, Some(cell_origin));
crate::hybrid::Span::ocr(
pdf_bbox,
word.confidence(),
word.text,
)
crate::hybrid::Span::ocr(pdf_bbox, word.confidence(), word.text)
})
.collect();
@ -2041,9 +2111,7 @@ mod integration_tests {
let opts = TessOpts::default();
let result = std::panic::catch_unwind(|| {
run_tesseract(&img, 300, 792.0, &opts)
});
let result = std::panic::catch_unwind(|| run_tesseract(&img, 300, 792.0, &opts));
if result.is_err() {
// Tesseract not available - skip gracefully
@ -2064,9 +2132,8 @@ mod integration_tests {
let opts = TessOpts::default();
let cell_origin = [100.0, 200.0];
let result = std::panic::catch_unwind(|| {
run_tesseract_on_cell(&img, 300, 99.0, cell_origin, &opts)
});
let result =
std::panic::catch_unwind(|| run_tesseract_on_cell(&img, 300, 99.0, cell_origin, &opts));
if result.is_err() {
println!("Skipping test_run_tesseract_on_cell_offset: Tesseract not available");
@ -2160,7 +2227,9 @@ pub fn calculate_wer(ocr_output: &str, ground_truth: &str) -> f64 {
/// A `Vec<String>` of normalized words.
fn normalize_text(text: &str) -> Vec<String> {
// Define punctuation to strip
let punct = ['.', ',', '!', '?', ';', ':', '"', '\'', '(', ')', '[', ']', '{', '}'];
let punct = [
'.', ',', '!', '?', ';', ':', '"', '\'', '(', ')', '[', ']', '{', '}',
];
text.to_lowercase()
.split_whitespace()
@ -2202,9 +2271,9 @@ fn word_edit_distance(ocr: &[String], reference: &[String]) -> (usize, usize, us
dp[i][j] = dp[i - 1][j - 1]; // No operation needed
} else {
dp[i][j] = [
dp[i - 1][j] + 1, // Deletion
dp[i][j - 1] + 1, // Insertion
dp[i - 1][j - 1] + 1, // Substitution
dp[i - 1][j] + 1, // Deletion
dp[i][j - 1] + 1, // Insertion
dp[i - 1][j - 1] + 1, // Substitution
]
.into_iter()
.min()
@ -2241,14 +2310,285 @@ fn word_edit_distance(ocr: &[String], reference: &[String]) -> (usize, usize, us
j -= 1;
} else {
// Default case (shouldn't happen in valid backtracking)
if i > 0 { i -= 1; }
if j > 0 { j -= 1; }
if i > 0 {
i -= 1;
}
if j > 0 {
j -= 1;
}
}
}
(substitutions, insertions, deletions)
}
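For readers who only need the total error count rather than the substitution/insertion/deletion breakdown, the same dynamic program can be sketched with two rolling rows. This is an illustrative sketch, not the code in this file: `word_distance` and `wer` are hypothetical names, the inputs are assumed to be already normalized, and returning 0.0 when both inputs are empty is an assumption.

```rust
/// Word-level Levenshtein distance (substitutions + insertions + deletions).
fn word_distance(ocr: &[&str], reference: &[&str]) -> usize {
    // prev[j] = distance between the OCR words processed so far and reference[..j]
    let mut prev: Vec<usize> = (0..=reference.len()).collect();
    for (i, o) in ocr.iter().enumerate() {
        // Matching i + 1 OCR words against an empty reference costs i + 1 edits
        let mut curr = vec![i + 1];
        for (j, r) in reference.iter().enumerate() {
            // Diagonal move: free on a match, one substitution otherwise
            let diag = if o == r { prev[j] } else { prev[j] + 1 };
            curr.push(diag.min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[reference.len()]
}

/// Word error rate: total edits divided by the reference length.
/// Assumption: empty reference gives 1.0 for non-empty OCR and 0.0 otherwise.
fn wer(ocr: &[&str], reference: &[&str]) -> f64 {
    if reference.is_empty() {
        return if ocr.is_empty() { 0.0 } else { 1.0 };
    }
    word_distance(ocr, reference) as f64 / reference.len() as f64
}
```

The rolling-row form uses O(M) memory but cannot backtrack, so it is only suitable when the per-operation counts are not needed.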
// ============ Assisted OCR Validation Filter (Phase 5.5.2) ============
use crate::content_stream::Glyph;
/// Distance threshold for assisted-OCR position validation (in PDF points).
///
/// If the center-to-center distance between an OCR word and the nearest
/// vector glyph is less than this value, the OCR word is accepted with its
/// full confidence. Otherwise, confidence is capped at 0.4.
///
/// 5 pt is approximately one space-character width at 12 pt font size.
const ASSISTED_OCR_DISTANCE_PT: f64 = 5.0;
/// Confidence cap for OCR words that fail position validation.
///
/// This value is below the 0.5 threshold used in bbox-merge (Phase 5.2.4),
/// ensuring that OCR spans that fail position validation won't be preferred
/// over legitimate vector spans.
const ASSISTED_OCR_CONFIDENCE_CAP: f32 = 0.4;
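/// Minimum glyph count to justify building a KD-tree.
///
/// For small N (< 100), linear scan is faster due to lower overhead.
/// Reserved for a future optimization: `validate_ocr_with_position_hints`
/// currently always uses a linear scan and does not consult this value.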
const ASSISTED_OCR_KDTREE_THRESHOLD: usize = 100;
/// Validate OCR words against vector glyph position hints.
///
/// This function implements the per-word validation filter for the
/// BrokenVector assisted-OCR path (Phase 5.5.2). For each Tesseract word,
/// it finds the nearest vector glyph bbox center and checks the distance:
///
/// - If distance < 5 pt: accept word with full OCR confidence
/// - If distance >= 5 pt: cap confidence at 0.4
///
/// The 5pt threshold filters OCR text where positions disagree with the
/// vector layer, indicating either OCR-of-OCR garbage or hallucinated text.
///
/// # Arguments
///
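/// * `hocr_words` - OCR words from Tesseract (pixel coordinates; converted to PDF points internally)
/// * `vector_glyphs` - Position hints from Phase 3 (PositionHint mode)
/// * `dpi` - Resolution the page image was rendered at for OCR
/// * `page_height_pt` - Page height in PDF points, used for the Y-flip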
///
/// # Returns
///
/// A `Vec<Span>` with `SpanSource::OcrAssisted` and adjusted confidence scores.
/// The output preserves HOCR document order.
///
/// # Performance
///
/// - Currently always an O(N*M) linear scan (N = OCR words, M = glyphs)
/// - For >= 100 glyphs, a KD-tree could give O(N*log(M)) (future optimization)
///
/// # Examples
///
/// ```ignore
/// use pdftract_core::ocr::validate_ocr_with_position_hints;
/// use pdftract_core::content_stream::Glyph;
///
/// // Position hints from Phase 3
/// let glyphs = vec![
/// Glyph::position_hint([100.0, 200.0, 110.0, 210.0]),
/// ];
///
/// // OCR words from Tesseract (pixel coordinates at the render DPI)
/// let words = vec![
/// HocrWord { text: "hello".to_string(), bbox_px: [425, 2435, 450, 2456], confidence_0_100: 95 },
/// ];
///
/// let spans = validate_ocr_with_position_hints(&words, &glyphs, 300, 792.0);
/// // Assuming only DPI scaling and the Y-flip apply, the word center lands at
/// // about (105.0, 205.1) pt, next to the glyph center at (105, 205) -> full confidence
/// assert_eq!(spans[0].confidence, 0.95);
/// ```
///
/// # See also
///
/// - Phase 5.5 pipeline step 3 (plan line 1935)
/// - `Glyph::position_hint` for creating position-hint glyphs
pub fn validate_ocr_with_position_hints(
hocr_words: &[HocrWord],
vector_glyphs: &[Glyph],
dpi: u32,
page_height_pt: f64,
) -> Vec<crate::hybrid::Span> {
// Build list of vector glyph bbox centers for nearest-neighbor lookup
let glyph_centers: Vec<(f64, f64)> = vector_glyphs
.iter()
.map(|g| {
let bx = g.bbox;
((bx[0] + bx[2]) / 2.0, (bx[1] + bx[3]) / 2.0)
})
.collect();
// For each OCR word, find nearest glyph and validate distance
hocr_words
.iter()
.map(|word| {
let pdf_bbox = word.to_pdf_bbox(dpi, page_height_pt, None, None);
let word_center = (
(pdf_bbox[0] + pdf_bbox[2]) / 2.0,
(pdf_bbox[1] + pdf_bbox[3]) / 2.0,
);
// Find nearest vector glyph center (linear scan - fast enough for N < 100)
let min_distance = glyph_centers
.iter()
.map(|&gx| {
let dx = gx.0 - word_center.0;
let dy = gx.1 - word_center.1;
(dx * dx + dy * dy).sqrt()
})
// f64 is not `Ord`, so `Iterator::min` does not compile here; fold with `f64::min`
.fold(f64::MAX, f64::min); // No glyphs -> max distance
// Apply validation: cap confidence if distance >= 5pt
let ocr_confidence = word.confidence();
let adjusted_confidence = if min_distance < ASSISTED_OCR_DISTANCE_PT {
ocr_confidence
} else {
ocr_confidence.min(ASSISTED_OCR_CONFIDENCE_CAP)
};
crate::hybrid::Span::ocr_assisted(pdf_bbox, adjusted_confidence, word.text.clone())
})
.collect()
}
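The accept-or-cap rule can also be exercised without `HocrWord`, `Glyph`, or `Span`. The sketch below is a standalone illustration under stated assumptions: all bboxes are already in PDF points, and the names `DISTANCE_PT`, `CONFIDENCE_CAP`, `center`, `nearest_distance`, and `adjust_confidence` are hypothetical stand-ins rather than items from this crate.

```rust
/// Hypothetical stand-ins for the constants defined in the diff.
const DISTANCE_PT: f64 = 5.0;
const CONFIDENCE_CAP: f32 = 0.4;

/// Center of an axis-aligned bbox given as [x0, y0, x1, y1].
fn center(b: [f64; 4]) -> (f64, f64) {
    ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)
}

/// Euclidean distance from `p` to the nearest center, or f64::MAX when there are none.
fn nearest_distance(p: (f64, f64), centers: &[(f64, f64)]) -> f64 {
    centers
        .iter()
        .map(|c| ((c.0 - p.0).powi(2) + (c.1 - p.1).powi(2)).sqrt())
        // f64 is only PartialOrd, so fold with f64::min instead of Iterator::min
        .fold(f64::MAX, f64::min)
}

/// Keep the OCR confidence when a glyph center lies strictly within the
/// threshold; otherwise cap it (a value already below the cap is unchanged).
fn adjust_confidence(word_bbox: [f64; 4], glyph_bboxes: &[[f64; 4]], conf: f32) -> f32 {
    let centers: Vec<(f64, f64)> = glyph_bboxes.iter().map(|&b| center(b)).collect();
    if nearest_distance(center(word_bbox), &centers) < DISTANCE_PT {
        conf
    } else {
        conf.min(CONFIDENCE_CAP)
    }
}
```

Because the comparison is strict, a word whose center is exactly 5 pt from the nearest glyph center is capped, as is every word when no position hints exist.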
#[cfg(test)]
mod assisted_ocr_tests {
use super::*;
#[test]
fn test_validation_filter_near_glyph() {
// At 300 DPI (0.24 pt/px) on a 792 pt page, the px center (417, 2467) maps to
// roughly (100.1, 199.9) pt, within 5pt of the glyph center at (100, 200).
// Assumes to_pdf_bbox applies only DPI scaling and the Y-flip here.
let glyphs = vec![Glyph::position_hint([95.0, 195.0, 105.0, 205.0])];
let word = HocrWord {
text: "hello".to_string(),
bbox_px: [407, 2457, 427, 2477],
confidence_0_100: 95,
};
let spans = validate_ocr_with_position_hints(&[word], &glyphs, 300, 792.0);
assert_eq!(spans.len(), 1);
// Should accept full confidence since distance < 5pt
assert!((spans[0].confidence - 0.95).abs() < f32::EPSILON);
assert_eq!(spans[0].source, crate::hybrid::SpanSource::OcrAssisted);
assert_eq!(spans[0].text, "hello");
}
#[test]
fn test_validation_filter_far_from_glyph() {
// OCR word center at roughly (126, 670) pt is far more than 5pt from the glyph center at (100, 200)
let glyphs = vec![Glyph::position_hint([95.0, 195.0, 105.0, 205.0])];
let word = HocrWord {
text: "world".to_string(),
bbox_px: [500, 500, 550, 520], // Far from glyph
confidence_0_100: 95,
};
let spans = validate_ocr_with_position_hints(&[word], &glyphs, 300, 792.0);
assert_eq!(spans.len(), 1);
// Should cap confidence at 0.4 since distance >= 5pt
assert_eq!(spans[0].confidence, ASSISTED_OCR_CONFIDENCE_CAP);
assert_eq!(spans[0].source, crate::hybrid::SpanSource::OcrAssisted);
}
#[test]
fn test_validation_filter_confidence_already_below_cap() {
// OCR word with low confidence (30%) far from glyph should stay at 30%
let glyphs = vec![Glyph::position_hint([95.0, 195.0, 105.0, 205.0])];
let word = HocrWord {
text: "test".to_string(),
bbox_px: [500, 500, 550, 520],
confidence_0_100: 30,
};
let spans = validate_ocr_with_position_hints(&[word], &glyphs, 300, 792.0);
assert_eq!(spans.len(), 1);
// Should keep original confidence (already below cap)
assert_eq!(spans[0].confidence, 0.3);
}
#[test]
fn test_validation_filter_no_glyphs() {
// No position hints available -> cap all words
let glyphs: Vec<Glyph> = vec![];
let word = HocrWord {
text: "orphan".to_string(),
bbox_px: [100, 100, 150, 120],
confidence_0_100: 90,
};
let spans = validate_ocr_with_position_hints(&[word], &glyphs, 300, 792.0);
assert_eq!(spans.len(), 1);
// No glyphs -> max distance -> cap confidence
assert_eq!(spans[0].confidence, ASSISTED_OCR_CONFIDENCE_CAP);
}
#[test]
fn test_validation_filter_multiple_words_preserves_order() {
// Test that HOCR document order is preserved
let glyphs = vec![
Glyph::position_hint([100.0, 200.0, 110.0, 210.0]),
Glyph::position_hint([200.0, 200.0, 210.0, 210.0]),
];
let words = vec![
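// Pixel boxes chosen so that, with only DPI scaling and the Y-flip applied,
// "first" and "third" land near the glyph centers (105, 205) and (205, 205)
HocrWord {
text: "first".to_string(),
bbox_px: [428, 2436, 448, 2456],
confidence_0_100: 90,
},
HocrWord {
text: "second".to_string(),
bbox_px: [500, 500, 550, 520], // Far from any glyph
confidence_0_100: 85,
},
HocrWord {
text: "third".to_string(),
bbox_px: [844, 2436, 864, 2456],
confidence_0_100: 95,
},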
];
let spans = validate_ocr_with_position_hints(&words, &glyphs, 300, 792.0);
assert_eq!(spans.len(), 3);
assert_eq!(spans[0].text, "first");
assert_eq!(spans[1].text, "second");
assert_eq!(spans[2].text, "third");
// First and third should have full confidence (near glyphs)
assert!((spans[0].confidence - 0.9).abs() < f32::EPSILON);
assert!((spans[2].confidence - 0.95).abs() < f32::EPSILON);
// Second should be capped (far from glyphs)
assert_eq!(spans[1].confidence, ASSISTED_OCR_CONFIDENCE_CAP);
}
#[test]
fn test_validation_filter_distance_threshold() {
// A word far beyond the 5pt threshold must be capped
// (this does not probe the exact 5pt boundary)
let glyphs = vec![Glyph::position_hint([100.0, 200.0, 110.0, 210.0])];
// Word well outside the threshold should be capped
let word_far = HocrWord {
text: "far".to_string(),
bbox_px: [1000, 1000, 1050, 1020],
confidence_0_100: 95,
};
let spans = validate_ocr_with_position_hints(&[word_far], &glyphs, 300, 792.0);
assert_eq!(spans[0].confidence, ASSISTED_OCR_CONFIDENCE_CAP);
}
#[test]
fn test_assisted_ocr_constants() {
// Verify the constants match the plan specification
assert_eq!(ASSISTED_OCR_DISTANCE_PT, 5.0);
assert_eq!(ASSISTED_OCR_CONFIDENCE_CAP, 0.4);
assert_eq!(ASSISTED_OCR_KDTREE_THRESHOLD, 100);
}
}
#[cfg(test)]
mod wer_tests {
use super::*;
@ -2304,13 +2644,19 @@ mod wer_tests {
#[test]
fn test_calculate_wer_empty_reference_nonempty_ocr() {
let wer = calculate_wer("some text", "");
assert_eq!(wer, 1.0, "Non-empty OCR with empty reference should have WER = 1");
assert_eq!(
wer, 1.0,
"Non-empty OCR with empty reference should have WER = 1"
);
}
#[test]
fn test_calculate_wer_empty_ocr_nonempty_reference() {
let wer = calculate_wer("", "some text");
assert_eq!(wer, 1.0, "Empty OCR with non-empty reference should have WER = 1");
assert_eq!(
wer, 1.0,
"Empty OCR with non-empty reference should have WER = 1"
);
}
#[test]
@ -2375,7 +2721,11 @@ mod wer_tests {
#[test]
fn test_word_edit_distance_insertion_deletion() {
let ocr = vec!["hello".to_string(), "there".to_string()];
let reference = vec!["hello".to_string(), "world".to_string(), "there".to_string()];
let reference = vec![
"hello".to_string(),
"world".to_string(),
"there".to_string(),
];
let (sub, ins, del) = word_edit_distance(&ocr, &reference);
// "world" deleted from reference, but also could be seen as insertion
// The algorithm counts it as:


@ -3,9 +3,9 @@
//! This module defines the options that control how PDFs are extracted,
//! including the receipts mode for cryptographic provenance tracking.
use serde::{Deserialize, Serialize};
#[cfg(feature = "schemars")]
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};
/// Receipt generation mode.
///


@ -4,10 +4,10 @@
//! including Pages, Outlines, MarkInfo, StructTreeRoot, AcroForm, Names,
//! Metadata, PageLabels, OCProperties, OpenAction, AA, and Version entries.
use crate::parser::object::{ObjRef, PdfObject, intern};
use crate::parser::xref::XrefResolver;
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::object::{intern, ObjRef, PdfObject};
use crate::parser::ocg::{parse_oc_properties, OcProperties};
use crate::parser::xref::XrefResolver;
/// Result type for catalog parsing.
pub type Result<T> = std::result::Result<T, Vec<Diagnostic>>;
@ -150,9 +150,19 @@ impl PageLabelStyle {
let mut result = String::new();
let values = [
(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
(100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
(1000, "M"),
(900, "CM"),
(500, "D"),
(400, "CD"),
(100, "C"),
(90, "XC"),
(50, "L"),
(40, "XL"),
(10, "X"),
(9, "IX"),
(5, "V"),
(4, "IV"),
(1, "I"),
];
for (val, sym) in values {
@ -208,24 +218,26 @@ impl PageLabel {
fn parse(obj: &PdfObject) -> Option<Self> {
let dict = obj.as_dict()?;
let style = dict.get("S")
let style = dict
.get("S")
.and_then(|o| o.as_name())
.and_then(PageLabelStyle::from_name)
.unwrap_or(PageLabelStyle::Decimal);
let prefix = dict.get("P")
.and_then(|o| {
// Prefix can be either a String or a Name
o.as_string()
.and_then(|bytes| String::from_utf8(bytes.to_vec()).ok())
.or_else(|| o.as_name().map(|s| s.to_string()))
});
let prefix = dict.get("P").and_then(|o| {
// Prefix can be either a String or a Name
o.as_string()
.and_then(|bytes| String::from_utf8(bytes.to_vec()).ok())
.or_else(|| o.as_name().map(|s| s.to_string()))
});
let start = dict.get("St")
.and_then(|o| o.as_int())
.unwrap_or(1);
let start = dict.get("St").and_then(|o| o.as_int()).unwrap_or(1);
Some(PageLabel { style, prefix, start })
Some(PageLabel {
style,
prefix,
start,
})
}
/// Format a label for a given page index.
@ -332,7 +344,8 @@ impl PageLabelsTree {
///
/// Returns the label for the most recent key <= page_index.
pub fn get_label(&self, page_index: i64) -> Option<&PageLabel> {
self.get_label_with_start(page_index).map(|(label, _)| label)
self.get_label_with_start(page_index)
.map(|(label, _)| label)
}
/// Get all labels as a slice.
@ -402,7 +415,8 @@ impl Catalog {
/// Add a diagnostic to the catalog.
fn emit_diagnostic(&mut self, code: DiagCode, message: String) {
self.diagnostics.push(Diagnostic::with_dynamic_no_offset(code, message));
self.diagnostics
.push(Diagnostic::with_dynamic_no_offset(code, message));
}
}
@ -476,7 +490,10 @@ pub fn parse_catalog(resolver: &XrefResolver, root_ref: ObjRef) -> Result<Catalo
// Emit STRUCT_MISSING_KEY diagnostic and return empty catalog
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructMissingKey,
format!("STRUCT_MISSING_KEY: /Pages is not a reference (type: {})", other.type_name()),
format!(
"STRUCT_MISSING_KEY: /Pages is not a reference (type: {})",
other.type_name()
),
));
catalog.diagnostics = diagnostics;
return Ok(catalog);
@ -624,11 +641,26 @@ mod tests {
#[test]
fn test_page_label_style_from_name() {
assert_eq!(PageLabelStyle::from_name("D"), Some(PageLabelStyle::Decimal));
assert_eq!(PageLabelStyle::from_name("R"), Some(PageLabelStyle::RomanUppercase));
assert_eq!(PageLabelStyle::from_name("r"), Some(PageLabelStyle::RomanLowercase));
assert_eq!(PageLabelStyle::from_name("A"), Some(PageLabelStyle::LettersUppercase));
assert_eq!(PageLabelStyle::from_name("a"), Some(PageLabelStyle::LettersLowercase));
assert_eq!(
PageLabelStyle::from_name("D"),
Some(PageLabelStyle::Decimal)
);
assert_eq!(
PageLabelStyle::from_name("R"),
Some(PageLabelStyle::RomanUppercase)
);
assert_eq!(
PageLabelStyle::from_name("r"),
Some(PageLabelStyle::RomanLowercase)
);
assert_eq!(
PageLabelStyle::from_name("A"),
Some(PageLabelStyle::LettersUppercase)
);
assert_eq!(
PageLabelStyle::from_name("a"),
Some(PageLabelStyle::LettersLowercase)
);
assert_eq!(PageLabelStyle::from_name("X"), None);
}
@ -687,26 +719,56 @@ mod tests {
let mut tree = PageLabelsTree::new();
// Page 0-2: roman numerals (i, ii, iii)
tree.labels.push((0, PageLabel {
style: PageLabelStyle::RomanLowercase,
prefix: None,
start: 1,
}));
tree.labels.push((
0,
PageLabel {
style: PageLabelStyle::RomanLowercase,
prefix: None,
start: 1,
},
));
// Page 3+: decimal (1, 2, 3, ...)
tree.labels.push((3, PageLabel {
style: PageLabelStyle::Decimal,
prefix: None,
start: 1,
}));
tree.labels.push((
3,
PageLabel {
style: PageLabelStyle::Decimal,
prefix: None,
start: 1,
},
));
// Test lookups using format_absolute for correct relative indexing
assert_eq!(tree.get_label_with_start(0).map(|(l, start)| l.format_absolute(0, start)), Some("i".to_string()));
assert_eq!(tree.get_label_with_start(1).map(|(l, start)| l.format_absolute(1, start)), Some("ii".to_string()));
assert_eq!(tree.get_label_with_start(2).map(|(l, start)| l.format_absolute(2, start)), Some("iii".to_string()));
assert_eq!(tree.get_label_with_start(3).map(|(l, start)| l.format_absolute(3, start)), Some("1".to_string()));
assert_eq!(tree.get_label_with_start(4).map(|(l, start)| l.format_absolute(4, start)), Some("2".to_string()));
assert_eq!(tree.get_label_with_start(5).map(|(l, start)| l.format_absolute(5, start)), Some("3".to_string()));
assert_eq!(
tree.get_label_with_start(0)
.map(|(l, start)| l.format_absolute(0, start)),
Some("i".to_string())
);
assert_eq!(
tree.get_label_with_start(1)
.map(|(l, start)| l.format_absolute(1, start)),
Some("ii".to_string())
);
assert_eq!(
tree.get_label_with_start(2)
.map(|(l, start)| l.format_absolute(2, start)),
Some("iii".to_string())
);
assert_eq!(
tree.get_label_with_start(3)
.map(|(l, start)| l.format_absolute(3, start)),
Some("1".to_string())
);
assert_eq!(
tree.get_label_with_start(4)
.map(|(l, start)| l.format_absolute(4, start)),
Some("2".to_string())
);
assert_eq!(
tree.get_label_with_start(5)
.map(|(l, start)| l.format_absolute(5, start)),
Some("3".to_string())
);
}
#[test]
@ -782,7 +844,10 @@ mod tests {
// Empty catalog should have pages_ref = ObjRef::new(0, 0) from Default
assert_eq!(catalog.pages_ref, ObjRef::new(0, 0));
// Should have STRUCT_MISSING_KEY diagnostic
assert!(catalog.diagnostics.iter().any(|d| d.message.contains("STRUCT_MISSING_KEY")));
assert!(catalog
.diagnostics
.iter()
.any(|d| d.message.contains("STRUCT_MISSING_KEY")));
}
#[test]
@ -926,22 +991,40 @@ mod tests {
fn test_page_labels_tree_with_prefix() {
let mut tree = PageLabelsTree::new();
tree.labels.push((0, PageLabel {
style: PageLabelStyle::RomanLowercase,
prefix: Some("front-".to_string()),
start: 1,
}));
tree.labels.push((
0,
PageLabel {
style: PageLabelStyle::RomanLowercase,
prefix: Some("front-".to_string()),
start: 1,
},
));
tree.labels.push((3, PageLabel {
style: PageLabelStyle::Decimal,
prefix: None,
start: 1,
}));
tree.labels.push((
3,
PageLabel {
style: PageLabelStyle::Decimal,
prefix: None,
start: 1,
},
));
// Test with prefix using format_absolute for correct relative indexing
assert_eq!(tree.get_label_with_start(0).map(|(l, start)| l.format_absolute(0, start)), Some("front-i".to_string()));
assert_eq!(tree.get_label_with_start(1).map(|(l, start)| l.format_absolute(1, start)), Some("front-ii".to_string()));
assert_eq!(tree.get_label_with_start(3).map(|(l, start)| l.format_absolute(3, start)), Some("1".to_string()));
assert_eq!(
tree.get_label_with_start(0)
.map(|(l, start)| l.format_absolute(0, start)),
Some("front-i".to_string())
);
assert_eq!(
tree.get_label_with_start(1)
.map(|(l, start)| l.format_absolute(1, start)),
Some("front-ii".to_string())
);
assert_eq!(
tree.get_label_with_start(3)
.map(|(l, start)| l.format_absolute(3, start)),
Some("1".to_string())
);
}
// Phase 7.1.4 Coverage Check Tests
@ -955,9 +1038,18 @@ mod tests {
#[test]
fn test_reading_order_algorithm_from_str() {
assert_eq!(ReadingOrderAlgorithm::from_str("struct_tree"), Some(ReadingOrderAlgorithm::StructTree));
assert_eq!(ReadingOrderAlgorithm::from_str("xy_cut"), Some(ReadingOrderAlgorithm::XyCut));
assert_eq!(ReadingOrderAlgorithm::from_str("docstrum"), Some(ReadingOrderAlgorithm::Docstrum));
assert_eq!(
ReadingOrderAlgorithm::from_str("struct_tree"),
Some(ReadingOrderAlgorithm::StructTree)
);
assert_eq!(
ReadingOrderAlgorithm::from_str("xy_cut"),
Some(ReadingOrderAlgorithm::XyCut)
);
assert_eq!(
ReadingOrderAlgorithm::from_str("docstrum"),
Some(ReadingOrderAlgorithm::Docstrum)
);
assert_eq!(ReadingOrderAlgorithm::from_str("unknown"), None);
assert_eq!(ReadingOrderAlgorithm::from_str(""), None);
}
@ -1030,12 +1122,25 @@ mod proptests {
Just(PdfObject::Null),
any::<bool>().prop_map(PdfObject::Bool),
any::<i64>().prop_map(PdfObject::Integer),
any::<f64>().prop_map(|f| if f.is_finite() { PdfObject::Real(f) } else { PdfObject::Real(0.0) }),
any::<f64>().prop_map(|f| if f.is_finite() {
PdfObject::Real(f)
} else {
PdfObject::Real(0.0)
}),
prop::collection::vec(any::<u8>(), 0..100).prop_map(|v| PdfObject::String(Box::new(v))),
"[a-zA-Z]{1,20}".prop_map(|s| PdfObject::Name(intern(&s))),
prop::collection::vec(any::<u8>(), 0..100).prop_map(|bytes| {
// Try to create a valid name from the bytes
let name: String = bytes.iter().map(|&b| if b.is_ascii_alphanumeric() { b as char } else { '_' }).collect();
let name: String = bytes
.iter()
.map(|&b| {
if b.is_ascii_alphanumeric() {
b as char
} else {
'_'
}
})
.collect();
PdfObject::Name(intern(&name))
}),
]
@ -1043,14 +1148,13 @@ mod proptests {
/// Strategy to generate arbitrary dictionaries for catalog fuzzing.
fn arb_catalog_dict() -> impl Strategy<Value = indexmap::IndexMap<Arc<str>, PdfObject>> {
prop::collection::hash_map("[a-zA-Z]{1,10}", arb_pdf_object(0), 0..10)
.prop_map(|map| {
let mut index_map = indexmap::IndexMap::new();
for (k, v) in map {
index_map.insert(k.into(), v);
}
index_map
})
prop::collection::hash_map("[a-zA-Z]{1,10}", arb_pdf_object(0), 0..10).prop_map(|map| {
let mut index_map = indexmap::IndexMap::new();
for (k, v) in map {
index_map.insert(k.into(), v);
}
index_map
})
}
proptest! {


@ -101,7 +101,12 @@ impl Diagnostic {
}
/// Create a new diagnostic with a specific code.
pub fn new_with_code(code: DiagCode, severity: Severity, phase: impl Into<String>, message: impl Into<String>) -> Self {
pub fn new_with_code(
code: DiagCode,
severity: Severity,
phase: impl Into<String>,
message: impl Into<String>,
) -> Self {
Diagnostic {
code,
severity,
@ -131,7 +136,11 @@ impl Diagnostic {
}
/// Create an error diagnostic with a specific code.
pub fn error_with_code(code: DiagCode, phase: impl Into<String>, message: impl Into<String>) -> Self {
pub fn error_with_code(
code: DiagCode,
phase: impl Into<String>,
message: impl Into<String>,
) -> Self {
Diagnostic {
code,
severity: Severity::Error,


@ -3,7 +3,7 @@
//! This module provides the lexer that converts raw PDF byte sequences into tokens.
//! PDF is byte-oriented; position tracking is byte-level, not character-level.
use crate::diagnostics::{Diagnostic as Diag, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic as Diag};
use std::str::FromStr;
/// Token produced by the PDF lexer.
@ -386,7 +386,10 @@ impl<'a> Lexer<'a> {
/// Internal: Skip whitespace and comments.
fn skip_whitespace_and_comments(&mut self) {
loop {
let had_whitespace = self.bytes.first().map_or(false, |&b| Self::is_pdf_whitespace(b));
let had_whitespace = self
.bytes
.first()
.map_or(false, |&b| Self::is_pdf_whitespace(b));
let had_comment = self.bytes.first() == Some(&b'%');
self.consume_whitespace();
@ -398,7 +401,11 @@ impl<'a> Lexer<'a> {
}
// If we consumed a comment, there might be more whitespace after it
// If we consumed whitespace, there might be a comment after it
if self.bytes.first().map_or(true, |&b| !Self::is_pdf_whitespace(b) && b != b'%') {
if self
.bytes
.first()
.map_or(true, |&b| !Self::is_pdf_whitespace(b) && b != b'%')
{
break;
}
}
@ -411,7 +418,9 @@ impl<'a> Lexer<'a> {
// Check for "true"
if self.bytes.starts_with(b"true") {
let next_after = self.bytes.get(4);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(4);
return Some(Token::Bool(true));
}
@ -419,7 +428,9 @@ impl<'a> Lexer<'a> {
// Check for "trailer"
if self.bytes.starts_with(b"trailer") {
let next_after = self.bytes.get(7);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(7);
return Some(Token::Keyword(b"trailer".to_vec()));
}
@ -432,7 +443,9 @@ impl<'a> Lexer<'a> {
// Check for "false"
if self.bytes.starts_with(b"false") {
let next_after = self.bytes.get(5);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(5);
return Some(Token::Bool(false));
}
@ -445,7 +458,9 @@ impl<'a> Lexer<'a> {
// Check for "xref"
if self.bytes.starts_with(b"xref") {
let next_after = self.bytes.get(4);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(4);
return Some(Token::Keyword(b"xref".to_vec()));
}
@ -458,7 +473,9 @@ impl<'a> Lexer<'a> {
// Check for "%%EOF" - the PDF end-of-file marker
if self.bytes.starts_with(b"%%EOF") {
let next_after = self.bytes.get(5);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(5);
return Some(Token::Keyword(b"%%EOF".to_vec()));
}
@ -609,7 +626,10 @@ impl<'a> Lexer<'a> {
self.diagnostics.push(Diag::with_dynamic(
DiagCode::StructIntegerOverflow,
start as u64,
format!("Integer '{}' exceeds i64 range, clamped to i64::MAX", num_str),
format!(
"Integer '{}' exceeds i64 range, clamped to i64::MAX",
num_str
),
));
self.advance(consumed);
Some(Token::Integer(i64::MAX))
@ -959,7 +979,9 @@ impl<'a> Lexer<'a> {
// Check for "stream"
if self.bytes.starts_with(b"stream") {
let next_after = self.bytes.get(6);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(6);
// Validate stream header: must be followed by \n or \r\n
// PDF spec 7.3.8.1: stream keyword must be followed by \n or \r\n
@ -996,7 +1018,9 @@ impl<'a> Lexer<'a> {
// Check for "startxref"
if self.bytes.starts_with(b"startxref") {
let next_after = self.bytes.get(9);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(9);
return Some(Token::Keyword(b"startxref".to_vec()));
}
@ -1009,7 +1033,9 @@ impl<'a> Lexer<'a> {
// Check for "endstream"
if self.bytes.starts_with(b"endstream") {
let next_after = self.bytes.get(9);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(9);
return Some(Token::EndStream);
}
@ -1017,7 +1043,9 @@ impl<'a> Lexer<'a> {
// Check for "endobj"
if self.bytes.starts_with(b"endobj") {
let next_after = self.bytes.get(6);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(6);
return Some(Token::EndObj);
}
@ -1030,7 +1058,9 @@ impl<'a> Lexer<'a> {
// Check for "obj"
if self.bytes.starts_with(b"obj") {
let next_after = self.bytes.get(3);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(3);
return Some(Token::Obj);
}
@ -1042,7 +1072,9 @@ impl<'a> Lexer<'a> {
fn lex_r_keyword(&mut self) -> Option<Token> {
// Check for "R" (indirect reference)
let next_after = self.bytes.get(1);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(1);
Some(Token::IndirectRef)
} else {
@ -1054,7 +1086,9 @@ impl<'a> Lexer<'a> {
// Check for "null"
if self.bytes.starts_with(b"null") {
let next_after = self.bytes.get(4);
if next_after.map_or(true, |&b| Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)) {
if next_after.map_or(true, |&b| {
Self::is_pdf_whitespace(b) || Self::is_pdf_delimiter(b)
}) {
self.advance(4);
return Some(Token::Null);
}
@ -1205,8 +1239,13 @@ mod tests {
let mut lexer = Lexer::new(b"stream body");
assert_eq!(lexer.next_token(), Some(Token::Stream));
let diags = lexer.take_diagnostics();
assert!(!diags.is_empty(), "Should emit diagnostic for stream without proper line ending");
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidStreamHeader));
assert!(
!diags.is_empty(),
"Should emit diagnostic for stream without proper line ending"
);
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidStreamHeader));
}
#[test]
@ -1247,7 +1286,10 @@ mod tests {
#[test]
fn string_literal_simple_text() {
let mut lexer = Lexer::new(b"(Hello World)");
assert_eq!(lexer.next_token(), Some(Token::String(b"Hello World".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"Hello World".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
@ -1274,14 +1316,20 @@ mod tests {
#[test]
fn string_literal_escape_tab() {
let mut lexer = Lexer::new(b"(col1\\tcol2)");
assert_eq!(lexer.next_token(), Some(Token::String(b"col1\tcol2".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"col1\tcol2".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
#[test]
fn string_literal_escape_backspace() {
let mut lexer = Lexer::new(b"(abc\\bdef)");
assert_eq!(lexer.next_token(), Some(Token::String(b"abc\x08def".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"abc\x08def".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
@ -1298,21 +1346,30 @@ mod tests {
#[test]
fn string_literal_escape_backslash() {
let mut lexer = Lexer::new(b"(path\\\\file)");
assert_eq!(lexer.next_token(), Some(Token::String(b"path\\file".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"path\\file".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
#[test]
fn string_literal_escape_left_paren() {
let mut lexer = Lexer::new(b"(\\(nested))");
assert_eq!(lexer.next_token(), Some(Token::String(b"(nested)".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"(nested)".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
#[test]
fn string_literal_escape_right_paren() {
let mut lexer = Lexer::new(b"(\\)not_end)");
assert_eq!(lexer.next_token(), Some(Token::String(b")not_end".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b")not_end".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
@ -1340,7 +1397,10 @@ mod tests {
#[test]
fn string_literal_octal_escape_non_octal_following() {
let mut lexer = Lexer::new(b"(abc\\10A)");
assert_eq!(lexer.next_token(), Some(Token::String(b"abc\x08A".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"abc\x08A".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
@ -1443,7 +1503,10 @@ mod tests {
fn hex_string_mixed_case() {
let mut lexer = Lexer::new(b"<aBcD>");
// aB=0xAB, cD=0xCD
assert_eq!(lexer.next_token(), Some(Token::String(b"\xAB\xCD".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"\xAB\xCD".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
@ -1459,7 +1522,10 @@ mod tests {
fn hex_string_odd_length_multiple_nibbles() {
let mut lexer = Lexer::new(b"<48657>");
// 48=0x48, 65=0x65, 7=0x70 (dangling nibble becomes HIGH nibble with LOW nibble 0)
assert_eq!(lexer.next_token(), Some(Token::String(b"\x48\x65\x70".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"\x48\x65\x70".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
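The odd-length test above depends on the rule that a dangling final nibble becomes the high nibble of a byte whose low nibble is 0. A minimal sketch of that rule follows. `hex_val` and `decode_hex_body` are hypothetical names, not the lexer's actual functions, and the sketch assumes non-hex bytes are silently skipped, whereas the real lexer may emit diagnostics for them.

```rust
/// Map one ASCII hex digit to its value; any other byte yields None.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Decode the bytes found between '<' and '>' of a PDF hex string.
/// Assumption: non-hex bytes (for example whitespace) are skipped.
fn decode_hex_body(body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(body.len() / 2 + 1);
    let mut high: Option<u8> = None;
    for nibble in body.iter().filter_map(|&b| hex_val(b)) {
        match high.take() {
            // Second digit of a pair: combine with the stored high nibble.
            Some(h) => out.push((h << 4) | nibble),
            // First digit of a pair: remember it as the high nibble.
            None => high = Some(nibble),
        }
    }
    // A dangling nibble is treated as if it were followed by '0'.
    if let Some(h) = high {
        out.push(h << 4);
    }
    out
}
```

With this sketch, `decode_hex_body(b"48657")` yields the same three bytes the test expects from the lexer.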
@ -1501,7 +1567,10 @@ mod tests {
#[test]
fn hex_string_all_zero_bytes() {
let mut lexer = Lexer::new(b"<000000>");
assert_eq!(lexer.next_token(), Some(Token::String(b"\x00\x00\x00".to_vec())));
assert_eq!(
lexer.next_token(),
Some(Token::String(b"\x00\x00\x00".to_vec()))
);
assert_eq!(lexer.next_token(), Some(Token::Eof));
}
@ -1579,15 +1648,16 @@ mod tests {
use proptest::prelude::*;
// Generate random byte sequences that start with < (but not << to avoid dict start)
let test_strategy = prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
// Ensure the input starts with '<' but NOT '<<'
// Insert '<' at the start, and ensure the second byte is not '<'
bytes.insert(0, b'<');
if bytes.len() > 1 && bytes[1] == b'<' {
bytes[1] = b'>'; // Change second byte to something non-'<'
}
bytes
});
let test_strategy =
prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
// Ensure the input starts with '<' but NOT '<<'
// Insert '<' at the start, and ensure the second byte is not '<'
bytes.insert(0, b'<');
if bytes.len() > 1 && bytes[1] == b'<' {
bytes[1] = b'>'; // Change second byte to something non-'<'
}
bytes
});
proptest!(|(bytes in test_strategy)| {
// This should never panic
@ -1621,9 +1691,8 @@ mod tests {
}
// Generate valid hex strings and test roundtrip
let test_strategy = prop::collection::vec(prop::num::u8::ANY, 0..100).prop_map(|bytes| {
encode_hex_string(&bytes)
});
let test_strategy = prop::collection::vec(prop::num::u8::ANY, 0..100)
.prop_map(|bytes| encode_hex_string(&bytes));
proptest!(|(encoded in test_strategy)| {
let mut lexer = Lexer::new(&encoded);
@ -1650,11 +1719,12 @@ mod tests {
fn proptest_string_never_panics_on_random_bytes() {
use proptest::prelude::*;
let test_strategy = prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
// Ensure the input starts with '(' to trigger string lexing
bytes.insert(0, b'(');
bytes
});
let test_strategy =
prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
// Ensure the input starts with '(' to trigger string lexing
bytes.insert(0, b'(');
bytes
});
proptest!(|(bytes in test_strategy)| {
// This should never panic
@ -1670,14 +1740,17 @@ mod tests {
// Strategy for generating valid literal strings
// We generate bytes that can appear in a PDF string and wrap them in parens
let test_strategy = prop::collection::vec(
prop::num::u8::ANY
.prop_filter("avoid unprintable and special chars that make testing hard", |&b| {
prop::num::u8::ANY.prop_filter(
"avoid unprintable and special chars that make testing hard",
|&b| {
// Allow most bytes, but filter out some that make roundtripping difficult
// We include parens but balance them manually
!matches!(b, 0x00 | 0x01..=0x08 | 0x0B | 0x0E..=0x1F)
}),
},
),
0..100,
).prop_map(|mut bytes| {
)
.prop_map(|mut bytes| {
// Balance parentheses: for every '(' we add a ')'
let mut depth = 0i32;
let mut result = Vec::new();
@ -1814,7 +1887,10 @@ mod tests {
panic!("Expected Name token");
}
let diags = lexer.take_diagnostics();
assert!(diags.is_empty(), "Expected no diagnostics for exactly 127 bytes");
assert!(
diags.is_empty(),
"Expected no diagnostics for exactly 127 bytes"
);
}
#[test]
@ -1834,7 +1910,10 @@ mod tests {
panic!("Expected Name token");
}
let diags = lexer.take_diagnostics();
assert!(diags.is_empty(), "Expected no diagnostics: 124 A's + #41 = 127 raw bytes");
assert!(
diags.is_empty(),
"Expected no diagnostics: 124 A's + #41 = 127 raw bytes"
);
}
#[test]
@ -1964,11 +2043,12 @@ mod tests {
fn name_proptest_never_panics_on_random_bytes() {
use proptest::prelude::*;
let test_strategy = prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
// Ensure the input starts with '/' to trigger name lexing
bytes.insert(0, b'/');
bytes
});
let test_strategy =
prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
// Ensure the input starts with '/' to trigger name lexing
bytes.insert(0, b'/');
bytes
});
proptest!(|(bytes in test_strategy)| {
// This should never panic
@ -1981,10 +2061,11 @@ mod tests {
fn name_proptest_always_produces_valid_token() {
use proptest::prelude::*;
let test_strategy = prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
bytes.insert(0, b'/');
bytes
});
let test_strategy =
prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
bytes.insert(0, b'/');
bytes
});
proptest!(|(bytes in test_strategy)| {
let mut lexer = Lexer::new(&bytes);
@ -2142,7 +2223,9 @@ mod tests {
assert!(matches!(token, Some(Token::Integer(0)) | Some(Token::Null)));
let diags = lexer.take_diagnostics();
assert!(!diags.is_empty());
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidNumber));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidNumber));
}
#[test]
@ -2159,10 +2242,15 @@ mod tests {
let mut lexer = Lexer::new(b"1.2.3");
let token = lexer.next_token();
// Should consume up to second dot and emit diagnostic
assert!(matches!(token, Some(Token::Integer(0)) | Some(Token::Real(_))));
assert!(matches!(
token,
Some(Token::Integer(0)) | Some(Token::Real(_))
));
let diags = lexer.take_diagnostics();
assert!(!diags.is_empty());
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidNumber));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidNumber));
}
#[test]
@ -2173,7 +2261,9 @@ mod tests {
assert!(matches!(token, Some(Token::Integer(0)) | Some(Token::Null)));
let diags = lexer.take_diagnostics();
assert!(!diags.is_empty());
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidNumber));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidNumber));
}
#[test]
@ -2191,16 +2281,20 @@ mod tests {
use proptest::prelude::*;
// Generate random byte sequences starting with numeric characters
let test_strategy = prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
// Ensure the input starts with a numeric-start character (+, -, ., 0-9)
if bytes.is_empty() {
bytes.push(b'1');
} else {
let numeric_starts = [b'+', b'-', b'.', b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8', b'9'];
bytes[0] = numeric_starts[bytes[0] as usize % numeric_starts.len()];
}
bytes
});
let test_strategy =
prop::collection::vec(prop::num::u8::ANY, 0..1000).prop_map(|mut bytes| {
// Ensure the input starts with a numeric-start character (+, -, ., 0-9)
if bytes.is_empty() {
bytes.push(b'1');
} else {
let numeric_starts = [
b'+', b'-', b'.', b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8',
b'9',
];
bytes[0] = numeric_starts[bytes[0] as usize % numeric_starts.len()];
}
bytes
});
proptest!(|(bytes in test_strategy)| {
// This should never panic
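The keyword checks in this file all repeat one pattern: match a literal prefix, then require end of input, whitespace, or a delimiter immediately after it. The sketch below shows one way that pattern could be factored so the lookahead index and the consumed length both come from the keyword itself. `match_keyword` is a hypothetical helper, not part of the lexer, and the two byte classifiers are written out here from the PDF whitespace and delimiter tables rather than copied from the project.

```rust
/// PDF whitespace bytes: NUL, HT, LF, FF, CR, SP.
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// PDF delimiter bytes: ( ) < > [ ] { } / %
fn is_pdf_delimiter(b: u8) -> bool {
    matches!(
        b,
        b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%'
    )
}

/// Hypothetical helper: if `input` starts with `keyword` and the keyword is
/// followed by end of input, whitespace, or a delimiter, return the number
/// of bytes to consume (always `keyword.len()`).
fn match_keyword(input: &[u8], keyword: &[u8]) -> Option<usize> {
    if !input.starts_with(keyword) {
        return None;
    }
    match input.get(keyword.len()) {
        None => Some(keyword.len()),
        Some(&b) if is_pdf_whitespace(b) || is_pdf_delimiter(b) => Some(keyword.len()),
        Some(_) => None,
    }
}
```

Deriving both numbers from `keyword.len()` means a call site cannot pair a keyword with a mismatched hard-coded offset.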


@ -17,9 +17,9 @@
//!
//! Coverage = claimed_mcids / total_mcids
use crate::parser::object::PdfObject;
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::lexer::Lexer;
use crate::parser::object::PdfObject;
use std::collections::HashSet;
/// Result type for marked content operations.
@ -81,7 +81,8 @@ impl McidTracker {
/// Add a diagnostic.
fn emit_diagnostic(&mut self, code: DiagCode, message: String) {
self.diagnostics.push(Diagnostic::with_dynamic_no_offset(code, message));
self.diagnostics
.push(Diagnostic::with_dynamic_no_offset(code, message));
}
/// Get all diagnostics emitted during tracking.
@ -184,7 +185,11 @@ impl CoverageResult {
/// # Returns
///
/// A `CoverageResult` containing the coverage ratio and fallback decision.
pub fn compute_coverage(page_index: usize, total_mcids: usize, claimed_mcids: usize) -> CoverageResult {
pub fn compute_coverage(
page_index: usize,
total_mcids: usize,
claimed_mcids: usize,
) -> CoverageResult {
CoverageResult::new(page_index, total_mcids, claimed_mcids)
}
@ -412,7 +417,10 @@ mod tests {
assert_eq!(result.claimed_mcids, 0);
assert_eq!(result.coverage, 0.0);
assert!(result.should_fallback); // No MCIDs = fallback
assert!(result.fallback_diagnostic().unwrap().contains("no marked-content sequences"));
assert!(result
.fallback_diagnostic()
.unwrap()
.contains("no marked-content sequences"));
}
#[test]
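The module header defines coverage as `claimed_mcids / total_mcids`, and the test above shows that a page with no MCIDs reports coverage 0.0 and falls back. A small sketch of that arithmetic follows. The `Coverage` struct, the `coverage` function, and the `threshold` parameter are illustrative assumptions; the diff does not show how `CoverageResult::new` decides on fallback when the total is nonzero.

```rust
/// Illustrative stand-in for the crate's `CoverageResult`.
#[derive(Debug, PartialEq)]
struct Coverage {
    ratio: f64,
    should_fallback: bool,
}

/// Compute claimed / total, guarding the zero-total case.
/// `threshold` is an assumed cutoff below which the caller falls back.
fn coverage(total_mcids: usize, claimed_mcids: usize, threshold: f64) -> Coverage {
    if total_mcids == 0 {
        // No marked-content sequences on the page: nothing to cover.
        return Coverage {
            ratio: 0.0,
            should_fallback: true,
        };
    }
    let ratio = claimed_mcids as f64 / total_mcids as f64;
    Coverage {
        ratio,
        should_fallback: ratio < threshold,
    }
}
```

Handling the zero-total case first avoids a 0/0 division, which would otherwise produce NaN and make every later comparison false.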

View file

@ -8,12 +8,12 @@
//! - BDC /Tag <<props>> or BDC /Tag /PropName: begin marked content with properties
//! - EMC: end marked content (pop top frame)
use crate::parser::object::{PdfObject, ObjRef};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::marked_content_stack::{MarkedContentFrame, MarkedContentStack};
use crate::parser::object::{ObjRef, PdfObject};
use crate::parser::resources::ResourceDict;
use crate::parser::marked_content_stack::{MarkedContentStack, MarkedContentFrame};
use crate::diagnostics::{Diagnostic, DiagCode};
use std::sync::Arc;
use indexmap::IndexMap;
use std::sync::Arc;
/// Parse BMC operator (begin marked content).
///
@ -245,10 +245,9 @@ mod tests {
fn test_parse_bdc_with_property_name_found() {
let mut stack = MarkedContentStack::new();
let mut resources = ResourceDict::new();
resources.properties.insert(
Arc::from("MyProps"),
ObjRef::new(10, 0),
);
resources
.properties
.insert(Arc::from("MyProps"), ObjRef::new(10, 0));
// Property name resolution requires full resolver, so this returns None
assert!(parse_bdc(
@ -366,7 +365,12 @@ mod tests {
// Outer BDC with MCID
let mut props1 = IndexMap::new();
props1.insert(intern("/MCID"), PdfObject::Integer(1));
parse_bdc(&mut stack, Arc::from("P"), &PdfObject::Dict(Box::new(props1)), &ResourceDict::new());
parse_bdc(
&mut stack,
Arc::from("P"),
&PdfObject::Dict(Box::new(props1)),
&ResourceDict::new(),
);
// Inner BMC
parse_bmc(&mut stack, Arc::from("Span"));
@ -400,7 +404,12 @@ mod tests {
let mut props = IndexMap::new();
props.insert(intern("/MCID"), PdfObject::Integer(5));
parse_bdc(&mut stack, Arc::from("/P"), &PdfObject::Dict(Box::new(props)), &ResourceDict::new());
parse_bdc(
&mut stack,
Arc::from("/P"),
&PdfObject::Dict(Box::new(props)),
&ResourceDict::new(),
);
assert_eq!(stack.depth(), 1);
assert_eq!(stack.innermost_frame().unwrap().tag, "/P");

View file

@ -6,7 +6,7 @@
//! Per PDF spec section 14.5, the marked-content stack is independent of the
//! graphics state stack — q/Q operators do not affect it.
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
/// Maximum depth of marked-content stack (prevents stack overflow).
const MAX_MC_DEPTH: usize = 64;
@ -73,7 +73,11 @@ impl MarkedContentStack {
if self.stack.len() >= MAX_MC_DEPTH {
self.diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::MarkedContentDepthExceeded,
format!("Marked-content stack depth {} exceeds limit {}", self.stack.len() + 1, MAX_MC_DEPTH),
format!(
"Marked-content stack depth {} exceeds limit {}",
self.stack.len() + 1,
MAX_MC_DEPTH
),
));
false
} else {
@ -89,7 +93,11 @@ impl MarkedContentStack {
if self.stack.len() >= MAX_MC_DEPTH {
self.diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::MarkedContentDepthExceeded,
format!("Marked-content stack depth {} exceeds limit {}", self.stack.len() + 1, MAX_MC_DEPTH),
format!(
"Marked-content stack depth {} exceeds limit {}",
self.stack.len() + 1,
MAX_MC_DEPTH
),
));
false
} else {
@ -117,9 +125,7 @@ impl MarkedContentStack {
///
/// Returns the MCID of the topmost frame that has one.
pub fn innermost_mcid(&self) -> Option<u32> {
self.stack.iter()
.rev()
.find_map(|frame| frame.mcid)
self.stack.iter().rev().find_map(|frame| frame.mcid)
}
/// Get the innermost (top) frame, if any.
@ -247,7 +253,10 @@ mod tests {
assert!(!stack.push_bmc("overflow".to_string()));
assert_eq!(stack.depth(), MAX_MC_DEPTH);
assert!(!stack.diagnostics().is_empty());
assert_eq!(stack.diagnostics().last().unwrap().code, DiagCode::MarkedContentDepthExceeded);
assert_eq!(
stack.diagnostics().last().unwrap().code,
DiagCode::MarkedContentDepthExceeded
);
}
#[test]
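The stack in this file refuses pushes once it holds `MAX_MC_DEPTH` frames, and it resolves the innermost MCID by scanning from the top for the first frame that carries one. The sketch below reproduces those two behaviours in isolation. `BoundedStack`, `Frame`, and the `rejected` counter are hypothetical simplifications; the real type records a `Diagnostic` instead of a counter.

```rust
/// Same limit as the crate's MAX_MC_DEPTH.
const MAX_DEPTH: usize = 64;

/// Simplified frame: a tag plus an optional MCID.
struct Frame {
    tag: String,
    mcid: Option<u32>,
}

/// Hypothetical stand-in for `MarkedContentStack`.
#[derive(Default)]
struct BoundedStack {
    frames: Vec<Frame>,
    /// Count of rejected pushes (the real stack emits a diagnostic instead).
    rejected: usize,
}

impl BoundedStack {
    /// Push a frame; returns false and leaves the stack unchanged at the limit.
    fn push(&mut self, tag: &str, mcid: Option<u32>) -> bool {
        if self.frames.len() >= MAX_DEPTH {
            self.rejected += 1;
            return false;
        }
        self.frames.push(Frame {
            tag: tag.to_string(),
            mcid,
        });
        true
    }

    /// MCID of the topmost frame that has one, searching top-down.
    fn innermost_mcid(&self) -> Option<u32> {
        self.frames.iter().rev().find_map(|f| f.mcid)
    }

    /// Tag of the top frame, if any.
    fn innermost_tag(&self) -> Option<&str> {
        self.frames.last().map(|f| f.tag.as_str())
    }
}
```

Searching top-down means a `BMC` nested inside a `BDC` with an MCID still resolves to the enclosing MCID, which matches the nested test case in the operators module.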


@ -2,49 +2,50 @@
//!
//! This module provides the lexer and object parser for reading PDF documents.
pub mod catalog;
pub mod diagnostic;
pub mod lexer;
pub mod marked_content;
pub mod marked_content_operators;
pub mod marked_content_stack;
pub mod object;
pub mod objstm;
pub mod xref;
pub mod catalog;
pub mod stream;
pub mod secrets;
pub mod pages;
pub mod outline;
pub mod resources;
pub mod ocg;
pub mod outline;
pub mod pages;
pub mod resources;
pub mod secrets;
pub mod stream;
pub mod struct_tree;
pub mod marked_content;
pub mod marked_content_stack;
pub mod marked_content_operators;
pub mod xref;
// Re-export from the unified diagnostics module (Phase 1.6)
pub use crate::diagnostics::{Diagnostic, Severity, DiagCode, ObjRef};
pub use object::{PdfObject};
pub use objstm::{ObjectStmParser, ObjStmCacheEntry, ObjStmResult, ObjStmError};
pub use xref::{
XrefResolver, XrefEntry, ResolveError, ResolveResult, XrefSection,
parse_traditional_xref, parse_xref_stream, merge_hybrid, is_hybrid_trailer,
LinearizationInfo, detect_linearization, load_xref_linearized, merge_linearized_xrefs,
load_xref_with_prev_chain,
};
pub use catalog::{Catalog, MarkInfo, PageLabel, PageLabelsTree, PageLabelStyle, ReadingOrderAlgorithm, parse_catalog};
pub use ocg::{OcProperties, OcGroup, Ocmd, OcmdPolicy, BaseState, parse_oc_properties};
pub use resources::{ResourceDict, merge_resources, extract_resources};
pub use pages::{PageDict, flatten_page_tree, DEFAULT_MEDIABOX};
pub use struct_tree::{
StructureType, StructElemNode, StructTreeRoot, RoleMap, Kid,
BlockKind, MappingResult, ParentTreeResolver, ParentTreeEntry,
parse_struct_tree, structure_type_to_block_kind, map_element_to_block, is_artifact,
check_coverage_for_pages, CoverageCheckResult,
pub use crate::diagnostics::{DiagCode, Diagnostic, ObjRef, Severity};
pub use catalog::{
parse_catalog, Catalog, MarkInfo, PageLabel, PageLabelStyle, PageLabelsTree,
ReadingOrderAlgorithm,
};
pub use marked_content::{
McidTracker, CoverageResult, compute_coverage, compute_coverage_from_sets,
compute_coverage, compute_coverage_from_sets, CoverageResult, McidTracker,
};
pub use marked_content_operators::{parse_bdc, parse_bmc, parse_emc};
pub use marked_content_stack::{MarkedContentFrame, MarkedContentStack};
pub use marked_content_operators::{parse_bmc, parse_bdc, parse_emc};
pub use object::PdfObject;
pub use objstm::{ObjStmCacheEntry, ObjStmError, ObjStmResult, ObjectStmParser};
pub use ocg::{parse_oc_properties, BaseState, OcGroup, OcProperties, Ocmd, OcmdPolicy};
pub use pages::{flatten_page_tree, PageDict, DEFAULT_MEDIABOX};
pub use resources::{extract_resources, merge_resources, ResourceDict};
pub use stream::{
StreamDecoder, FlateDecoder, ASCII85Decoder, ASCIIHexDecoder, CryptDecoder, PassthroughDecoder,
normalize_filter_name, get_decoder, FilterError, DEFAULT_MAX_DECOMPRESS_BYTES,
get_decoder, normalize_filter_name, ASCII85Decoder, ASCIIHexDecoder, CryptDecoder, FilterError,
FlateDecoder, PassthroughDecoder, StreamDecoder, DEFAULT_MAX_DECOMPRESS_BYTES,
};
pub use struct_tree::{
check_coverage_for_pages, is_artifact, map_element_to_block, parse_struct_tree,
structure_type_to_block_kind, BlockKind, CoverageCheckResult, Kid, MappingResult,
ParentTreeEntry, ParentTreeResolver, RoleMap, StructElemNode, StructTreeRoot, StructureType,
};
pub use xref::{
detect_linearization, is_hybrid_trailer, load_xref_linearized, load_xref_with_prev_chain,
merge_hybrid, merge_linearized_xrefs, parse_traditional_xref, parse_xref_stream,
LinearizationInfo, ResolveError, ResolveResult, XrefEntry, XrefResolver, XrefSection,
};


@ -2,8 +2,8 @@
//!
//! This module defines the core PDF object types and the object reference type.
pub mod types;
pub mod parser;
pub mod types;
pub use types::{ObjRef, PdfObject, PdfDict, PdfStream, PdfIndirect, intern};
pub use parser::ObjectParser;
pub use types::{intern, ObjRef, PdfDict, PdfIndirect, PdfObject, PdfStream};


@ -3,9 +3,9 @@
//! This module provides the parser that converts tokens from the lexer
//! into PDF objects.
use super::types::{intern, ObjRef, PdfDict, PdfObject, PdfStream, PdfIndirect};
use super::types::{intern, ObjRef, PdfDict, PdfIndirect, PdfObject, PdfStream};
use crate::diagnostics::{DiagCode, Diagnostic as Diag};
use crate::parser::lexer::{Lexer, Token};
use crate::diagnostics::{Diagnostic as Diag, DiagCode};
/// Maximum nesting depth for dictionaries and arrays.
///
@ -233,7 +233,10 @@ impl<'a> ObjectParser<'a> {
// Missing value - insert PdfNull
self.diagnostics.push(Diag::with_dynamic_no_offset(
DiagCode::StructInvalidDictValue,
format!("Dictionary key '{}' has no value, inserting null", key),
format!(
"Dictionary key '{}' has no value, inserting null",
key
),
));
dict.insert(key, PdfObject::Null);
break; // End of dict
@ -258,7 +261,10 @@ impl<'a> ObjectParser<'a> {
));
// Skip the invalid token and the next token (would-be value)
let _ = self.lexer.next_token();
if !matches!(self.lexer.peek_token(), Some(Token::DictEnd) | Some(Token::Eof) | None) {
if !matches!(
self.lexer.peek_token(),
Some(Token::DictEnd) | Some(Token::Eof) | None
) {
let _ = self.lexer.next_token();
}
expecting_key = true;
@ -281,13 +287,18 @@ impl<'a> ObjectParser<'a> {
let offset = self.lexer.position();
// Try to get /Length from the dict
let len_hint = dict.get("Length").and_then(|obj| obj.as_int()).map(|i| i as u64);
let len_hint = dict
.get("Length")
.and_then(|obj| obj.as_int())
.map(|i| i as u64);
// Skip the stream body
self.skip_stream_body(len_hint);
// Parse the stream object
return Some(PdfObject::Stream(Box::new(PdfStream::new(dict, offset, len_hint))));
return Some(PdfObject::Stream(Box::new(PdfStream::new(
dict, offset, len_hint,
))));
}
Some(PdfObject::Dict(Box::new(dict)))
@ -315,7 +326,10 @@ impl<'a> ObjectParser<'a> {
if actual_skipped < len_usize {
self.diagnostics.push(Diag::with_dynamic_no_offset(
DiagCode::StructUnexpectedEof,
format!("Stream truncated at EOF: expected {} bytes, got {}", len, actual_skipped),
format!(
"Stream truncated at EOF: expected {} bytes, got {}",
len, actual_skipped
),
));
}
} else {
@ -337,7 +351,10 @@ impl<'a> ObjectParser<'a> {
Some(other) => {
self.diagnostics.push(Diag::with_dynamic_no_offset(
DiagCode::StructUnexpectedByte,
format!("Expected endstream keyword after stream body, found {:?}", other),
format!(
"Expected endstream keyword after stream body, found {:?}",
other
),
));
// Try to recover by scanning forward for EndStream
self.scan_to_endstream();
@ -639,7 +656,10 @@ impl<'a> ObjectParser<'a> {
}
// Now we're at the end of the first integer (object number)
// Skip the digits of the object number (and optional minus sign)
while scan_back > 0 && (remaining[scan_back - 1].is_ascii_digit() || remaining[scan_back - 1] == b'-') {
while scan_back > 0
&& (remaining[scan_back - 1].is_ascii_digit()
|| remaining[scan_back - 1] == b'-')
{
scan_back -= 1;
}
// scan_back now points to the start of the object number
@ -738,11 +758,14 @@ mod tests {
fn test_parse_array_of_integers() {
let mut parser = ObjectParser::new(b"[ 1 2 3 ]");
let obj = parser.parse_direct_object();
assert_eq!(obj, Some(PdfObject::Array(Box::new(vec![
PdfObject::Integer(1),
PdfObject::Integer(2),
PdfObject::Integer(3),
]))));
assert_eq!(
obj,
Some(PdfObject::Array(Box::new(vec![
PdfObject::Integer(1),
PdfObject::Integer(2),
PdfObject::Integer(3),
])))
);
}
#[test]
@ -825,7 +848,9 @@ mod tests {
assert_eq!(dict.len(), 1);
assert_eq!(dict.get("Type"), Some(&PdfObject::Null));
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidDictValue));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidDictValue));
} else {
panic!("Expected dict, got {:?}", obj);
}
@ -838,7 +863,9 @@ mod tests {
if let Some(PdfObject::Dict(dict)) = obj {
assert_eq!(dict.len(), 0);
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidDictKey));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidDictKey));
} else {
panic!("Expected dict, got {:?}", obj);
}
@ -925,7 +952,9 @@ mod tests {
// Should have emitted STRUCT_DEPTH_EXCEEDED diagnostic
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructDepthExceeded));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructDepthExceeded));
}
#[test]
@ -950,7 +979,9 @@ mod tests {
// Should have emitted STRUCT_INVALID_DICT_VALUE diagnostic for missing value
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidDictValue));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidDictValue));
}
#[test]
@ -961,7 +992,9 @@ mod tests {
// Should return PdfNull with diagnostic
assert_eq!(obj, Some(PdfObject::Null));
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidIndirectHeader));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidIndirectHeader));
}
#[test]
@ -997,7 +1030,11 @@ mod tests {
Just("true".to_string()),
Just("false".to_string()),
any::<i64>().prop_map(|n| n.to_string()),
any::<f64>().prop_map(|f| if f.is_finite() { f.to_string() } else { "0.0".to_string() }),
any::<f64>().prop_map(|f| if f.is_finite() {
f.to_string()
} else {
"0.0".to_string()
}),
// Names
"[a-zA-Z]{1,10}".prop_map(|s| format!("/{}", s)),
// Strings
@ -1108,7 +1145,9 @@ mod tests {
// Should have emitted STRUCT_INTEGER_OVERFLOW diagnostic
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructIntegerOverflow));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructIntegerOverflow));
}
#[test]
@ -1123,7 +1162,9 @@ mod tests {
// Should have emitted STRUCT_INTEGER_OVERFLOW diagnostic
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructIntegerOverflow));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructIntegerOverflow));
}
#[test]
@ -1137,7 +1178,9 @@ mod tests {
// Should have emitted STRUCT_INVALID_INDIRECT_HEADER diagnostic
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidIndirectHeader));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidIndirectHeader));
}
#[test]
@ -1150,7 +1193,9 @@ mod tests {
// Should have emitted STRUCT_INVALID_INDIRECT_HEADER diagnostic
let diags = parser.take_diagnostics();
assert!(diags.iter().any(|d| d.code == DiagCode::StructInvalidIndirectHeader));
assert!(diags
.iter()
.any(|d| d.code == DiagCode::StructInvalidIndirectHeader));
}
#[test]
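The stream path in this file turns the `/Length` integer into a `u64` hint with `as u64`. That cast wraps negative values into very large lengths. As a hedged alternative, the sketch below uses a checked conversion so a negative `/Length` yields no hint at all. `length_hint` is a hypothetical free function, not the parser's code, and whether the parser should drop or diagnose negative lengths is a design choice this diff does not settle.

```rust
use std::convert::TryFrom;

/// Convert a raw /Length integer into a length hint.
/// Negative values produce None instead of wrapping to a huge u64.
fn length_hint(raw: Option<i64>) -> Option<u64> {
    raw.and_then(|n| u64::try_from(n).ok())
}
```

For comparison, `-1i64 as u64` evaluates to `u64::MAX`, which is the value the checked conversion avoids.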


@ -126,7 +126,11 @@ impl PdfStream {
/// Create a new stream.
#[inline]
pub fn new(dict: PdfDict, offset: u64, len_hint: Option<u64>) -> Self {
Self { dict, offset, len_hint }
Self {
dict,
offset,
len_hint,
}
}
/// Get the /Filter entry from the stream dictionary.
@ -149,16 +153,18 @@ impl PdfStream {
}
PdfObject::Array(arr) => arr
.iter()
.filter_map(|obj| obj.as_name().map(|n| {
// Strip leading slash from filter name for normalization
let name_str: &str = n.as_ref();
let stripped = if name_str.starts_with('/') {
&name_str[1..]
} else {
name_str
};
stripped.to_string()
}))
.filter_map(|obj| {
obj.as_name().map(|n| {
// Strip leading slash from filter name for normalization
let name_str: &str = n.as_ref();
let stripped = if name_str.starts_with('/') {
&name_str[1..]
} else {
name_str
};
stripped.to_string()
})
})
.collect(),
_ => return None,
})
@ -521,7 +527,10 @@ mod tests {
let obj = PdfObject::Dict(Box::new(dict.clone()));
assert!(obj.as_dict().is_some());
assert_eq!(obj.as_dict().unwrap().get("Type").unwrap().as_name(), Some("Page"));
assert_eq!(
obj.as_dict().unwrap().get("Type").unwrap().as_name(),
Some("Page")
);
assert_eq!(PdfObject::Integer(42).as_dict(), None);
}
@ -544,7 +553,11 @@ mod tests {
#[test]
fn test_as_array() {
let arr = vec![PdfObject::Integer(1), PdfObject::Integer(2), PdfObject::Integer(3)];
let arr = vec![
PdfObject::Integer(1),
PdfObject::Integer(2),
PdfObject::Integer(3),
];
let obj = PdfObject::Array(Box::new(arr.clone()));
assert!(obj.as_array().is_some());
@ -639,7 +652,10 @@ mod tests {
fn test_pdf_object_indirect_variant() {
let obj_ref = ObjRef::new(5, 1);
let inner = PdfObject::Name(intern("Test"));
let indirect = PdfIndirect { id: obj_ref, obj: inner };
let indirect = PdfIndirect {
id: obj_ref,
obj: inner,
};
let obj = PdfObject::Indirect(Box::new(indirect));
assert!(obj.as_indirect().is_some());


@ -29,9 +29,9 @@
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};
use crate::parser::object::{ObjRef, PdfObject, PdfStream, ObjectParser};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::object::{ObjRef, ObjectParser, PdfObject, PdfStream};
use crate::parser::stream::{decode_stream, ExtractionOptions, PdfSource};
use crate::diagnostics::{Diagnostic, DiagCode};
/// Maximum depth for `/Extends` chain to prevent adversarial deep chains.
const MAX_EXTENDS_DEPTH: u8 = 16;
@ -58,9 +58,15 @@ impl std::fmt::Display for ObjStmError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
ObjStmError::MissingKey { key } => write!(f, "Missing required key: {}", key),
ObjStmError::InvalidFormat { msg } => write!(f, "Invalid object stream format: {}", msg),
ObjStmError::CircularRef { obj_ref } => write!(f, "Circular reference in /Extends chain at {}", obj_ref),
ObjStmError::DepthExceeded { max } => write!(f, "Extends chain depth exceeded (max {})", max),
ObjStmError::InvalidFormat { msg } => {
write!(f, "Invalid object stream format: {}", msg)
}
ObjStmError::CircularRef { obj_ref } => {
write!(f, "Circular reference in /Extends chain at {}", obj_ref)
}
ObjStmError::DepthExceeded { max } => {
write!(f, "Extends chain depth exceeded (max {})", max)
}
ObjStmError::DecompressionFailed => write!(f, "Stream decompression failed"),
}
}
@ -184,13 +190,11 @@ impl ObjectStmParser {
// Load the object stream
let stream = match resolve_fn(host_objstm_ref) {
Some(s) => s,
None => return PdfObject::Null, // Not found
None => return PdfObject::Null, // Not found
};
// Create a wrapper that handles the recursion properly
let resolve_wrapper = |ref_obj: ObjRef| -> Option<PdfStream> {
resolve_fn(ref_obj)
};
let resolve_wrapper = |ref_obj: ObjRef| -> Option<PdfStream> { resolve_fn(ref_obj) };
match self.load_object_stream_impl(
host_objstm_ref,
@ -207,15 +211,13 @@ impl ObjectStmParser {
}
// Return the requested object by 0-based index
entry.get(embedded_index as usize)
entry
.get(embedded_index as usize)
.map(|(_, obj)| obj.clone())
.unwrap_or(PdfObject::Null)
}
Err(e) => {
self.emit_diagnostic(
e.diag_code(),
format!("Object stream error: {}", e),
);
self.emit_diagnostic(e.diag_code(), format!("Object stream error: {}", e));
PdfObject::Null
}
}
@ -257,9 +259,7 @@ impl ObjectStmParser {
}
// Create a wrapper that handles the recursion properly
let resolve_wrapper = |ref_obj: ObjRef| -> Option<PdfStream> {
resolve_fn(ref_obj)
};
let resolve_wrapper = |ref_obj: ObjRef| -> Option<PdfStream> { resolve_fn(ref_obj) };
match self.load_object_stream_impl(
obj_stm_ref,
@ -302,12 +302,17 @@ impl ObjectStmParser {
// Check for circular reference
if in_progress.contains(&obj_stm_ref) {
return Err(ObjStmError::CircularRef { obj_ref: obj_stm_ref });
return Err(ObjStmError::CircularRef {
obj_ref: obj_stm_ref,
});
}
// Check cache first
{
let cache = self.cache.read().map_err(|_| ObjStmError::DecompressionFailed)?;
let cache = self
.cache
.read()
.map_err(|_| ObjStmError::DecompressionFailed)?;
if let Some(cached) = cache.get(&obj_stm_ref) {
// Return the cached Arc directly (no clone)
return Ok(cached.clone());
@ -323,7 +328,9 @@ impl ObjectStmParser {
let n = stream_dict
.get("/N")
.and_then(|obj| obj.as_int())
.ok_or_else(|| ObjStmError::MissingKey { key: "/N".to_string() })? as u32;
.ok_or_else(|| ObjStmError::MissingKey {
key: "/N".to_string(),
})? as u32;
let first = stream_dict
.get("/First")
@ -344,7 +351,11 @@ impl ObjectStmParser {
}
#[cfg(test)]
eprintln!("DEBUG: decompressed {} bytes, first: {:?}", decompressed.len(), decompressed.get(0..20));
eprintln!(
"DEBUG: decompressed {} bytes, first: {:?}",
decompressed.len(),
decompressed.get(0..20)
);
if decompressed.is_empty() {
in_progress.remove(&obj_stm_ref);
@ -356,7 +367,11 @@ impl ObjectStmParser {
in_progress.remove(&obj_stm_ref);
self.emit_diagnostic(
DiagCode::StructInvalidObjstm,
format!("ObjStm /First offset {} exceeds decompressed size {}", first, decompressed.len()),
format!(
"ObjStm /First offset {} exceeds decompressed size {}",
first,
decompressed.len()
),
);
return Ok(Arc::new(Vec::new()));
}
@ -421,7 +436,10 @@ impl ObjectStmParser {
let remaining = &decompressed[obj_start..];
#[cfg(test)]
eprintln!("DEBUG: Parsing object {} at offset {}, remaining bytes: {:?}", obj_number, obj_start, remaining);
eprintln!(
"DEBUG: Parsing object {} at offset {}, remaining bytes: {:?}",
obj_number, obj_start, remaining
);
let mut obj_parser = ObjectParser::new(remaining);
@ -478,12 +496,16 @@ impl ObjectStmParser {
Err(ObjStmError::CircularRef { .. }) => {
// Propagate circular reference errors
in_progress.remove(&obj_stm_ref);
return Err(ObjStmError::CircularRef { obj_ref: extends_ref });
return Err(ObjStmError::CircularRef {
obj_ref: extends_ref,
});
}
Err(ObjStmError::DepthExceeded { .. }) => {
// Propagate depth exceeded errors
in_progress.remove(&obj_stm_ref);
return Err(ObjStmError::DepthExceeded { max: MAX_EXTENDS_DEPTH });
return Err(ObjStmError::DepthExceeded {
max: MAX_EXTENDS_DEPTH,
});
}
Err(_) => {
// Failed to parse parent - just use our objects
@ -594,7 +616,10 @@ mod tests {
dict.insert(intern("/N"), PdfObject::Integer(2));
dict.insert(intern("/First"), PdfObject::Integer(header.len() as i64));
dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
dict.insert(intern("/Length"), PdfObject::Integer(compressed.len() as i64));
dict.insert(
intern("/Length"),
PdfObject::Integer(compressed.len() as i64),
);
// Create a PdfStream with the dict and offset 0 (for MemorySource)
let stream = PdfStream::new(dict.clone(), 0, Some(compressed.len() as u64));
@ -606,18 +631,13 @@ mod tests {
// Mock resolve function that returns the stream
let obj_stm_ref = ObjRef::new(10, 0);
let stream_clone = stream.clone();
let result = parser.load_object_stream(
obj_stm_ref,
&stream,
&source,
move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(stream_clone.clone())
} else {
None
}
},
);
let result = parser.load_object_stream(obj_stm_ref, &stream, &source, move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(stream_clone.clone())
} else {
None
}
});
assert!(result.is_ok());
let entry = result.unwrap();
@ -706,7 +726,10 @@ mod tests {
dict.insert(intern("/N"), PdfObject::Integer(10));
dict.insert(intern("/First"), PdfObject::Integer(first as i64));
dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
dict.insert(intern("/Length"), PdfObject::Integer(compressed.len() as i64));
dict.insert(
intern("/Length"),
PdfObject::Integer(compressed.len() as i64),
);
// Create a PdfStream with the dict and offset 0 (for MemorySource)
let stream = PdfStream::new(dict.clone(), 0, Some(compressed.len() as u64));
@ -716,18 +739,13 @@ mod tests {
let obj_stm_ref = ObjRef::new(10, 0);
let stream_clone = stream.clone();
let result = parser.load_object_stream(
obj_stm_ref,
&stream,
&source,
move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(stream_clone.clone())
} else {
None
}
},
);
let result = parser.load_object_stream(obj_stm_ref, &stream, &source, move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(stream_clone.clone())
} else {
None
}
});
assert!(result.is_ok());
let entry = result.unwrap();
@ -754,12 +772,7 @@ mod tests {
let source = MemorySource::new(vec![0u8; 100]);
let parser = ObjectStmParser::default();
let result = parser.load_object_stream(
ObjRef::new(1, 0),
&stream,
&source,
|_| None,
);
let result = parser.load_object_stream(ObjRef::new(1, 0), &stream, &source, |_| None);
assert!(matches!(result, Err(ObjStmError::MissingKey { key }) if key == "/N"));
}
@ -773,12 +786,7 @@ mod tests {
let source = MemorySource::new(vec![0u8; 100]);
let parser = ObjectStmParser::default();
let result = parser.load_object_stream(
ObjRef::new(1, 0),
&stream,
&source,
|_| None,
);
let result = parser.load_object_stream(ObjRef::new(1, 0), &stream, &source, |_| None);
assert!(matches!(result, Err(ObjStmError::MissingKey { key }) if key == "/First"));
}
@ -799,18 +807,13 @@ mod tests {
// Mock resolve function that returns the same stream (circular reference)
let self_ref = ObjRef::new(1, 0);
let stream_clone = stream.clone();
let result = parser.load_object_stream(
self_ref,
&stream,
&source,
move |ref_obj| {
if ref_obj == self_ref {
Some(stream_clone.clone())
} else {
None
}
},
);
let result = parser.load_object_stream(self_ref, &stream, &source, move |ref_obj| {
if ref_obj == self_ref {
Some(stream_clone.clone())
} else {
None
}
});
assert!(matches!(result, Err(ObjStmError::CircularRef { .. })));
}
@ -838,7 +841,10 @@ mod tests {
dict.insert(intern("/N"), PdfObject::Integer(2));
dict.insert(intern("/First"), PdfObject::Integer(header.len() as i64));
dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
dict.insert(intern("/Length"), PdfObject::Integer(compressed.len() as i64));
dict.insert(
intern("/Length"),
PdfObject::Integer(compressed.len() as i64),
);
let stream = PdfStream::new(dict.clone(), 0, Some(compressed.len() as u64));
@ -849,18 +855,13 @@ mod tests {
let stream_clone = stream.clone();
// First call - should load and cache
let result1 = parser.load_object_stream(
obj_stm_ref,
&stream,
&source,
move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(stream_clone.clone())
} else {
None
}
},
);
let result1 = parser.load_object_stream(obj_stm_ref, &stream, &source, move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(stream_clone.clone())
} else {
None
}
});
assert!(result1.is_ok());
let entry1 = result1.unwrap();
@ -893,9 +894,15 @@ mod tests {
let mut parent_dict = PdfDict::new();
parent_dict.insert(intern("/Type"), PdfObject::Name(intern("/ObjStm")));
parent_dict.insert(intern("/N"), PdfObject::Integer(3));
parent_dict.insert(intern("/First"), PdfObject::Integer(parent_header.len() as i64));
parent_dict.insert(
intern("/First"),
PdfObject::Integer(parent_header.len() as i64),
);
parent_dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
parent_dict.insert(intern("/Length"), PdfObject::Integer(parent_compressed.len() as i64));
parent_dict.insert(
intern("/Length"),
PdfObject::Integer(parent_compressed.len() as i64),
);
// Create child ObjStm (objects 4-5) that extends parent
let child_header = b"4 0 5 4";
@ -913,9 +920,15 @@ mod tests {
let mut child_dict = PdfDict::new();
child_dict.insert(intern("/Type"), PdfObject::Name(intern("/ObjStm")));
child_dict.insert(intern("/N"), PdfObject::Integer(2));
child_dict.insert(intern("/First"), PdfObject::Integer(child_header.len() as i64));
child_dict.insert(
intern("/First"),
PdfObject::Integer(child_header.len() as i64),
);
child_dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
child_dict.insert(intern("/Length"), PdfObject::Integer(child_compressed.len() as i64));
child_dict.insert(
intern("/Length"),
PdfObject::Integer(child_compressed.len() as i64),
);
child_dict.insert(intern("/Extends"), PdfObject::Ref(parent_ref));
let parser = ObjectStmParser::default();
@ -927,29 +940,16 @@ mod tests {
let parent_dict_clone = parent_dict.clone();
let child_stream = PdfStream::new(child_dict_clone.clone(), 0, None);
let result = parser.load_object_stream(
child_ref,
&child_stream,
&source,
move |ref_obj| {
if ref_obj == parent_ref {
// Return parent stream
Some(PdfStream::new(
parent_dict_clone.clone(),
0,
None,
))
} else if ref_obj == child_ref {
Some(PdfStream::new(
child_dict_clone.clone(),
0,
None,
))
} else {
None
}
},
);
let result = parser.load_object_stream(child_ref, &child_stream, &source, move |ref_obj| {
if ref_obj == parent_ref {
// Return parent stream
Some(PdfStream::new(parent_dict_clone.clone(), 0, None))
} else if ref_obj == child_ref {
Some(PdfStream::new(child_dict_clone.clone(), 0, None))
} else {
None
}
});
// The test may not fully work due to source limitations,
// but it verifies the /Extends handling doesn't crash
@ -979,7 +979,10 @@ mod tests {
dict.insert(intern("/N"), PdfObject::Integer(2));
dict.insert(intern("/First"), PdfObject::Integer(header.len() as i64));
dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
dict.insert(intern("/Length"), PdfObject::Integer(compressed.len() as i64));
dict.insert(
intern("/Length"),
PdfObject::Integer(compressed.len() as i64),
);
let source = MemorySource::new(compressed);
let parser = ObjectStmParser::default();
@ -1053,7 +1056,10 @@ mod tests {
dict.insert(intern("/N"), PdfObject::Integer(3));
dict.insert(intern("/First"), PdfObject::Integer(header.len() as i64));
dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
dict.insert(intern("/Length"), PdfObject::Integer(compressed.len() as i64));
dict.insert(
intern("/Length"),
PdfObject::Integer(compressed.len() as i64),
);
let source = MemorySource::new(compressed);
let parser = ObjectStmParser::default();
@ -1061,22 +1067,13 @@ mod tests {
let obj_stm_ref = ObjRef::new(10, 0);
let dict_clone = dict.clone();
let stream = PdfStream::new(dict.clone(), 0, Some(compressed_len));
let result = parser.load_object_stream(
obj_stm_ref,
&stream,
&source,
move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(PdfStream::new(
dict_clone.clone(),
0,
Some(compressed_len),
))
} else {
None
}
},
);
let result = parser.load_object_stream(obj_stm_ref, &stream, &source, move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(PdfStream::new(dict_clone.clone(), 0, Some(compressed_len)))
} else {
None
}
});
// Should succeed with partial objects
assert!(result.is_ok());
@ -1121,7 +1118,10 @@ mod tests {
dict.insert(intern("/N"), PdfObject::Integer(2));
dict.insert(intern("/First"), PdfObject::Integer(header.len() as i64));
dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
dict.insert(intern("/Length"), PdfObject::Integer(compressed.len() as i64));
dict.insert(
intern("/Length"),
PdfObject::Integer(compressed.len() as i64),
);
// Create parser with very small decompression limit
let parser = ObjectStmParser::new(max_bytes);
@ -1130,22 +1130,13 @@ mod tests {
let obj_stm_ref = ObjRef::new(10, 0);
let dict_clone = dict.clone();
let stream = PdfStream::new(dict.clone(), 0, None);
let result = parser.load_object_stream(
obj_stm_ref,
&stream,
&source,
move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(PdfStream::new(
dict_clone.clone(),
0,
None,
))
} else {
None
}
},
);
let result = parser.load_object_stream(obj_stm_ref, &stream, &source, move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(PdfStream::new(dict_clone.clone(), 0, None))
} else {
None
}
});
// The result should be ok (we get what we can before hitting the limit)
// but diagnostics should be emitted
@ -1183,7 +1174,10 @@ mod tests {
dict.insert(intern("/N"), PdfObject::Integer(1));
dict.insert(intern("/First"), PdfObject::Integer(header.len() as i64));
dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
dict.insert(intern("/Length"), PdfObject::Integer(compressed.len() as i64));
dict.insert(
intern("/Length"),
PdfObject::Integer(compressed.len() as i64),
);
let source = MemorySource::new(compressed);
let parser = ObjectStmParser::default();
@ -1191,22 +1185,13 @@ mod tests {
let obj_stm_ref = ObjRef::new(10, 0);
let dict_clone = dict.clone();
let stream = PdfStream::new(dict.clone(), 0, None);
let result = parser.load_object_stream(
obj_stm_ref,
&stream,
&source,
move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(PdfStream::new(
dict_clone.clone(),
0,
None,
))
} else {
None
}
},
);
let result = parser.load_object_stream(obj_stm_ref, &stream, &source, move |ref_obj| {
if ref_obj == obj_stm_ref {
Some(PdfStream::new(dict_clone.clone(), 0, None))
} else {
None
}
});
assert!(result.is_ok());
let entry = result.unwrap();
@ -1238,7 +1223,10 @@ mod tests {
base_dict.insert(intern("/N"), PdfObject::Integer(1));
base_dict.insert(intern("/First"), PdfObject::Integer(header.len() as i64));
base_dict.insert(intern("/Filter"), PdfObject::Name(intern("/FlateDecode")));
base_dict.insert(intern("/Length"), PdfObject::Integer(compressed.len() as i64));
base_dict.insert(
intern("/Length"),
PdfObject::Integer(compressed.len() as i64),
);
// Create a chain of ObjStms where each extends the previous
// We'll create 18 dicts (0-17), each extending the previous
@ -1247,7 +1235,10 @@ mod tests {
let mut dict = base_dict.clone();
if i > 0 {
// This ObjStm extends the previous one
dict.insert(intern("/Extends"), PdfObject::Ref(ObjRef::new(100 + (i as u32) - 1, 0)));
dict.insert(
intern("/Extends"),
PdfObject::Ref(ObjRef::new(100 + (i as u32) - 1, 0)),
);
}
dicts.push(dict);
}
@ -1259,20 +1250,15 @@ mod tests {
let obj_stm_17_ref = ObjRef::new(117, 0);
let stream_17 = PdfStream::new(dicts[17].clone(), 0, None);
let result = parser.load_object_stream(
obj_stm_17_ref,
&stream_17,
&source,
|ref_obj| {
// Return a stream for any ref in the chain
if ref_obj.object >= 100 && ref_obj.object <= 117 {
let idx = (ref_obj.object - 100) as usize;
Some(PdfStream::new(dicts[idx].clone(), 0, None))
} else {
None
}
},
);
let result = parser.load_object_stream(obj_stm_17_ref, &stream_17, &source, |ref_obj| {
// Return a stream for any ref in the chain
if ref_obj.object >= 100 && ref_obj.object <= 117 {
let idx = (ref_obj.object - 100) as usize;
Some(PdfStream::new(dicts[idx].clone(), 0, None))
} else {
None
}
});
// Should fail with DepthExceeded
assert!(matches!(result, Err(ObjStmError::DepthExceeded { .. })));


@ -8,9 +8,9 @@
use std::collections::HashMap;
use crate::parser::{Diagnostic, DiagCode};
use crate::parser::object::{intern, ObjRef, PdfDict, PdfObject};
use crate::parser::xref::XrefResolver;
use crate::parser::{DiagCode, Diagnostic};
/// Base state for OCG visibility in the default configuration.
///
@ -102,15 +102,13 @@ impl Ocmd {
// Parse /OCGs (can be a single ref or an array)
let ocgs = match dict.get("OCGs") {
Some(PdfObject::Ref(ref_)) => vec![*ref_],
Some(PdfObject::Array(arr)) => arr
.iter()
.filter_map(|o| o.as_ref())
.collect(),
Some(PdfObject::Array(arr)) => arr.iter().filter_map(|o| o.as_ref()).collect(),
_ => return None,
};
// Parse /P (policy; defaults to AnyOn if absent per spec)
let policy = dict.get("P")
let policy = dict
.get("P")
.and_then(|o| o.as_name())
.and_then(OcmdPolicy::from_name)
.unwrap_or(OcmdPolicy::AnyOn);
@ -153,7 +151,8 @@ impl OcGroup {
// Parse /Name (required per spec, but we handle missing)
if let Some(name_obj) = dict.get("Name") {
group.name = name_obj.as_string()
group.name = name_obj
.as_string()
.or_else(|| name_obj.as_name().map(|s| s.as_bytes()))
.and_then(|bytes| String::from_utf8(bytes.to_vec()).ok());
}
@ -245,7 +244,8 @@ impl OcProperties {
/// Evaluate an OCMD policy against current OCG states.
fn evaluate_ocmd_policy(&self, ocmd: &Ocmd) -> bool {
let ocg_states: Vec<bool> = ocmd.ocgs
let ocg_states: Vec<bool> = ocmd
.ocgs
.iter()
.map(|&ref_| self.is_visible(ref_))
.collect();
@ -279,10 +279,7 @@ impl Default for OcProperties {
/// # Returns
/// An `OcProperties` struct containing the parsed OCG information.
/// If `oc_props_ref` is None, returns `OcProperties::not_present()`.
pub fn parse_oc_properties(
resolver: &XrefResolver,
oc_props_ref: Option<ObjRef>,
) -> OcProperties {
pub fn parse_oc_properties(resolver: &XrefResolver, oc_props_ref: Option<ObjRef>) -> OcProperties {
let oc_props_ref = match oc_props_ref {
Some(r) => r,
None => return OcProperties::not_present(),
@ -316,7 +313,10 @@ pub fn parse_oc_properties(
None => {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructUnexpectedEof,
format!("/OCProperties is not a dictionary (type: {})", oc_props_obj.type_name()),
format!(
"/OCProperties is not a dictionary (type: {})",
oc_props_obj.type_name()
),
));
oc_properties.diagnostics = diagnostics;
return oc_properties;
@ -325,10 +325,7 @@ pub fn parse_oc_properties(
// Parse /OCGs array (required per spec)
let ocg_refs: Vec<ObjRef> = match oc_props_dict.get("OCGs") {
Some(PdfObject::Array(arr)) => arr
.iter()
.filter_map(|o| o.as_ref())
.collect(),
Some(PdfObject::Array(arr)) => arr.iter().filter_map(|o| o.as_ref()).collect(),
Some(other) => {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructUnexpectedEof,
@ -385,14 +382,17 @@ pub fn parse_oc_properties(
};
// Parse /BaseState (defaults to ON if absent)
oc_properties.base_state = default_config.get("BaseState")
oc_properties.base_state = default_config
.get("BaseState")
.and_then(|o| o.as_name())
.and_then(BaseState::from_name)
.unwrap_or(BaseState::On);
// Initialize all OCGs to base state
for &ocg_ref in &ocg_refs {
oc_properties.default_visibility.insert(ocg_ref, oc_properties.base_state.as_bool());
oc_properties
.default_visibility
.insert(ocg_ref, oc_properties.base_state.as_bool());
}
// Apply /ON array (overrides BaseState for these OCGs)
@ -433,7 +433,10 @@ mod tests {
fn make_test_ocg(obj_ref: ObjRef, name: &str, intent: Option<&str>) -> PdfObject {
let mut dict = PdfDict::new();
dict.insert(intern("Type"), PdfObject::Name(intern("OCG")));
dict.insert(intern("Name"), PdfObject::String(Box::new(name.as_bytes().to_vec())));
dict.insert(
intern("Name"),
PdfObject::String(Box::new(name.as_bytes().to_vec())),
);
if let Some(i) = intent {
dict.insert(intern("Intent"), PdfObject::Name(intern(i)));
}
@ -444,7 +447,10 @@ mod tests {
fn test_base_state_from_name() {
assert_eq!(BaseState::from_name("ON"), Some(BaseState::On));
assert_eq!(BaseState::from_name("OFF"), Some(BaseState::Off));
assert_eq!(BaseState::from_name("Unchanged"), Some(BaseState::Unchanged));
assert_eq!(
BaseState::from_name("Unchanged"),
Some(BaseState::Unchanged)
);
assert_eq!(BaseState::from_name("Invalid"), None);
}
@ -495,10 +501,13 @@ mod tests {
// Create /OCProperties dict
let mut oc_props_dict = PdfDict::new();
oc_props_dict.insert(intern("OCGs"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])));
oc_props_dict.insert(
intern("OCGs"),
PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])),
);
let mut default_config = PdfDict::new();
default_config.insert(intern("BaseState"), PdfObject::Name(intern("ON")));
@ -527,10 +536,13 @@ mod tests {
resolver.cache_object(ocg2_ref, make_test_ocg(ocg2_ref, "Layer2", None));
let mut oc_props_dict = PdfDict::new();
oc_props_dict.insert(intern("OCGs"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])));
oc_props_dict.insert(
intern("OCGs"),
PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])),
);
let mut default_config = PdfDict::new();
default_config.insert(intern("BaseState"), PdfObject::Name(intern("OFF")));
@ -559,18 +571,24 @@ mod tests {
resolver.cache_object(ocg3_ref, make_test_ocg(ocg3_ref, "Layer3", None));
let mut oc_props_dict = PdfDict::new();
oc_props_dict.insert(intern("OCGs"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
PdfObject::Ref(ocg3_ref),
])));
oc_props_dict.insert(
intern("OCGs"),
PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
PdfObject::Ref(ocg3_ref),
])),
);
let mut default_config = PdfDict::new();
default_config.insert(intern("BaseState"), PdfObject::Name(intern("OFF")));
default_config.insert(intern("ON"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])));
default_config.insert(
intern("ON"),
PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])),
);
oc_props_dict.insert(intern("D"), PdfObject::Dict(Box::new(default_config)));
let oc_props_ref = ObjRef::new(1, 0);
@ -595,16 +613,20 @@ mod tests {
resolver.cache_object(ocg2_ref, make_test_ocg(ocg2_ref, "Layer2", None));
let mut oc_props_dict = PdfDict::new();
oc_props_dict.insert(intern("OCGs"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])));
oc_props_dict.insert(
intern("OCGs"),
PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])),
);
let mut default_config = PdfDict::new();
default_config.insert(intern("BaseState"), PdfObject::Name(intern("ON")));
default_config.insert(intern("OFF"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg2_ref),
])));
default_config.insert(
intern("OFF"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(ocg2_ref)])),
);
oc_props_dict.insert(intern("D"), PdfObject::Dict(Box::new(default_config)));
let oc_props_ref = ObjRef::new(1, 0);
@ -626,19 +648,22 @@ mod tests {
resolver.cache_object(ocg1_ref, make_test_ocg(ocg1_ref, "Layer1", None));
let mut oc_props_dict = PdfDict::new();
oc_props_dict.insert(intern("OCGs"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
])));
oc_props_dict.insert(
intern("OCGs"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(ocg1_ref)])),
);
let mut default_config = PdfDict::new();
default_config.insert(intern("BaseState"), PdfObject::Name(intern("OFF")));
// OCG in both /ON and /OFF: /OFF wins per spec
default_config.insert(intern("ON"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
])));
default_config.insert(intern("OFF"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
])));
default_config.insert(
intern("ON"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(ocg1_ref)])),
);
default_config.insert(
intern("OFF"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(ocg1_ref)])),
);
oc_props_dict.insert(intern("D"), PdfObject::Dict(Box::new(default_config)));
let oc_props_ref = ObjRef::new(1, 0);
@ -658,9 +683,10 @@ mod tests {
resolver.cache_object(ocg1_ref, make_test_ocg(ocg1_ref, "TestLayer", None));
let mut oc_props_dict = PdfDict::new();
oc_props_dict.insert(intern("OCGs"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
])));
oc_props_dict.insert(
intern("OCGs"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(ocg1_ref)])),
);
let mut default_config = PdfDict::new();
default_config.insert(intern("BaseState"), PdfObject::Name(intern("ON")));
@ -699,10 +725,13 @@ mod tests {
let mut ocmd_dict = PdfDict::new();
ocmd_dict.insert(intern("Type"), PdfObject::Name(intern("OCMD")));
ocmd_dict.insert(intern("OCGs"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])));
ocmd_dict.insert(
intern("OCGs"),
PdfObject::Array(Box::new(vec![
PdfObject::Ref(ocg1_ref),
PdfObject::Ref(ocg2_ref),
])),
);
ocmd_dict.insert(intern("P"), PdfObject::Name(intern("AllOn")));
let ocmd = Ocmd::parse(&PdfObject::Dict(Box::new(ocmd_dict)));
@ -789,11 +818,17 @@ mod tests {
fn test_ocg_group_parse() {
let mut ocg_dict = PdfDict::new();
ocg_dict.insert(intern("Type"), PdfObject::Name(intern("OCG")));
ocg_dict.insert(intern("Name"), PdfObject::String(Box::new(b"TestLayer".to_vec())));
ocg_dict.insert(intern("Intent"), PdfObject::Array(Box::new(vec![
PdfObject::Name(intern("View")),
PdfObject::Name(intern("Design")),
])));
ocg_dict.insert(
intern("Name"),
PdfObject::String(Box::new(b"TestLayer".to_vec())),
);
ocg_dict.insert(
intern("Intent"),
PdfObject::Array(Box::new(vec![
PdfObject::Name(intern("View")),
PdfObject::Name(intern("Design")),
])),
);
let group = OcGroup::parse(&PdfObject::Dict(Box::new(ocg_dict)), &mut Vec::new());


@ -9,10 +9,10 @@
//! - /Count indicates open (positive) or closed (negative) state
//! - /Dest or /A specify the destination
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::object::{ObjRef, PdfObject};
use crate::parser::pages::PageDict;
use crate::parser::xref::XrefResolver;
use crate::diagnostics::{Diagnostic, DiagCode};
use std::collections::HashSet;
/// Maximum depth of outline nesting to prevent stack overflow.
@ -173,12 +173,10 @@ fn decode_pdf_string(bytes: &[u8]) -> Result<String> {
/// Decode UTF-16BE string with BOM (bytes after 0xFE 0xFF).
fn decode_utf16be_bom(bytes: &[u8]) -> Result<String> {
if bytes.len() % 2 != 0 {
return Err(vec![
Diagnostic::with_static_no_offset(
DiagCode::StructInvalidUtf16,
"STRUCT_INVALID_UTF16: UTF-16BE string has odd length",
)
]);
return Err(vec![Diagnostic::with_static_no_offset(
DiagCode::StructInvalidUtf16,
"STRUCT_INVALID_UTF16: UTF-16BE string has odd length",
)]);
}
let utf16_chars: Vec<u16> = bytes
@ -187,12 +185,10 @@ fn decode_utf16be_bom(bytes: &[u8]) -> Result<String> {
.collect();
String::from_utf16(&utf16_chars).map_err(|_| {
vec![
Diagnostic::with_static_no_offset(
DiagCode::StructInvalidUtf16,
"STRUCT_INVALID_UTF16: Invalid UTF-16BE sequence",
)
]
vec![Diagnostic::with_static_no_offset(
DiagCode::StructInvalidUtf16,
"STRUCT_INVALID_UTF16: Invalid UTF-16BE sequence",
)]
})
}
@ -246,252 +242,252 @@ fn decode_pdfdocencoding(bytes: &[u8]) -> Result<String> {
// Key: octal value from spec, Value: Unicode codepoint
fn pdfdoc_override(byte: u8) -> Option<char> {
match byte {
0o010 => Some('\u{0000}'), // NUL
0o011 => Some('\u{0001}'), // SOH
0o012 => Some('\u{0002}'), // STX
0o013 => Some('\u{0003}'), // ETX
0o014 => Some('\u{0004}'), // EOT
0o015 => Some('\u{0005}'), // ENQ
0o016 => Some('\u{0006}'), // ACK
0o017 => Some('\u{0007}'), // BEL
0o020 => Some('\u{0008}'), // BS
0o021 => Some('\u{0009}'), // HT
0o022 => Some('\u{000A}'), // LF
0o023 => Some('\u{000B}'), // VT
0o024 => Some('\u{000C}'), // FF
0o025 => Some('\u{000D}'), // CR
0o026 => Some('\u{000E}'), // SO
0o027 => Some('\u{000F}'), // SI
0o030 => Some('\u{0010}'), // DLE
0o031 => Some('\u{0011}'), // DC1
0o032 => Some('\u{0012}'), // DC2
0o033 => Some('\u{0013}'), // DC3
0o034 => Some('\u{0014}'), // DC4
0o035 => Some('\u{0015}'), // NAK
0o036 => Some('\u{0016}'), // SYN
0o037 => Some('\u{0017}'), // ETB
0o040 => Some('\u{0020}'), // Space (same as Latin-1)
0o041 => Some('\u{0021}'), // !
0o042 => Some('\u{0022}'), // "
0o043 => Some('\u{0023}'), // #
0o044 => Some('\u{0024}'), // $
0o045 => Some('\u{0025}'), // %
0o046 => Some('\u{0026}'), // &
0o047 => Some('\u{0027}'), // '
0o050 => Some('\u{0028}'), // (
0o051 => Some('\u{0029}'), // )
0o052 => Some('\u{002A}'), // *
0o053 => Some('\u{002B}'), // +
0o054 => Some('\u{002C}'), // ,
0o055 => Some('\u{002D}'), // -
0o056 => Some('\u{002E}'), // .
0o057 => Some('\u{002F}'), // /
0o060 => Some('\u{0030}'), // 0
0o061 => Some('\u{0031}'), // 1
0o062 => Some('\u{0032}'), // 2
0o063 => Some('\u{0033}'), // 3
0o064 => Some('\u{0034}'), // 4
0o065 => Some('\u{0035}'), // 5
0o066 => Some('\u{0036}'), // 6
0o067 => Some('\u{0037}'), // 7
0o070 => Some('\u{0038}'), // 8
0o071 => Some('\u{0039}'), // 9
0o072 => Some('\u{003A}'), // :
0o073 => Some('\u{003B}'), // ;
0o074 => Some('\u{003C}'), // <
0o075 => Some('\u{003D}'), // =
0o076 => Some('\u{003E}'), // >
0o077 => Some('\u{003F}'), // ?
0o100 => Some('\u{0040}'), // @
0o101 => Some('\u{0041}'), // A
0o102 => Some('\u{0042}'), // B
0o103 => Some('\u{0043}'), // C
0o104 => Some('\u{0044}'), // D
0o105 => Some('\u{0045}'), // E
0o106 => Some('\u{0046}'), // F
0o107 => Some('\u{0047}'), // G
0o110 => Some('\u{0048}'), // H
0o111 => Some('\u{0049}'), // I
0o112 => Some('\u{004A}'), // J
0o113 => Some('\u{004B}'), // K
0o114 => Some('\u{004C}'), // L
0o115 => Some('\u{004D}'), // M
0o116 => Some('\u{004E}'), // N
0o117 => Some('\u{004F}'), // O
0o120 => Some('\u{0050}'), // P
0o121 => Some('\u{0051}'), // Q
0o122 => Some('\u{0052}'), // R
0o123 => Some('\u{0053}'), // S
0o124 => Some('\u{0054}'), // T
0o125 => Some('\u{0055}'), // U
0o126 => Some('\u{0056}'), // V
0o127 => Some('\u{0057}'), // W
0o130 => Some('\u{0058}'), // X
0o131 => Some('\u{0059}'), // Y
0o132 => Some('\u{005A}'), // Z
0o133 => Some('\u{005B}'), // [
0o134 => Some('\u{005C}'), // \
0o135 => Some('\u{005D}'), // ]
0o136 => Some('\u{005E}'), // ^
0o137 => Some('\u{005F}'), // _
0o140 => Some('\u{0060}'), // `
0o141 => Some('\u{0061}'), // a
0o142 => Some('\u{0062}'), // b
0o143 => Some('\u{0063}'), // c
0o144 => Some('\u{0064}'), // d
0o145 => Some('\u{0065}'), // e
0o146 => Some('\u{0066}'), // f
0o147 => Some('\u{0067}'), // g
0o150 => Some('\u{0068}'), // h
0o151 => Some('\u{0069}'), // i
0o152 => Some('\u{006A}'), // j
0o153 => Some('\u{006B}'), // k
0o154 => Some('\u{006C}'), // l
0o155 => Some('\u{006D}'), // m
0o156 => Some('\u{006E}'), // n
0o157 => Some('\u{006F}'), // o
0o160 => Some('\u{0070}'), // p
0o161 => Some('\u{0071}'), // q
0o162 => Some('\u{0072}'), // r
0o163 => Some('\u{0073}'), // s
0o164 => Some('\u{0074}'), // t
0o165 => Some('\u{0075}'), // u
0o166 => Some('\u{0076}'), // v
0o167 => Some('\u{0077}'), // w
0o170 => Some('\u{0078}'), // x
0o171 => Some('\u{0079}'), // y
0o172 => Some('\u{007A}'), // z
0o173 => Some('\u{007B}'), // {
0o174 => Some('\u{007C}'), // |
0o175 => Some('\u{007D}'), // }
0o176 => Some('\u{007E}'), // ~
0o200 => Some('\u{2022}'), // Bullet
0o201 => Some('\u{2020}'), // Dagger
0o202 => Some('\u{2021}'), // Double Dagger
0o203 => Some('\u{2026}'), // Ellipsis
0o204 => Some('\u{2014}'), // Em Dash
0o205 => Some('\u{2013}'), // En Dash
0o206 => Some('\u{0192}'), // Florin
0o207 => Some('\u{2044}'), // Fraction
0o210 => Some('\u{2039}'), // Single Left Angle Quote
0o211 => Some('\u{203A}'), // Single Right Angle Quote
0o212 => Some('\u{201C}'), // Double Left Quote
0o213 => Some('\u{201D}'), // Double Right Quote
0o214 => Some('\u{2018}'), // Single Left Quote
0o215 => Some('\u{2019}'), // Single Right Quote
0o216 => Some('\u{201A}'), // Single Low-9 Quote
0o217 => Some('\u{2122}'), // Trademark
0o220 => Some('\u{FB01}'), // fi ligature
0o221 => Some('\u{FB02}'), // fl ligature
0o222 => Some('\u{0141}'), // L with stroke
0o223 => Some('\u{0152}'), // OE ligature
0o224 => Some('\u{0133}'), // ij ligature (U+0133; the oe ligature is U+0153)
0o225 => Some('\u{0178}'), // Y with diaeresis
0o226 => Some('\u{00A1}'), // Inverted exclamation
0o227 => Some('\u{00BF}'), // Inverted question mark
0o230 => Some('\u{00A1}'), // Inverted exclamation (duplicate in spec)
0o231 => Some('\u{00BF}'), // Inverted question mark (duplicate in spec)
0o232 => Some('\u{00A2}'), // Cent sign
0o233 => Some('\u{00A3}'), // Pound sign
0o234 => Some('\u{00A5}'), // Yen sign
0o235 => Some('\u{20A7}'), // Peseta sign (changed in PDF 2.0, using original)
0o236 => Some('\u{0192}'), // Florin (duplicate)
0o240 => Some('\u{00E6}'), // ae ligature
0o241 => Some('\u{0153}'), // oe ligature (duplicate)
0o242 => Some('\u{0178}'), // Y with diaeresis (duplicate)
0o243 => Some('\u{00C1}'), // A with acute
0o244 => Some('\u{00C2}'), // A with circumflex
0o245 => Some('\u{00C4}'), // A with diaeresis
0o246 => Some('\u{00C0}'), // A with grave
0o247 => Some('\u{00C5}'), // A with ring
0o250 => Some('\u{00C7}'), // C with cedilla
0o251 => Some('\u{00C9}'), // E with acute
0o252 => Some('\u{00C9}'), // E with acute (duplicate, using correct value)
0o253 => Some('\u{00CA}'), // E with circumflex
0o254 => Some('\u{00CB}'), // E with diaeresis
0o255 => Some('\u{00C8}'), // E with grave
0o256 => Some('\u{00CD}'), // I with acute
0o257 => Some('\u{00CE}'), // I with circumflex
0o260 => Some('\u{00CF}'), // I with diaeresis
0o261 => Some('\u{00CC}'), // I with grave
0o262 => Some('\u{00D1}'), // N with tilde
0o263 => Some('\u{00D3}'), // O with acute
0o264 => Some('\u{00D4}'), // O with circumflex
0o265 => Some('\u{00D6}'), // O with diaeresis
0o266 => Some('\u{00D2}'), // O with grave
0o267 => Some('\u{00D8}'), // O with stroke
0o270 => Some('\u{0152}'), // OE ligature (duplicate)
0o271 => Some('\u{00D5}'), // O with tilde
0o272 => Some('\u{00D7}'), // Multiplication
0o273 => Some('\u{00F7}'), // Division
0o274 => Some('\u{0178}'), // Y with diaeresis (duplicate)
0o275 => Some('\u{00E1}'), // a with acute
0o276 => Some('\u{00E2}'), // a with circumflex
0o277 => Some('\u{00E4}'), // a with diaeresis
0o300 => Some('\u{00E0}'), // a with grave
0o301 => Some('\u{00E5}'), // a with ring
0o302 => Some('\u{00E7}'), // c with cedilla
0o303 => Some('\u{00E9}'), // e with acute
0o304 => Some('\u{00EA}'), // e with circumflex
0o305 => Some('\u{00EB}'), // e with diaeresis
0o306 => Some('\u{00E8}'), // e with grave
0o307 => Some('\u{00ED}'), // i with acute
0o310 => Some('\u{00EE}'), // i with circumflex
0o311 => Some('\u{00EF}'), // i with diaeresis
0o312 => Some('\u{00EC}'), // i with grave
0o313 => Some('\u{00F1}'), // n with tilde
0o314 => Some('\u{00F3}'), // o with acute
0o315 => Some('\u{00F4}'), // o with circumflex
0o316 => Some('\u{00F6}'), // o with diaeresis
0o317 => Some('\u{00F2}'), // o with grave
0o320 => Some('\u{00F8}'), // o with stroke
0o321 => Some('\u{0153}'), // oe ligature
0o322 => Some('\u{00F5}'), // o with tilde
0o323 => Some('\u{00DF}'), // Sharp s
0o324 => Some('\u{007B}'), // { (duplicate)
0o325 => Some('\u{007D}'), // } (duplicate)
0o326 => Some('\u{00A1}'), // Inverted exclamation (duplicate)
0o327 => Some('\u{00BF}'), // Inverted question mark (duplicate)
0o330 => Some('\u{0161}'), // s with caron
0o331 => Some('\u{017D}'), // Z with caron
0o332 => Some('\u{00A9}'), // Copyright
0o333 => Some('\u{00AE}'), // Registered
0o334 => Some('\u{2122}'), // Trademark (duplicate)
0o335 => Some('\u{2212}'), // Minus sign
0o336 => Some('\u{2012}'), // Figure dash
0o337 => Some('\u{0452}'), // Cyrillic small letter dje
0o340 => Some('\u{0452}'), // Cyrillic small letter dje (duplicate)
0o341 => Some('\u{2013}'), // En dash (duplicate)
0o342 => Some('\u{2014}'), // Em dash (duplicate)
0o343 => Some('\u{201C}'), // Double left quote (duplicate)
0o344 => Some('\u{201D}'), // Double right quote (duplicate)
0o345 => Some('\u{2018}'), // Single left quote (duplicate)
0o346 => Some('\u{2019}'), // Single right quote (duplicate)
0o347 => Some('\u{2022}'), // Bullet (duplicate)
0o350 => Some('\u{201A}'), // Single low-9 quote (duplicate)
0o351 => Some('\u{2039}'), // Single left angle quote (duplicate)
0o352 => Some('\u{203A}'), // Single right angle quote (duplicate)
0o353 => Some('\u{2026}'), // Ellipsis (duplicate)
0o354 => Some('\u{2020}'), // Dagger (duplicate)
0o355 => Some('\u{2021}'), // Double dagger (duplicate)
0o356 => Some('\u{20AC}'), // Euro sign (PDF 1.4+)
0o357 => Some('\u{2030}'), // Per mille
0o360 => Some('\u{0160}'), // S with caron
0o361 => Some('\u{017E}'), // z with caron
0o362 => Some('\u{0161}'), // s with caron (duplicate)
0o363 => Some('\u{017D}'), // Z with caron (duplicate)
0o364 => Some('\u{0178}'), // Y with diaeresis (duplicate)
0o365 => Some('\u{00A1}'), // Inverted exclamation (duplicate)
0o366 => Some('\u{00BF}'), // Inverted question mark (duplicate)
0o367 => Some('\u{2212}'), // Minus sign (duplicate)
0o370 => Some('\u{0000}'), // Should be "unused" but using null
0o371 => Some('\u{0000}'), // Should be "unused" but using null
0o372 => Some('\u{0000}'), // Should be "unused" but using null
0o373 => Some('\u{0000}'), // Should be "unused" but using null
0o374 => Some('\u{0000}'), // Should be "unused" but using null
0o375 => Some('\u{0000}'), // Should be "unused" but using null
0o376 => Some('\u{0000}'), // Should be "unused" but using null
0o377 => Some('\u{0000}'), // Should be "unused" but using null
0o010 => Some('\u{0000}'), // NUL
0o011 => Some('\u{0001}'), // SOH
0o012 => Some('\u{0002}'), // STX
0o013 => Some('\u{0003}'), // ETX
0o014 => Some('\u{0004}'), // EOT
0o015 => Some('\u{0005}'), // ENQ
0o016 => Some('\u{0006}'), // ACK
0o017 => Some('\u{0007}'), // BEL
0o020 => Some('\u{0008}'), // BS
0o021 => Some('\u{0009}'), // HT
0o022 => Some('\u{000A}'), // LF
0o023 => Some('\u{000B}'), // VT
0o024 => Some('\u{000C}'), // FF
0o025 => Some('\u{000D}'), // CR
0o026 => Some('\u{000E}'), // SO
0o027 => Some('\u{000F}'), // SI
0o030 => Some('\u{0010}'), // DLE
0o031 => Some('\u{0011}'), // DC1
0o032 => Some('\u{0012}'), // DC2
0o033 => Some('\u{0013}'), // DC3
0o034 => Some('\u{0014}'), // DC4
0o035 => Some('\u{0015}'), // NAK
0o036 => Some('\u{0016}'), // SYN
0o037 => Some('\u{0017}'), // ETB
0o040 => Some('\u{0020}'), // Space (same as Latin-1)
0o041 => Some('\u{0021}'), // !
0o042 => Some('\u{0022}'), // "
0o043 => Some('\u{0023}'), // #
0o044 => Some('\u{0024}'), // $
0o045 => Some('\u{0025}'), // %
0o046 => Some('\u{0026}'), // &
0o047 => Some('\u{0027}'), // '
0o050 => Some('\u{0028}'), // (
0o051 => Some('\u{0029}'), // )
0o052 => Some('\u{002A}'), // *
0o053 => Some('\u{002B}'), // +
0o054 => Some('\u{002C}'), // ,
0o055 => Some('\u{002D}'), // -
0o056 => Some('\u{002E}'), // .
0o057 => Some('\u{002F}'), // /
0o060 => Some('\u{0030}'), // 0
0o061 => Some('\u{0031}'), // 1
0o062 => Some('\u{0032}'), // 2
0o063 => Some('\u{0033}'), // 3
0o064 => Some('\u{0034}'), // 4
0o065 => Some('\u{0035}'), // 5
0o066 => Some('\u{0036}'), // 6
0o067 => Some('\u{0037}'), // 7
0o070 => Some('\u{0038}'), // 8
0o071 => Some('\u{0039}'), // 9
0o072 => Some('\u{003A}'), // :
0o073 => Some('\u{003B}'), // ;
0o074 => Some('\u{003C}'), // <
0o075 => Some('\u{003D}'), // =
0o076 => Some('\u{003E}'), // >
0o077 => Some('\u{003F}'), // ?
0o100 => Some('\u{0040}'), // @
0o101 => Some('\u{0041}'), // A
0o102 => Some('\u{0042}'), // B
0o103 => Some('\u{0043}'), // C
0o104 => Some('\u{0044}'), // D
0o105 => Some('\u{0045}'), // E
0o106 => Some('\u{0046}'), // F
0o107 => Some('\u{0047}'), // G
0o110 => Some('\u{0048}'), // H
0o111 => Some('\u{0049}'), // I
0o112 => Some('\u{004A}'), // J
0o113 => Some('\u{004B}'), // K
0o114 => Some('\u{004C}'), // L
0o115 => Some('\u{004D}'), // M
0o116 => Some('\u{004E}'), // N
0o117 => Some('\u{004F}'), // O
0o120 => Some('\u{0050}'), // P
0o121 => Some('\u{0051}'), // Q
0o122 => Some('\u{0052}'), // R
0o123 => Some('\u{0053}'), // S
0o124 => Some('\u{0054}'), // T
0o125 => Some('\u{0055}'), // U
0o126 => Some('\u{0056}'), // V
0o127 => Some('\u{0057}'), // W
0o130 => Some('\u{0058}'), // X
0o131 => Some('\u{0059}'), // Y
0o132 => Some('\u{005A}'), // Z
0o133 => Some('\u{005B}'), // [
0o134 => Some('\u{005C}'), // \
0o135 => Some('\u{005D}'), // ]
0o136 => Some('\u{005E}'), // ^
0o137 => Some('\u{005F}'), // _
0o140 => Some('\u{0060}'), // `
0o141 => Some('\u{0061}'), // a
0o142 => Some('\u{0062}'), // b
0o143 => Some('\u{0063}'), // c
0o144 => Some('\u{0064}'), // d
0o145 => Some('\u{0065}'), // e
0o146 => Some('\u{0066}'), // f
0o147 => Some('\u{0067}'), // g
0o150 => Some('\u{0068}'), // h
0o151 => Some('\u{0069}'), // i
0o152 => Some('\u{006A}'), // j
0o153 => Some('\u{006B}'), // k
0o154 => Some('\u{006C}'), // l
0o155 => Some('\u{006D}'), // m
0o156 => Some('\u{006E}'), // n
0o157 => Some('\u{006F}'), // o
0o160 => Some('\u{0070}'), // p
0o161 => Some('\u{0071}'), // q
0o162 => Some('\u{0072}'), // r
0o163 => Some('\u{0073}'), // s
0o164 => Some('\u{0074}'), // t
0o165 => Some('\u{0075}'), // u
0o166 => Some('\u{0076}'), // v
0o167 => Some('\u{0077}'), // w
0o170 => Some('\u{0078}'), // x
0o171 => Some('\u{0079}'), // y
0o172 => Some('\u{007A}'), // z
0o173 => Some('\u{007B}'), // {
0o174 => Some('\u{007C}'), // |
0o175 => Some('\u{007D}'), // }
0o176 => Some('\u{007E}'), // ~
0o200 => Some('\u{2022}'), // Bullet
0o201 => Some('\u{2020}'), // Dagger
0o202 => Some('\u{2021}'), // Double Dagger
0o203 => Some('\u{2026}'), // Ellipsis
0o204 => Some('\u{2014}'), // Em Dash
0o205 => Some('\u{2013}'), // En Dash
0o206 => Some('\u{0192}'), // Florin
0o207 => Some('\u{2044}'), // Fraction
0o210 => Some('\u{2039}'), // Single Left Angle Quote
0o211 => Some('\u{203A}'), // Single Right Angle Quote
0o212 => Some('\u{201C}'), // Double Left Quote
0o213 => Some('\u{201D}'), // Double Right Quote
0o214 => Some('\u{2018}'), // Single Left Quote
0o215 => Some('\u{2019}'), // Single Right Quote
0o216 => Some('\u{201A}'), // Single Low-9 Quote
0o217 => Some('\u{2122}'), // Trademark
0o220 => Some('\u{FB01}'), // fi ligature
0o221 => Some('\u{FB02}'), // fl ligature
0o222 => Some('\u{0141}'), // L with stroke
0o223 => Some('\u{0152}'), // OE ligature
0o224 => Some('\u{0133}'), // ij ligature (U+0133; the oe ligature is U+0153)
0o225 => Some('\u{0178}'), // Y with diaeresis
0o226 => Some('\u{00A1}'), // Inverted exclamation
0o227 => Some('\u{00BF}'), // Inverted question mark
0o230 => Some('\u{00A1}'), // Inverted exclamation (duplicate in spec)
0o231 => Some('\u{00BF}'), // Inverted question mark (duplicate in spec)
0o232 => Some('\u{00A2}'), // Cent sign
0o233 => Some('\u{00A3}'), // Pound sign
0o234 => Some('\u{00A5}'), // Yen sign
0o235 => Some('\u{20A7}'), // Peseta sign (changed in PDF 2.0, using original)
0o236 => Some('\u{0192}'), // Florin (duplicate)
0o240 => Some('\u{00E6}'), // ae ligature
0o241 => Some('\u{0153}'), // oe ligature (duplicate)
0o242 => Some('\u{0178}'), // Y with diaeresis (duplicate)
0o243 => Some('\u{00C1}'), // A with acute
0o244 => Some('\u{00C2}'), // A with circumflex
0o245 => Some('\u{00C4}'), // A with diaeresis
0o246 => Some('\u{00C0}'), // A with grave
0o247 => Some('\u{00C5}'), // A with ring
0o250 => Some('\u{00C7}'), // C with cedilla
0o251 => Some('\u{00C9}'), // E with acute
0o252 => Some('\u{00C9}'), // E with acute (duplicate, using correct value)
0o253 => Some('\u{00CA}'), // E with circumflex
0o254 => Some('\u{00CB}'), // E with diaeresis
0o255 => Some('\u{00C8}'), // E with grave
0o256 => Some('\u{00CD}'), // I with acute
0o257 => Some('\u{00CE}'), // I with circumflex
0o260 => Some('\u{00CF}'), // I with diaeresis
0o261 => Some('\u{00CC}'), // I with grave
0o262 => Some('\u{00D1}'), // N with tilde
0o263 => Some('\u{00D3}'), // O with acute
0o264 => Some('\u{00D4}'), // O with circumflex
0o265 => Some('\u{00D6}'), // O with diaeresis
0o266 => Some('\u{00D2}'), // O with grave
0o267 => Some('\u{00D8}'), // O with stroke
0o270 => Some('\u{0152}'), // OE ligature (duplicate)
0o271 => Some('\u{00D5}'), // O with tilde
0o272 => Some('\u{00D7}'), // Multiplication
0o273 => Some('\u{00F7}'), // Division
0o274 => Some('\u{0178}'), // Y with diaeresis (duplicate)
0o275 => Some('\u{00E1}'), // a with acute
0o276 => Some('\u{00E2}'), // a with circumflex
0o277 => Some('\u{00E4}'), // a with diaeresis
0o300 => Some('\u{00E0}'), // a with grave
0o301 => Some('\u{00E5}'), // a with ring
0o302 => Some('\u{00E7}'), // c with cedilla
0o303 => Some('\u{00E9}'), // e with acute
0o304 => Some('\u{00EA}'), // e with circumflex
0o305 => Some('\u{00EB}'), // e with diaeresis
0o306 => Some('\u{00E8}'), // e with grave
0o307 => Some('\u{00ED}'), // i with acute
0o310 => Some('\u{00EE}'), // i with circumflex
0o311 => Some('\u{00EF}'), // i with diaeresis
0o312 => Some('\u{00EC}'), // i with grave
0o313 => Some('\u{00F1}'), // n with tilde
0o314 => Some('\u{00F3}'), // o with acute
0o315 => Some('\u{00F4}'), // o with circumflex
0o316 => Some('\u{00F6}'), // o with diaeresis
0o317 => Some('\u{00F2}'), // o with grave
0o320 => Some('\u{00F8}'), // o with stroke
0o321 => Some('\u{0153}'), // oe ligature
0o322 => Some('\u{00F5}'), // o with tilde
0o323 => Some('\u{00DF}'), // Sharp s
0o324 => Some('\u{007B}'), // { (duplicate)
0o325 => Some('\u{007D}'), // } (duplicate)
0o326 => Some('\u{00A1}'), // Inverted exclamation (duplicate)
0o327 => Some('\u{00BF}'), // Inverted question mark (duplicate)
0o330 => Some('\u{0161}'), // s with caron
0o331 => Some('\u{017D}'), // Z with caron
0o332 => Some('\u{00A9}'), // Copyright
0o333 => Some('\u{00AE}'), // Registered
0o334 => Some('\u{2122}'), // Trademark (duplicate)
0o335 => Some('\u{2212}'), // Minus sign
0o336 => Some('\u{2012}'), // Figure dash
0o337 => Some('\u{0452}'), // Cyrillic small letter dje
0o340 => Some('\u{0452}'), // Cyrillic small letter dje (duplicate)
0o341 => Some('\u{2013}'), // En dash (duplicate)
0o342 => Some('\u{2014}'), // Em dash (duplicate)
0o343 => Some('\u{201C}'), // Double left quote (duplicate)
0o344 => Some('\u{201D}'), // Double right quote (duplicate)
0o345 => Some('\u{2018}'), // Single left quote (duplicate)
0o346 => Some('\u{2019}'), // Single right quote (duplicate)
0o347 => Some('\u{2022}'), // Bullet (duplicate)
0o350 => Some('\u{201A}'), // Single low-9 quote (duplicate)
0o351 => Some('\u{2039}'), // Single left angle quote (duplicate)
0o352 => Some('\u{203A}'), // Single right angle quote (duplicate)
0o353 => Some('\u{2026}'), // Ellipsis (duplicate)
0o354 => Some('\u{2020}'), // Dagger (duplicate)
0o355 => Some('\u{2021}'), // Double dagger (duplicate)
0o356 => Some('\u{20AC}'), // Euro sign (PDF 1.4+)
0o357 => Some('\u{2030}'), // Per mille
0o360 => Some('\u{0160}'), // S with caron
0o361 => Some('\u{017E}'), // z with caron
0o362 => Some('\u{0161}'), // s with caron (duplicate)
0o363 => Some('\u{017D}'), // Z with caron (duplicate)
0o364 => Some('\u{0178}'), // Y with diaeresis (duplicate)
0o365 => Some('\u{00A1}'), // Inverted exclamation (duplicate)
0o366 => Some('\u{00BF}'), // Inverted question mark (duplicate)
0o367 => Some('\u{2212}'), // Minus sign (duplicate)
0o370 => Some('\u{0000}'), // Should be "unused" but using null
0o371 => Some('\u{0000}'), // Should be "unused" but using null
0o372 => Some('\u{0000}'), // Should be "unused" but using null
0o373 => Some('\u{0000}'), // Should be "unused" but using null
0o374 => Some('\u{0000}'), // Should be "unused" but using null
0o375 => Some('\u{0000}'), // Should be "unused" but using null
0o376 => Some('\u{0000}'), // Should be "unused" but using null
0o377 => Some('\u{0000}'), // Should be "unused" but using null
_ => None,
}
}
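
The override table above is consulted one byte at a time. A minimal sketch of how such a table might be combined with a fallback is shown here. The names `override_for` and `decode_with_overrides` are hypothetical, only one table entry is reproduced, and the Latin-1 fallback is an assumption of this sketch rather than something the diff confirms.

```rust
/// Hypothetical override table; a single entry is shown for illustration.
fn override_for(byte: u8) -> Option<char> {
    match byte {
        0o200 => Some('\u{2022}'), // bullet, matching the entry in the table above
        _ => None,
    }
}

/// Decode bytes, preferring the override table and otherwise treating the
/// byte as Latin-1 (assumption: the real decoder may handle misses differently).
fn decode_with_overrides(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| override_for(b).unwrap_or(b as char))
        .collect()
}
```

Casting a `u8` to `char` yields the code point with the same numeric value, which is why the fallback amounts to Latin-1 decoding.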
@ -596,7 +592,10 @@ fn parse_outline_recursive(
if !visited.insert(node_ref) {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructCircularRef,
format!("STRUCT_CIRCULAR_REF: Cycle detected at outline node {}", node_ref),
format!(
"STRUCT_CIRCULAR_REF: Cycle detected at outline node {}",
node_ref
),
));
return None;
}
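
The `visited.insert(node_ref)` check above relies on `HashSet::insert` returning `false` when the value is already present. A minimal sketch of the same idea on a plain sibling chain follows. `walk_siblings` and the `u32` node ids are hypothetical stand-ins for the outline types, not the code in this commit.

```rust
use std::collections::{HashMap, HashSet};

/// Follow a chain of `next` links, stopping as soon as a node repeats.
/// Returns the nodes visited in order and whether a cycle was detected.
fn walk_siblings(first: u32, next: &HashMap<u32, u32>) -> (Vec<u32>, bool) {
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    let mut current = Some(first);
    while let Some(node) = current {
        // `insert` returns false if the node was already in the set.
        if !visited.insert(node) {
            return (order, true);
        }
        order.push(node);
        current = next.get(&node).copied();
    }
    (order, false)
}
```

With the chain 100 -> 101 -> 100, both nodes are collected before the repeat is noticed, which mirrors the expectation in the cycle test further down in this diff.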
@ -605,7 +604,10 @@ fn parse_outline_recursive(
if depth >= MAX_OUTLINE_DEPTH {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructDepthExceeded,
format!("STRUCT_DEPTH_EXCEEDED: Outline depth exceeds limit of {}", MAX_OUTLINE_DEPTH),
format!(
"STRUCT_DEPTH_EXCEEDED: Outline depth exceeds limit of {}",
MAX_OUTLINE_DEPTH
),
));
return None;
}
@ -645,7 +647,10 @@ fn parse_outline_recursive(
None => {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructMissingKey,
format!("STRUCT_MISSING_KEY: Outline node {} missing /Title", node_ref),
format!(
"STRUCT_MISSING_KEY: Outline node {} missing /Title",
node_ref
),
));
String::from("<missing title>")
}
@ -879,7 +884,9 @@ mod tests {
let result = decode_pdf_string(&utf16be);
assert!(result.is_err());
let diags = result.unwrap_err();
assert!(diags.iter().any(|d| d.message.contains("STRUCT_INVALID_UTF16")));
assert!(diags
.iter()
.any(|d| d.message.contains("STRUCT_INVALID_UTF16")));
}
#[test]
@ -1000,7 +1007,10 @@ mod tests {
// Create a simple outline item
let mut outline_dict = IndexMap::new();
outline_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Chapter 1".to_vec())));
outline_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Chapter 1".to_vec())),
);
outline_dict.insert(intern("Dest"), {
let mut dest = Vec::new();
dest.push(PdfObject::Ref(ObjRef::new(10, 0)));
@ -1030,7 +1040,10 @@ mod tests {
// Create an outline item with /Count
let mut outline_dict = IndexMap::new();
outline_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Section".to_vec())));
outline_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Section".to_vec())),
);
outline_dict.insert(intern("Count"), PdfObject::Integer(-3)); // Collapsed with 3 descendants
outline_dict.insert(intern("Dest"), {
let mut dest = Vec::new();
@ -1059,7 +1072,10 @@ mod tests {
// Create child outline
let mut child_dict = IndexMap::new();
child_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Section 1.1".to_vec())));
child_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Section 1.1".to_vec())),
);
child_dict.insert(intern("Dest"), {
let mut dest = Vec::new();
dest.push(PdfObject::Ref(ObjRef::new(12, 0)));
@ -1071,7 +1087,10 @@ mod tests {
// Create parent outline with /First pointing to child
let mut parent_dict = IndexMap::new();
parent_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Chapter 1".to_vec())));
parent_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Chapter 1".to_vec())),
);
parent_dict.insert(intern("First"), PdfObject::Ref(ObjRef::new(101, 0)));
parent_dict.insert(intern("Count"), PdfObject::Integer(1)); // One child
@ -1097,7 +1116,10 @@ mod tests {
// Level 3: Grandchild
let mut grandchild_dict = IndexMap::new();
grandchild_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Section 1.1.1".to_vec())));
grandchild_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Section 1.1.1".to_vec())),
);
grandchild_dict.insert(intern("Dest"), {
let mut dest = Vec::new();
dest.push(PdfObject::Ref(ObjRef::new(10, 0)));
@ -1105,11 +1127,17 @@ mod tests {
PdfObject::Array(Box::new(dest))
});
resolver.cache_object(ObjRef::new(102, 0), PdfObject::Dict(Box::new(grandchild_dict)));
resolver.cache_object(
ObjRef::new(102, 0),
PdfObject::Dict(Box::new(grandchild_dict)),
);
// Level 2: Child with /First pointing to grandchild
let mut child_dict = IndexMap::new();
child_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Section 1.1".to_vec())));
child_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Section 1.1".to_vec())),
);
child_dict.insert(intern("First"), PdfObject::Ref(ObjRef::new(102, 0)));
child_dict.insert(intern("Count"), PdfObject::Integer(1));
@ -1117,7 +1145,10 @@ mod tests {
// Level 1: Parent with /First pointing to child
let mut parent_dict = IndexMap::new();
parent_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Chapter 1".to_vec())));
parent_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Chapter 1".to_vec())),
);
parent_dict.insert(intern("First"), PdfObject::Ref(ObjRef::new(101, 0)));
parent_dict.insert(intern("Count"), PdfObject::Integer(2));
@ -1145,7 +1176,10 @@ mod tests {
// Create second sibling
let mut sibling2_dict = IndexMap::new();
sibling2_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Chapter 2".to_vec())));
sibling2_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Chapter 2".to_vec())),
);
sibling2_dict.insert(intern("Dest"), {
let mut dest = Vec::new();
dest.push(PdfObject::Ref(ObjRef::new(11, 0)));
@ -1153,11 +1187,17 @@ mod tests {
PdfObject::Array(Box::new(dest))
});
resolver.cache_object(ObjRef::new(101, 0), PdfObject::Dict(Box::new(sibling2_dict)));
resolver.cache_object(
ObjRef::new(101, 0),
PdfObject::Dict(Box::new(sibling2_dict)),
);
// Create first sibling with /Next pointing to second
let mut sibling1_dict = IndexMap::new();
sibling1_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Chapter 1".to_vec())));
sibling1_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Chapter 1".to_vec())),
);
sibling1_dict.insert(intern("Next"), PdfObject::Ref(ObjRef::new(101, 0)));
sibling1_dict.insert(intern("Dest"), {
let mut dest = Vec::new();
@ -1166,7 +1206,10 @@ mod tests {
PdfObject::Array(Box::new(dest))
});
resolver.cache_object(ObjRef::new(100, 0), PdfObject::Dict(Box::new(sibling1_dict)));
resolver.cache_object(
ObjRef::new(100, 0),
PdfObject::Dict(Box::new(sibling1_dict)),
);
// Create outlines root
let mut root_dict = IndexMap::new();
@ -1188,16 +1231,28 @@ mod tests {
// Create an outline that forms a cycle: 100 -> 101 -> 100
let mut outline1_dict = IndexMap::new();
outline1_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Outline 1".to_vec())));
outline1_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Outline 1".to_vec())),
);
outline1_dict.insert(intern("Next"), PdfObject::Ref(ObjRef::new(101, 0)));
resolver.cache_object(ObjRef::new(100, 0), PdfObject::Dict(Box::new(outline1_dict)));
resolver.cache_object(
ObjRef::new(100, 0),
PdfObject::Dict(Box::new(outline1_dict)),
);
let mut outline2_dict = IndexMap::new();
outline2_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Outline 2".to_vec())));
outline2_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Outline 2".to_vec())),
);
outline2_dict.insert(intern("Next"), PdfObject::Ref(ObjRef::new(100, 0))); // Cycle back
resolver.cache_object(ObjRef::new(101, 0), PdfObject::Dict(Box::new(outline2_dict)));
resolver.cache_object(
ObjRef::new(101, 0),
PdfObject::Dict(Box::new(outline2_dict)),
);
// Create outlines root
let mut root_dict = IndexMap::new();
@ -1208,7 +1263,9 @@ mod tests {
// Should get both outlines before detecting the cycle
assert_eq!(outlines.len(), 2);
// Should have a cycle diagnostic
assert!(diags.iter().any(|d| d.message.contains("STRUCT_CIRCULAR_REF")));
assert!(diags
.iter()
.any(|d| d.message.contains("STRUCT_CIRCULAR_REF")));
}
#[test]
@ -1236,7 +1293,9 @@ mod tests {
let (outlines, diags) = parse_outlines(&resolver, Some(ObjRef::new(99, 0)), &pages);
assert_eq!(outlines.len(), 1);
assert_eq!(outlines[0].title, "<missing title>");
assert!(diags.iter().any(|d| d.message.contains("STRUCT_MISSING_KEY")));
assert!(diags
.iter()
.any(|d| d.message.contains("STRUCT_MISSING_KEY")));
}
#[test]
@ -1257,7 +1316,10 @@ mod tests {
action_dict.insert(intern("D"), PdfObject::Array(Box::new(goto_dest)));
let mut outline_dict = IndexMap::new();
outline_dict.insert(intern("Title"), PdfObject::String(Box::new(b"GoTo Test".to_vec())));
outline_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"GoTo Test".to_vec())),
);
outline_dict.insert(intern("A"), PdfObject::Dict(Box::new(action_dict)));
resolver.cache_object(ObjRef::new(100, 0), PdfObject::Dict(Box::new(outline_dict)));
@ -1289,10 +1351,16 @@ mod tests {
// Create an outline with /A /URI action
let mut action_dict = IndexMap::new();
action_dict.insert(intern("S"), PdfObject::Name(intern("URI")));
action_dict.insert(intern("URI"), PdfObject::String(Box::new(b"https://example.com".to_vec())));
action_dict.insert(
intern("URI"),
PdfObject::String(Box::new(b"https://example.com".to_vec())),
);
let mut outline_dict = IndexMap::new();
outline_dict.insert(intern("Title"), PdfObject::String(Box::new(b"External Link".to_vec())));
outline_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"External Link".to_vec())),
);
outline_dict.insert(intern("A"), PdfObject::Dict(Box::new(action_dict)));
resolver.cache_object(ObjRef::new(100, 0), PdfObject::Dict(Box::new(outline_dict)));
@ -1306,7 +1374,9 @@ mod tests {
assert_eq!(outlines.len(), 1);
assert_eq!(outlines[0].title, "External Link");
assert_eq!(outlines[0].dest_page, None);
assert!(diags.iter().any(|d| d.message.contains("STRUCT_NON_GOTO_OUTLINE")));
assert!(diags
.iter()
.any(|d| d.message.contains("STRUCT_NON_GOTO_OUTLINE")));
}
#[test]
@ -1316,7 +1386,10 @@ mod tests {
// Create an outline with a named destination (string instead of page ref)
let mut outline_dict = IndexMap::new();
outline_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Named Dest".to_vec())));
outline_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Named Dest".to_vec())),
);
outline_dict.insert(intern("Dest"), PdfObject::Name(intern("Chapter1")));
resolver.cache_object(ObjRef::new(100, 0), PdfObject::Dict(Box::new(outline_dict)));
@ -1329,7 +1402,9 @@ mod tests {
let (outlines, diags) = parse_outlines(&resolver, Some(ObjRef::new(99, 0)), &pages);
assert_eq!(outlines.len(), 1);
assert_eq!(outlines[0].dest_page, None);
assert!(diags.iter().any(|d| d.message.contains("STRUCT_UNRESOLVED_DESTINATION")));
assert!(diags
.iter()
.any(|d| d.message.contains("STRUCT_UNRESOLVED_DESTINATION")));
}
#[test]
@ -1383,7 +1458,10 @@ mod tests {
// Create an outline with /XYZ destination where left/top/zoom are null
let mut outline_dict = IndexMap::new();
outline_dict.insert(intern("Title"), PdfObject::String(Box::new(b"Null Values".to_vec())));
outline_dict.insert(
intern("Title"),
PdfObject::String(Box::new(b"Null Values".to_vec())),
);
outline_dict.insert(intern("Dest"), {
let mut dest = Vec::new();
dest.push(PdfObject::Ref(ObjRef::new(10, 0)));

@ -10,10 +10,10 @@
//! - Inheritance is "last-write-wins" at each level (child overrides parent)
//! - If a required inheritable attribute is missing and not inherited, use a safe default
use crate::parser::object::{ObjRef, PdfObject, PdfDict, intern};
use crate::diagnostics::{DiagCode, Diagnostic};
use crate::parser::object::{intern, ObjRef, PdfDict, PdfObject};
use crate::parser::resources::{merge_resources, ResourceDict};
use crate::parser::xref::XrefResolver;
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::parser::resources::{ResourceDict, merge_resources};
use std::collections::HashSet;
use std::sync::Arc;
@ -156,7 +156,10 @@ fn count_pages_walk(
if depth > MAX_PAGES_DEPTH {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructDepthExceeded,
format!("STRUCT_DEPTH_EXCEEDED: /Pages nesting exceeds {} levels", MAX_PAGES_DEPTH),
format!(
"STRUCT_DEPTH_EXCEEDED: /Pages nesting exceeds {} levels",
MAX_PAGES_DEPTH
),
));
return 0;
}
@ -165,7 +168,10 @@ fn count_pages_walk(
if visited.contains(&node_ref) {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructCircularRef,
format!("STRUCT_CIRCULAR_REF: /Pages node {} already visited", node_ref),
format!(
"STRUCT_CIRCULAR_REF: /Pages node {} already visited",
node_ref
),
));
return 0;
}
@ -190,9 +196,7 @@ fn count_pages_walk(
}
};
let node_type = dict.get("Type")
.and_then(|o| o.as_name())
.unwrap_or("");
let node_type = dict.get("Type").and_then(|o| o.as_name()).unwrap_or("");
match node_type {
"Page" => {
@ -226,7 +230,8 @@ fn count_pages_walk(
PdfObject::Ref(ref_) => *ref_,
PdfObject::Dict(_) => {
// Direct dictionary - count as a page if it's a /Page
let kid_type = kid.as_dict()
let kid_type = kid
.as_dict()
.and_then(|d| d.get("Type"))
.and_then(|o| o.as_name())
.unwrap_or("");
@ -241,7 +246,7 @@ fn count_pages_walk(
}
total
}
_ => 0
_ => 0,
}
}
@ -297,7 +302,8 @@ pub fn flatten_page_tree(resolver: &XrefResolver, pages_ref: ObjRef) -> Result<V
};
// Extract /Count if present (for validation later)
let declared_count = pages_obj.as_dict()
let declared_count = pages_obj
.as_dict()
.and_then(|d| d.get("Count"))
.and_then(|o| o.as_int())
.unwrap_or(0);
@ -359,7 +365,10 @@ fn walk_page_tree(
if depth > MAX_PAGES_DEPTH {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructDepthExceeded,
format!("STRUCT_DEPTH_EXCEEDED: /Pages nesting exceeds {} levels", MAX_PAGES_DEPTH),
format!(
"STRUCT_DEPTH_EXCEEDED: /Pages nesting exceeds {} levels",
MAX_PAGES_DEPTH
),
));
return Vec::new();
}
@ -373,9 +382,7 @@ fn walk_page_tree(
};
// Check /Type to determine if this is /Pages or /Page
let node_type = dict.get("Type")
.and_then(|o| o.as_name())
.unwrap_or("");
let node_type = dict.get("Type").and_then(|o| o.as_name()).unwrap_or("");
// Save the inherited state before merging this node's attributes
let parent_inherited = inherited.clone();
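
Cloning the inherited state before merging is what lets each child override a value without affecting its siblings. A minimal sketch of that child-overrides-parent rule follows. The `Inherited` struct and `merge` function are simplified, hypothetical stand-ins for `InheritedAttrs` and `merge_inherited_attrs`.

```rust
#[derive(Clone, Debug, Default, PartialEq)]
struct Inherited {
    media_box: Option<[f64; 4]>,
    rotate: i32,
}

/// Apply a node's own values over what it inherited.
/// Keys the node does not set keep the parent's value.
fn merge(
    parent: &Inherited,
    own_media_box: Option<[f64; 4]>,
    own_rotate: Option<i32>,
) -> Inherited {
    let mut merged = parent.clone();
    if let Some(mb) = own_media_box {
        merged.media_box = Some(mb);
    }
    if let Some(r) = own_rotate {
        merged.rotate = r;
    }
    merged
}
```

Returning a new value instead of mutating the parent keeps the parent's state intact for the next sibling in `/Kids`.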
@ -423,7 +430,10 @@ fn walk_page_tree(
if visited.contains(ref_) {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructCircularRef,
format!("STRUCT_CIRCULAR_REF: /Pages node {} already visited", ref_),
format!(
"STRUCT_CIRCULAR_REF: /Pages node {} already visited",
ref_
),
));
continue;
}
@ -434,7 +444,10 @@ fn walk_page_tree(
Err(e) => {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructMissingKey,
format!("STRUCT_MISSING_KEY: Failed to resolve /Kids entry {}: {}", ref_, e),
format!(
"STRUCT_MISSING_KEY: Failed to resolve /Kids entry {}: {}",
ref_, e
),
));
continue;
}
@ -479,7 +492,11 @@ fn walk_page_tree(
///
/// Per PDF spec 7.7.3.4, only MediaBox, CropBox, Resources, and Rotate are inheritable.
/// This function updates the `inherited` accumulator with any values present in `dict`.
fn merge_inherited_attrs(dict: &PdfDict, inherited: &mut InheritedAttrs, diagnostics: &mut Vec<Diagnostic>) {
fn merge_inherited_attrs(
dict: &PdfDict,
inherited: &mut InheritedAttrs,
diagnostics: &mut Vec<Diagnostic>,
) {
// MediaBox (inheritable)
if let Some(mb) = parse_rect(dict.get("MediaBox")) {
inherited.media_box = Some(mb);
@ -501,7 +518,10 @@ fn merge_inherited_attrs(dict: &PdfDict, inherited: &mut InheritedAttrs, diagnos
if rot % 90 != 0 {
diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::PageInvalidRotate,
format!("STRUCT_INVALID_ROTATE: /Rotate value {} is not a multiple of 90", rot),
format!(
"STRUCT_INVALID_ROTATE: /Rotate value {} is not a multiple of 90",
rot
),
));
// Round down to a multiple of 90 (floor toward negative infinity)
inherited.rotate = ((rot as f64 / 90.0).floor() as i64 * 90) as i32;
@ -515,7 +535,11 @@ fn merge_inherited_attrs(dict: &PdfDict, inherited: &mut InheritedAttrs, diagnos
///
/// This function extracts all page-level attributes, substituting defaults for
/// missing values and emitting diagnostics where appropriate.
fn build_page_dict(page_obj: &PdfObject, inherited: &InheritedAttrs, diagnostics: &mut Vec<Diagnostic>) -> PageDict {
fn build_page_dict(
page_obj: &PdfObject,
inherited: &InheritedAttrs,
diagnostics: &mut Vec<Diagnostic>,
) -> PageDict {
let dict = match page_obj.as_dict() {
Some(d) => d,
None => {
@ -578,7 +602,10 @@ fn build_page_dict(page_obj: &PdfObject, inherited: &InheritedAttrs, diagnostics
diagnostics.push(Diagnostic::with_dynamic(
DiagCode::PageInvalidRotate,
0,
format!("Page {} has /Rotate value {} (not a multiple of 90)", obj_ref, rot),
format!(
"Page {} has /Rotate value {} (not a multiple of 90)",
obj_ref, rot
),
));
// Round down to a multiple of 90 (floor toward negative infinity)
rotate = ((rot as f64 / 90.0).floor() as i64 * 90) as i32;
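
Note on the hunk above: the /Rotate fallback floors toward negative infinity rather than rounding, so 135 becomes 90 and -45 becomes -90. The sketch below reproduces that arithmetic with integer `div_euclid` instead of the float round trip. The helper name `floor_to_quarter_turn` is hypothetical and does not appear in the commit.

```rust
/// Hypothetical helper (not in the commit): floor a /Rotate value to a
/// multiple of 90, rounding toward negative infinity.
fn floor_to_quarter_turn(rot: i64) -> i32 {
    // With a positive divisor, div_euclid always yields a non-negative
    // remainder, so the quotient is the mathematical floor of rot / 90.
    (rot.div_euclid(90) * 90) as i32
}
```

Like the original expression, this does not reduce the result modulo 360; it only snaps the value onto the 90-degree grid.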
@ -602,20 +629,20 @@ fn build_page_dict(page_obj: &PdfObject, inherited: &InheritedAttrs, diagnostics
// Annots: collect array of references
let annots = if let Some(PdfObject::Array(arr)) = dict.get("Annots") {
arr.iter()
.filter_map(|o| o.as_ref())
.collect()
arr.iter().filter_map(|o| o.as_ref()).collect()
} else {
Vec::new()
};
// ActualText (from tagged PDF)
let actual_text = dict.get("ActualText")
let actual_text = dict
.get("ActualText")
.and_then(|o| o.as_string())
.and_then(|s| String::from_utf8(s.to_vec()).ok());
// Lang (language identifier)
let lang = dict.get("Lang")
let lang = dict
.get("Lang")
.and_then(|o| o.as_string())
.and_then(|s| String::from_utf8(s.to_vec()).ok());
@ -623,7 +650,8 @@ fn build_page_dict(page_obj: &PdfObject, inherited: &InheritedAttrs, diagnostics
let aa = dict.get("AA").cloned();
// StructParents: for StructTree MCID resolution (Phase 7.1.4)
let struct_parents = dict.get("StructParents")
let struct_parents = dict
.get("StructParents")
.and_then(|o| o.as_int())
.map(|i| i as i32);
@ -654,10 +682,22 @@ fn parse_rect(obj: Option<&PdfObject>) -> Option<[f64; 4]> {
return None;
}
let x1 = arr[0].as_int().map(|i| i as f64).or_else(|| arr[0].as_real())?;
let y1 = arr[1].as_int().map(|i| i as f64).or_else(|| arr[1].as_real())?;
let x2 = arr[2].as_int().map(|i| i as f64).or_else(|| arr[2].as_real())?;
let y2 = arr[3].as_int().map(|i| i as f64).or_else(|| arr[3].as_real())?;
let x1 = arr[0]
.as_int()
.map(|i| i as f64)
.or_else(|| arr[0].as_real())?;
let y1 = arr[1]
.as_int()
.map(|i| i as f64)
.or_else(|| arr[1].as_real())?;
let x2 = arr[2]
.as_int()
.map(|i| i as f64)
.or_else(|| arr[2].as_real())?;
let y2 = arr[3]
.as_int()
.map(|i| i as f64)
.or_else(|| arr[3].as_real())?;
Some([x1, y1, x2, y2])
}
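
Note on the hunk above: each coordinate in `parse_rect` accepts either an integer or a real and widens it to `f64`. One way to express that once instead of four times is a small coercion helper. This is a sketch only: the `Obj` enum and the `number` helper are hypothetical simplifications of `PdfObject`, and the exact-length check is an assumption about the code outside the hunk.

```rust
/// Hypothetical, simplified object type; the real PdfObject has more variants.
enum Obj {
    Integer(i64),
    Real(f64),
    Name(&'static str),
}

/// Accept either numeric variant as f64; anything else is not a number.
fn number(obj: &Obj) -> Option<f64> {
    match obj {
        Obj::Integer(i) => Some(*i as f64),
        Obj::Real(r) => Some(*r),
        _ => None,
    }
}

/// Parse a four-element rectangle array, returning None on any non-number.
fn parse_rect(arr: &[Obj]) -> Option<[f64; 4]> {
    // Assumption: a rectangle must have exactly four entries.
    if arr.len() != 4 {
        return None;
    }
    Some([
        number(&arr[0])?,
        number(&arr[1])?,
        number(&arr[2])?,
        number(&arr[3])?,
    ])
}
```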
@ -673,11 +713,7 @@ fn parse_contents_array(obj: Option<&PdfObject>) -> Vec<ObjRef> {
match obj {
None => Vec::new(),
Some(PdfObject::Ref(ref_)) => vec![*ref_],
Some(PdfObject::Array(arr)) => {
arr.iter()
.filter_map(|o| o.as_ref())
.collect()
}
Some(PdfObject::Array(arr)) => arr.iter().filter_map(|o| o.as_ref()).collect(),
Some(PdfObject::Stream(_)) => {
// Direct stream is illegal - should be indirect
// Return empty; diagnostics would be emitted by parser
@ -771,7 +807,10 @@ mod tests {
#[test]
fn test_parse_contents_single_ref() {
let ref_obj = PdfObject::Ref(ObjRef::new(10, 0));
assert_eq!(parse_contents_array(Some(&ref_obj)), vec![ObjRef::new(10, 0)]);
assert_eq!(
parse_contents_array(Some(&ref_obj)),
vec![ObjRef::new(10, 0)]
);
}
#[test]
@ -780,10 +819,10 @@ mod tests {
PdfObject::Ref(ObjRef::new(10, 0)),
PdfObject::Ref(ObjRef::new(11, 0)),
]));
assert_eq!(parse_contents_array(Some(&arr)), vec![
ObjRef::new(10, 0),
ObjRef::new(11, 0),
]);
assert_eq!(
parse_contents_array(Some(&arr)),
vec![ObjRef::new(10, 0), ObjRef::new(11, 0),]
);
}
#[test]
@ -831,13 +870,16 @@ mod tests {
let mut grandparent_dict = grandparent.as_dict().unwrap().clone();
grandparent_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(parent_ref)]))
PdfObject::Array(Box::new(vec![PdfObject::Ref(parent_ref)])),
);
let mut parent_dict = parent.as_dict().unwrap().clone();
parent_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(page1_ref), PdfObject::Ref(page2_ref)]))
PdfObject::Array(Box::new(vec![
PdfObject::Ref(page1_ref),
PdfObject::Ref(page2_ref),
])),
);
resolver.cache_object(grandparent_ref, PdfObject::Dict(Box::new(grandparent_dict)));
@ -861,11 +903,7 @@ mod tests {
let pages_ref = ObjRef::new(1, 0);
// /Pages with no MediaBox
let pages = make_pages_dict(
vec![make_page_dict(None, None)],
1,
None,
);
let pages = make_pages_dict(vec![make_page_dict(None, None)], 1, None);
resolver.cache_object(pages_ref, pages);
@ -960,7 +998,7 @@ mod tests {
// /Count says 5, but we only have 1 page
let pages = make_pages_dict(
vec![make_page_dict(Some(DEFAULT_MEDIABOX), None)],
5, // Wrong count
5, // Wrong count
Some(DEFAULT_MEDIABOX),
);
@ -992,22 +1030,31 @@ mod tests {
// Create child2 with a valid page and a reference to child1 (creating cycle)
let mut child2_dict = PdfDict::new();
child2_dict.insert(intern("Type"), PdfObject::Name(intern("Pages")));
child2_dict.insert(intern("Kids"), PdfObject::Array(Box::new(vec![
PdfObject::Ref(page_ref),
PdfObject::Ref(child1_ref), // This will cause a cycle
])));
child2_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![
PdfObject::Ref(page_ref),
PdfObject::Ref(child1_ref), // This will cause a cycle
])),
);
child2_dict.insert(intern("Count"), PdfObject::Integer(2));
// Create child1 that references child2 (the other half of the cycle)
let mut child1_dict = PdfDict::new();
child1_dict.insert(intern("Type"), PdfObject::Name(intern("Pages")));
child1_dict.insert(intern("Kids"), PdfObject::Array(Box::new(vec![PdfObject::Ref(child2_ref)])));
child1_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(child2_ref)])),
);
child1_dict.insert(intern("Count"), PdfObject::Integer(1));
// Create parent that references child1
let mut parent_dict = PdfDict::new();
parent_dict.insert(intern("Type"), PdfObject::Name(intern("Pages")));
parent_dict.insert(intern("Kids"), PdfObject::Array(Box::new(vec![PdfObject::Ref(child1_ref)])));
parent_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(child1_ref)])),
);
parent_dict.insert(intern("Count"), PdfObject::Integer(2));
parent_dict.insert(intern("MediaBox"), make_rect_array(DEFAULT_MEDIABOX));
@ -1043,7 +1090,10 @@ mod tests {
grandparent.insert(intern("Type"), PdfObject::Name(intern("Pages")));
grandparent.insert(intern("Kids"), PdfObject::Array(Box::new(vec![])));
grandparent.insert(intern("Count"), PdfObject::Integer(2));
grandparent.insert(intern("Resources"), PdfObject::Dict(Box::new(grandparent_resources)));
grandparent.insert(
intern("Resources"),
PdfObject::Dict(Box::new(grandparent_resources)),
);
grandparent.insert(intern("MediaBox"), make_rect_array(DEFAULT_MEDIABOX));
// Parent /Pages adds /F2
@ -1057,7 +1107,10 @@ mod tests {
parent.insert(intern("Type"), PdfObject::Name(intern("Pages")));
parent.insert(intern("Kids"), PdfObject::Array(Box::new(vec![])));
parent.insert(intern("Count"), PdfObject::Integer(2));
parent.insert(intern("Resources"), PdfObject::Dict(Box::new(parent_resources)));
parent.insert(
intern("Resources"),
PdfObject::Dict(Box::new(parent_resources)),
);
// Page 1 adds /F3 and overrides /F1
let page1_ref = ObjRef::new(3, 0);
@ -1070,7 +1123,10 @@ mod tests {
let mut page1 = PdfDict::new();
page1.insert(intern("Type"), PdfObject::Name(intern("Page")));
page1.insert(intern("MediaBox"), make_rect_array(DEFAULT_MEDIABOX));
page1.insert(intern("Resources"), PdfObject::Dict(Box::new(page1_resources)));
page1.insert(
intern("Resources"),
PdfObject::Dict(Box::new(page1_resources)),
);
// Page 2 has no resources (should inherit all)
let page2_ref = ObjRef::new(4, 0);
@ -1082,13 +1138,16 @@ mod tests {
let mut grandparent_dict = grandparent.clone();
grandparent_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(parent_ref)]))
PdfObject::Array(Box::new(vec![PdfObject::Ref(parent_ref)])),
);
let mut parent_dict = parent.clone();
parent_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(page1_ref), PdfObject::Ref(page2_ref)]))
PdfObject::Array(Box::new(vec![
PdfObject::Ref(page1_ref),
PdfObject::Ref(page2_ref),
])),
);
resolver.cache_object(grandparent_ref, PdfObject::Dict(Box::new(grandparent_dict)));
@ -1103,18 +1162,39 @@ mod tests {
// Page 1: should have F1 (overridden), F2 (inherited), F3 (new), Im1 (inherited)
assert_eq!(pages_vec[0].resources.fonts.len(), 3);
assert_eq!(pages_vec[0].resources.fonts.get(&intern("F1")), Some(&ObjRef::new(15, 0))); // Overridden
assert_eq!(pages_vec[0].resources.fonts.get(&intern("F2")), Some(&ObjRef::new(11, 0))); // Inherited from parent
assert_eq!(pages_vec[0].resources.fonts.get(&intern("F3")), Some(&ObjRef::new(12, 0))); // New on page
assert_eq!(
pages_vec[0].resources.fonts.get(&intern("F1")),
Some(&ObjRef::new(15, 0))
); // Overridden
assert_eq!(
pages_vec[0].resources.fonts.get(&intern("F2")),
Some(&ObjRef::new(11, 0))
); // Inherited from parent
assert_eq!(
pages_vec[0].resources.fonts.get(&intern("F3")),
Some(&ObjRef::new(12, 0))
); // New on page
assert_eq!(pages_vec[0].resources.xobjects.len(), 1);
assert_eq!(pages_vec[0].resources.xobjects.get(&intern("Im1")), Some(&ObjRef::new(20, 0))); // Inherited from grandparent
assert_eq!(
pages_vec[0].resources.xobjects.get(&intern("Im1")),
Some(&ObjRef::new(20, 0))
); // Inherited from grandparent
// Page 2: should have all inherited resources (F1, F2, Im1)
assert_eq!(pages_vec[1].resources.fonts.len(), 2);
assert_eq!(pages_vec[1].resources.fonts.get(&intern("F1")), Some(&ObjRef::new(10, 0))); // From grandparent
assert_eq!(pages_vec[1].resources.fonts.get(&intern("F2")), Some(&ObjRef::new(11, 0))); // From parent
assert_eq!(
pages_vec[1].resources.fonts.get(&intern("F1")),
Some(&ObjRef::new(10, 0))
); // From grandparent
assert_eq!(
pages_vec[1].resources.fonts.get(&intern("F2")),
Some(&ObjRef::new(11, 0))
); // From parent
assert_eq!(pages_vec[1].resources.xobjects.len(), 1);
assert_eq!(pages_vec[1].resources.xobjects.get(&intern("Im1")), Some(&ObjRef::new(20, 0))); // From grandparent
assert_eq!(
pages_vec[1].resources.xobjects.get(&intern("Im1")),
Some(&ObjRef::new(20, 0))
); // From grandparent
}
#[test]
@ -1134,7 +1214,10 @@ mod tests {
parent.insert(intern("Type"), PdfObject::Name(intern("Pages")));
parent.insert(intern("Kids"), PdfObject::Array(Box::new(vec![])));
parent.insert(intern("Count"), PdfObject::Integer(2));
parent.insert(intern("Resources"), PdfObject::Dict(Box::new(parent_resources)));
parent.insert(
intern("Resources"),
PdfObject::Dict(Box::new(parent_resources)),
);
parent.insert(intern("MediaBox"), make_rect_array(DEFAULT_MEDIABOX));
// Two pages without /Resources
@ -1152,7 +1235,10 @@ mod tests {
let mut parent_dict = parent.clone();
parent_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(page1_ref), PdfObject::Ref(page2_ref)]))
PdfObject::Array(Box::new(vec![
PdfObject::Ref(page1_ref),
PdfObject::Ref(page2_ref),
])),
);
resolver.cache_object(parent_ref, PdfObject::Dict(Box::new(parent_dict)));
@ -1166,13 +1252,22 @@ mod tests {
// Both pages should have inherited F1 from parent
assert_eq!(pages_vec[0].resources.fonts.len(), 1);
assert_eq!(pages_vec[0].resources.fonts.get(&intern("F1")), Some(&ObjRef::new(10, 0)));
assert_eq!(
pages_vec[0].resources.fonts.get(&intern("F1")),
Some(&ObjRef::new(10, 0))
);
assert_eq!(pages_vec[1].resources.fonts.len(), 1);
assert_eq!(pages_vec[1].resources.fonts.get(&intern("F1")), Some(&ObjRef::new(10, 0)));
assert_eq!(
pages_vec[1].resources.fonts.get(&intern("F1")),
Some(&ObjRef::new(10, 0))
);
// Verify Arc pointer sharing: when pages have no resources,
// they should share the same Arc instance (memory efficiency)
assert!(Arc::ptr_eq(&pages_vec[0].resources, &pages_vec[1].resources));
assert!(Arc::ptr_eq(
&pages_vec[0].resources,
&pages_vec[1].resources
));
}
#[test]
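
Note on the hunk above: the `Arc::ptr_eq` assertion checks that pages without their own /Resources reuse the parent's allocation instead of copying it. A minimal sketch of that pattern follows, assuming a hypothetical `Resources` alias over `BTreeMap<String, u32>` and a hypothetical `resources_for_page` function. The commit's own types use interned names and `ObjRef` values.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical stand-in for the merged resource dictionary.
type Resources = BTreeMap<String, u32>;

/// Return the inherited Arc unchanged when the page adds nothing; otherwise
/// allocate a new map where the page's entries override inherited ones.
fn resources_for_page(inherited: &Arc<Resources>, own: Option<&Resources>) -> Arc<Resources> {
    match own {
        // Cloning an Arc only bumps the reference count, so every page
        // taking this branch points at the same allocation.
        None => Arc::clone(inherited),
        Some(extra) => {
            let mut merged = (**inherited).clone();
            for (key, value) in extra {
                // Last write wins: the page-level entry replaces the inherited one.
                merged.insert(key.clone(), *value);
            }
            Arc::new(merged)
        }
    }
}
```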
@ -1187,7 +1282,10 @@ mod tests {
root.insert(intern("Type"), PdfObject::Name(intern("Pages")));
root.insert(intern("Kids"), PdfObject::Array(Box::new(vec![])));
root.insert(intern("Count"), PdfObject::Integer(1));
root.insert(intern("Resources"), PdfObject::Dict(Box::new(root_resources)));
root.insert(
intern("Resources"),
PdfObject::Dict(Box::new(root_resources)),
);
root.insert(intern("MediaBox"), make_rect_array(DEFAULT_MEDIABOX));
// Page without /Resources
@ -1200,7 +1298,7 @@ mod tests {
let mut root_dict = root.clone();
root_dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Ref(page_ref)]))
PdfObject::Array(Box::new(vec![PdfObject::Ref(page_ref)])),
);
resolver.cache_object(root_ref, PdfObject::Dict(Box::new(root_dict)));
@ -1253,7 +1351,10 @@ impl<'a> LazyPageIter<'a> {
/// Create a new lazy page iterator starting from the given /Pages reference.
///
/// This resolves the root /Pages node and initializes the traversal stack.
pub fn new(resolver: &'a XrefResolver, pages_ref: ObjRef) -> std::result::Result<Self, Vec<Diagnostic>> {
pub fn new(
resolver: &'a XrefResolver,
pages_ref: ObjRef,
) -> std::result::Result<Self, Vec<Diagnostic>> {
let mut visited = HashSet::new();
let mut diagnostics = Vec::new();
@ -1309,7 +1410,10 @@ impl<'a> Iterator for LazyPageIter<'a> {
if self.stack.len() > MAX_PAGES_DEPTH as usize {
self.diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructDepthExceeded,
format!("STRUCT_DEPTH_EXCEEDED: /Pages nesting exceeds {} levels", MAX_PAGES_DEPTH),
format!(
"STRUCT_DEPTH_EXCEEDED: /Pages nesting exceeds {} levels",
MAX_PAGES_DEPTH
),
));
continue;
}
@ -1322,9 +1426,7 @@ impl<'a> Iterator for LazyPageIter<'a> {
}
};
let node_type = dict.get("Type")
.and_then(|o| o.as_name())
.unwrap_or("");
let node_type = dict.get("Type").and_then(|o| o.as_name()).unwrap_or("");
// Save the inherited state before merging this node's attributes
let parent_inherited = inherited.clone();
@ -1369,7 +1471,11 @@ impl<'a> Iterator for LazyPageIter<'a> {
// We need to push kids[kid_idx+1..] first, then process kid at kid_idx
if kid_idx + 1 < kids_array.len() {
// Clone node before moving it to avoid borrow checker error
self.stack.push((node.clone(), pages_parent_inherited.clone(), kid_idx + 1));
self.stack.push((
node.clone(),
pages_parent_inherited.clone(),
kid_idx + 1,
));
}
// Push the current kid onto stack
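
Note on the hunk above: re-pushing the parent with `kid_idx + 1` before pushing the current kid is what preserves document order on an explicit stack. A minimal sketch of that ordering trick follows, using a hypothetical `Tree` enum and `LeafIter` type. The real iterator also carries inherited attributes and a visited set.

```rust
/// Hypothetical tree: a leaf with a label, or an inner node with children.
enum Tree {
    Leaf(u32),
    Inner(Vec<Tree>),
}

/// Depth-first leaf iterator driven by an explicit stack of (node, next kid index).
struct LeafIter<'a> {
    stack: Vec<(&'a Tree, usize)>,
}

impl<'a> LeafIter<'a> {
    fn new(root: &'a Tree) -> Self {
        LeafIter { stack: vec![(root, 0)] }
    }
}

impl<'a> Iterator for LeafIter<'a> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        while let Some((node, idx)) = self.stack.pop() {
            match node {
                Tree::Leaf(label) => return Some(*label),
                Tree::Inner(kids) => {
                    if idx < kids.len() {
                        // Re-push the parent first so it resumes at the next
                        // kid only after the current kid is fully visited.
                        if idx + 1 < kids.len() {
                            self.stack.push((node, idx + 1));
                        }
                        self.stack.push((&kids[idx], 0));
                    }
                }
            }
        }
        None
    }
}
```

Because the stack is last-in, first-out, the current kid is popped before the re-pushed parent, which yields leaves left to right.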
@ -1383,7 +1489,10 @@ impl<'a> Iterator for LazyPageIter<'a> {
if self.visited.contains(ref_) {
self.diagnostics.push(Diagnostic::with_dynamic_no_offset(
DiagCode::StructCircularRef,
format!("STRUCT_CIRCULAR_REF: /Pages node {} already visited", ref_),
format!(
"STRUCT_CIRCULAR_REF: /Pages node {} already visited",
ref_
),
));
inherited = parent_inherited;
continue;
@ -1445,12 +1554,15 @@ mod proptests {
dict.insert(intern("Kids"), PdfObject::Array(Box::new(kids)));
dict.insert(intern("Count"), PdfObject::Integer(count));
if let Some(mb) = media_box {
dict.insert(intern("MediaBox"), PdfObject::Array(Box::new(vec![
PdfObject::Real(mb[0]),
PdfObject::Real(mb[1]),
PdfObject::Real(mb[2]),
PdfObject::Real(mb[3]),
])));
dict.insert(
intern("MediaBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Real(mb[0]),
PdfObject::Real(mb[1]),
PdfObject::Real(mb[2]),
PdfObject::Real(mb[3]),
])),
);
}
PdfObject::Dict(Box::new(dict))
}
@ -1460,12 +1572,15 @@ mod proptests {
let mut dict = PdfDict::new();
dict.insert(intern("Type"), PdfObject::Name(intern("Page")));
if let Some(mb) = media_box {
dict.insert(intern("MediaBox"), PdfObject::Array(Box::new(vec![
PdfObject::Real(mb[0]),
PdfObject::Real(mb[1]),
PdfObject::Real(mb[2]),
PdfObject::Real(mb[3]),
])));
dict.insert(
intern("MediaBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Real(mb[0]),
PdfObject::Real(mb[1]),
PdfObject::Real(mb[2]),
PdfObject::Real(mb[3]),
])),
);
}
if let Some(rot) = rotate {
dict.insert(intern("Rotate"), PdfObject::Integer(rot));
@ -1485,36 +1600,46 @@ mod proptests {
prop::option::of(-1000i64..1000),
prop::option::of(arb_rect()),
prop::option::of(arb_rect()),
).prop_map(|(media_box, rotate, crop_box, bleed_box)| {
let mut dict = PdfDict::new();
dict.insert(intern("Type"), PdfObject::Name(intern("Page")));
dict.insert(intern("MediaBox"), PdfObject::Array(Box::new(vec![
PdfObject::Real(media_box[0]),
PdfObject::Real(media_box[1]),
PdfObject::Real(media_box[2]),
PdfObject::Real(media_box[3]),
])));
if let Some(rot) = rotate {
dict.insert(intern("Rotate"), PdfObject::Integer(rot));
}
if let Some(cb) = crop_box {
dict.insert(intern("CropBox"), PdfObject::Array(Box::new(vec![
PdfObject::Real(cb[0]),
PdfObject::Real(cb[1]),
PdfObject::Real(cb[2]),
PdfObject::Real(cb[3]),
])));
}
if let Some(bb) = bleed_box {
dict.insert(intern("BleedBox"), PdfObject::Array(Box::new(vec![
PdfObject::Real(bb[0]),
PdfObject::Real(bb[1]),
PdfObject::Real(bb[2]),
PdfObject::Real(bb[3]),
])));
}
dict
})
)
.prop_map(|(media_box, rotate, crop_box, bleed_box)| {
let mut dict = PdfDict::new();
dict.insert(intern("Type"), PdfObject::Name(intern("Page")));
dict.insert(
intern("MediaBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Real(media_box[0]),
PdfObject::Real(media_box[1]),
PdfObject::Real(media_box[2]),
PdfObject::Real(media_box[3]),
])),
);
if let Some(rot) = rotate {
dict.insert(intern("Rotate"), PdfObject::Integer(rot));
}
if let Some(cb) = crop_box {
dict.insert(
intern("CropBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Real(cb[0]),
PdfObject::Real(cb[1]),
PdfObject::Real(cb[2]),
PdfObject::Real(cb[3]),
])),
);
}
if let Some(bb) = bleed_box {
dict.insert(
intern("BleedBox"),
PdfObject::Array(Box::new(vec![
PdfObject::Real(bb[0]),
PdfObject::Real(bb[1]),
PdfObject::Real(bb[2]),
PdfObject::Real(bb[3]),
])),
);
}
dict
})
}
/// Strategy to generate /Pages dictionaries with direct /Kids.
@ -1527,9 +1652,10 @@ mod proptests {
dict.insert(intern("Count"), PdfObject::Integer(0));
if let Some(page) = maybe_page {
dict.insert(intern("Kids"), PdfObject::Array(Box::new(vec![
PdfObject::Dict(Box::new(page))
])));
dict.insert(
intern("Kids"),
PdfObject::Array(Box::new(vec![PdfObject::Dict(Box::new(page))])),
);
dict.insert(intern("Count"), PdfObject::Integer(1));
} else {
dict.insert(intern("Kids"), PdfObject::Array(Box::new(vec![])));


@ -7,9 +7,9 @@
//! containing all resources from its ancestor /Pages nodes, with per-key
//! last-write-wins semantics at the page level.
use crate::parser::object::{ObjRef, PdfObject, PdfDict, intern};
use std::sync::Arc;
use crate::parser::object::{intern, ObjRef, PdfDict, PdfObject};
use indexmap::IndexMap;
use std::sync::Arc;
/// A merged resource dictionary for a page.
///
@ -290,8 +290,8 @@ mod tests {
assert_eq!(merged.fonts.len(), 3);
assert_eq!(merged.fonts.get(&intern("F1")), Some(&ObjRef::new(10, 0))); // Overridden
assert_eq!(merged.fonts.get(&intern("F2")), Some(&ObjRef::new(2, 0))); // Inherited
assert_eq!(merged.fonts.get(&intern("F3")), Some(&ObjRef::new(3, 0))); // New
assert_eq!(merged.fonts.get(&intern("F2")), Some(&ObjRef::new(2, 0))); // Inherited
assert_eq!(merged.fonts.get(&intern("F3")), Some(&ObjRef::new(3, 0))); // New
}
#[test]
@ -307,8 +307,14 @@ mod tests {
let merged = merge_resources(&ancestor, &PdfObject::Dict(Box::new(child_resources)));
assert_eq!(merged.xobjects.len(), 2);
assert_eq!(merged.xobjects.get(&intern("Im1")), Some(&ObjRef::new(5, 0)));
assert_eq!(merged.xobjects.get(&intern("Im2")), Some(&ObjRef::new(6, 0)));
assert_eq!(
merged.xobjects.get(&intern("Im1")),
Some(&ObjRef::new(5, 0))
);
assert_eq!(
merged.xobjects.get(&intern("Im2")),
Some(&ObjRef::new(6, 0))
);
}
#[test]
@ -321,11 +327,14 @@ mod tests {
// Inline color space array: [/CalRGB << /Gamma [1 1 1] >>]
let mut gamma_arr = PdfDict::new();
gamma_arr.insert(intern("Gamma"), PdfObject::Array(Box::new(vec![
PdfObject::Integer(1),
PdfObject::Integer(1),
PdfObject::Integer(1),
])));
gamma_arr.insert(
intern("Gamma"),
PdfObject::Array(Box::new(vec![
PdfObject::Integer(1),
PdfObject::Integer(1),
PdfObject::Integer(1),
])),
);
child_cs.insert(
intern("CS1"),


@ -16,7 +16,7 @@
//! CI should run: `rg "expose_secret\(\)" crates/ --type rust` and fail the
//! build if any matches are found outside of these approved locations.
use secrecy::{SecretString, ExposeSecret};
use secrecy::{ExposeSecret, SecretString};
use sha2::{Digest, Sha256};
/// A fingerprint of a secret value for use in audit logs.
@ -91,7 +91,10 @@ mod tests {
fn test_fingerprint_display() {
let fp = SecretFingerprint::from_str("test");
let display = format!("{}", fp);
assert!(!display.contains("test"), "fingerprint doesn't contain secret");
assert!(
!display.contains("test"),
"fingerprint doesn't contain secret"
);
assert_eq!(display.len(), 64, "SHA-256 produces 64 hex chars");
}
}

File diff suppressed because it is too large (3 files)


@ -14,7 +14,7 @@
#![cfg(feature = "ocr")]
use crate::diagnostics::{Diagnostic, DiagCode};
use crate::diagnostics::{DiagCode, Diagnostic};
use image::{GrayImage, ImageBuffer, Luma};
use std::ffi::c_float;
@ -114,8 +114,8 @@ const DESKEW_MAX_RANGE_DEG: f64 = 15.0;
/// ```
pub fn deskew(image: &GrayImage) -> Result<(GrayImage, f64, Vec<Diagnostic>)> {
use leptonica_plumbing::leptonica_sys::{
pixDestroy, pixFindSkewAndDeskew, pixGetWidth, pixGetHeight, pixGetDepth,
Pix, l_float32, l_int32,
l_float32, l_int32, pixDestroy, pixFindSkewAndDeskew, pixGetDepth, pixGetHeight,
pixGetWidth, Pix,
};
let mut diagnostics = Vec::new();
@ -157,7 +157,10 @@ pub fn deskew(image: &GrayImage) -> Result<(GrayImage, f64, Vec<Diagnostic>)> {
pixDestroy(pix);
diagnostics.push(Diagnostic::with_static_no_offset(
DiagCode::ImgDeskewOutOfRange,
format!("Skew angle {}° exceeds detection range (±{}°)", angle_deg, DESKEW_MAX_RANGE_DEG),
format!(
"Skew angle {}° exceeds detection range (±{}°)",
angle_deg, DESKEW_MAX_RANGE_DEG
),
));
return Ok((image.clone(), angle_deg, diagnostics));
}
@ -180,9 +183,7 @@ pub fn deskew(image: &GrayImage) -> Result<(GrayImage, f64, Vec<Diagnostic>)> {
///
/// Creates an 8-bit grayscale Pix from the image data.
fn grayimage_to_pix(image: &GrayImage) -> Result<*mut Pix> {
use leptonica_plumbing::leptonica_sys::{
pixCreate, pixDestroy, pixGetData, Pix,
};
use leptonica_plumbing::leptonica_sys::{pixCreate, pixDestroy, pixGetData, Pix};
use std::ptr;
let width = image.width() as i32;
@ -231,7 +232,7 @@ fn grayimage_to_pix(image: &GrayImage) -> Result<*mut Pix> {
/// Expects an 8-bit grayscale Pix.
fn pix_to_grayimage(pix: *mut Pix) -> Result<GrayImage> {
use leptonica_plumbing::leptonica_sys::{
pixGetData, pixGetWidth, pixGetHeight, pixGetDepth, Pix,
pixGetData, pixGetDepth, pixGetHeight, pixGetWidth, Pix,
};
unsafe {
@ -323,7 +324,9 @@ mod tests {
let (deskewed, angle, diagnostics) = deskew(&img).expect("Deskew failed");
assert!(angle.abs() < 0.1, "Angle should be near 0°, got {}", angle);
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgDeskewOutOfRange));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgDeskewOutOfRange));
}
#[test]
@ -343,7 +346,9 @@ mod tests {
// Check that the Pix was created successfully
unsafe {
use leptonica_plumbing::leptonica_sys::{pixGetWidth, pixGetHeight, pixGetDepth, pixDestroy};
use leptonica_plumbing::leptonica_sys::{
pixDestroy, pixGetDepth, pixGetHeight, pixGetWidth,
};
assert!(!pix.is_null(), "Pix pointer should not be null");
assert_eq!(pixGetWidth(pix) as u32, img.width());
@ -445,14 +450,24 @@ mod tests {
let (deskewed, angle, diagnostics) = deskew(&skewed).expect("Deskew failed");
// The detected angle should be close to 2 degrees
assert!((angle.abs() - 2.0).abs() < 0.5, "Detected angle {} should be close to 2°", angle);
assert!(
(angle.abs() - 2.0).abs() < 0.5,
"Detected angle {} should be close to 2°",
angle
);
// After deskewing, a second pass should detect near-zero skew
let (_, second_angle, _) = deskew(&deskewed).expect("Second deskew failed");
assert!(second_angle.abs() < 0.1, "Second pass should detect near-zero skew, got {}", second_angle);
assert!(
second_angle.abs() < 0.1,
"Second pass should detect near-zero skew, got {}",
second_angle
);
// No out-of-range diagnostic for 2 degrees
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgDeskewOutOfRange));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgDeskewOutOfRange));
}
#[test]
@ -462,7 +477,11 @@ mod tests {
let (deskewed, angle, diagnostics) = deskew(&skewed).expect("Deskew failed");
// Angle should be 0.0 because we skip deskewing for angles < 0.3 deg
assert_eq!(angle, 0.0, "Angle should be 0.0 for sub-threshold skew, got {}", angle);
assert_eq!(
angle, 0.0,
"Angle should be 0.0 for sub-threshold skew, got {}",
angle
);
// Image should be unchanged (same dimensions and pixels)
assert_eq!(deskewed.dimensions(), skewed.dimensions());
@ -479,8 +498,12 @@ mod tests {
let (deskewed, angle, diagnostics) = deskew(&skewed).expect("Deskew failed");
// Should emit the out-of-range diagnostic
assert!(diagnostics.iter().any(|d| d.code == DiagCode::ImgDeskewOutOfRange),
"Should emit IMG_DESKEW_OUT_OF_RANGE for 20-degree skew");
assert!(
diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgDeskewOutOfRange),
"Should emit IMG_DESKEW_OUT_OF_RANGE for 20-degree skew"
);
// Image dimensions should be preserved (may be different due to rotation padding,
// but should not be the original since pixFindSkewAndDeskew will attempt to rotate)
@ -722,8 +745,7 @@ mod tests {
// Helper to get sum from integral image
let get_sum = |integral: &[u64], x1: usize, y1: usize, x2: usize, y2: usize| -> u64 {
let w = width + 1;
integral[y2 * w + x2]
+ integral[y1 * w + x1]
integral[y2 * w + x2] + integral[y1 * w + x1]
- integral[y1 * w + x2]
- integral[y2 * w + x1]
};
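
Note on the hunk above: `get_sum` is the standard four-corner lookup on a summed-area table with a stride of `width + 1`. The sketch below builds such a table and queries it. The function names `integral_image` and `rect_sum`, and the half-open `[x1, x2) x [y1, y2)` coordinate convention, are assumptions for illustration rather than something the diff confirms.

```rust
/// Build a summed-area table with one extra row and column of zeros, so
/// entry (y, x) holds the sum of all pixels strictly above and to the left.
fn integral_image(pixels: &[u8], width: usize, height: usize) -> Vec<u64> {
    let w = width + 1;
    let mut integral = vec![0u64; w * (height + 1)];
    for y in 0..height {
        let mut row_sum = 0u64;
        for x in 0..width {
            row_sum += pixels[y * width + x] as u64;
            // Sum above this row plus the running sum of this row.
            integral[(y + 1) * w + (x + 1)] = integral[y * w + (x + 1)] + row_sum;
        }
    }
    integral
}

/// Sum of pixels in the half-open rectangle [x1, x2) x [y1, y2).
fn rect_sum(integral: &[u64], width: usize, x1: usize, y1: usize, x2: usize, y2: usize) -> u64 {
    let w = width + 1;
    // Both additions come first so the unsigned subtraction cannot underflow.
    integral[y2 * w + x2] + integral[y1 * w + x1]
        - integral[y1 * w + x2]
        - integral[y2 * w + x1]
}
```

Each query costs four lookups regardless of rectangle size, which is why local-threshold binarization commonly uses this structure.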
@ -827,7 +849,10 @@ mod tests {
/// let original: GrayImage = // ... load image
/// let (preprocessed, diagnostics) = preprocess(&original, ImageSource::PhysicalScan)?;
/// ```
pub fn preprocess(image: &GrayImage, source: ImageSource) -> Result<(GrayImage, Vec<Diagnostic>)> {
pub fn preprocess(
image: &GrayImage,
source: ImageSource,
) -> Result<(GrayImage, Vec<Diagnostic>)> {
let mut diagnostics = Vec::new();
let mut current = image.clone();
@ -951,7 +976,11 @@ mod tests {
for y in 0..100 {
for x in 0..100 {
let pixel = binary.get_pixel(x, y)[0];
assert!(pixel == 0 || pixel == 255, "Pixel should be 0 or 255, got {}", pixel);
assert!(
pixel == 0 || pixel == 255,
"Pixel should be 0 or 255, got {}",
pixel
);
}
}
@ -978,7 +1007,11 @@ mod tests {
for y in 0..100 {
for x in 0..100 {
let pixel = binary.get_pixel(x, y)[0];
assert!(pixel == 0 || pixel == 255, "Pixel should be 0 or 255, got {}", pixel);
assert!(
pixel == 0 || pixel == 255,
"Pixel should be 0 or 255, got {}",
pixel
);
}
}
}
@ -988,58 +1021,68 @@ mod tests {
// Create an image with salt-and-pepper noise
let mut img = GrayImage::from_pixel(100, 100, Luma([128]));
// Add some noise
img.put_pixel(50, 50, Luma([0])); // pepper
img.put_pixel(50, 50, Luma([0])); // pepper
img.put_pixel(51, 50, Luma([255])); // salt
img.put_pixel(50, 51, Luma([255])); // salt
img.put_pixel(51, 51, Luma([0])); // pepper
img.put_pixel(51, 51, Luma([0])); // pepper
let denoised = denoise_median(&img);
// The noisy pixels should be closer to 128 after median filtering
let center = denoised.get_pixel(50, 50)[0];
assert!(center > 64 && center < 192, "Denoised pixel should be near middle, got {}", center);
assert!(
center > 64 && center < 192,
"Denoised pixel should be near middle, got {}",
center
);
}
#[test]
fn test_preprocess_physical_scan() {
let img = create_horizontal_lines_image();
let (preprocessed, diagnostics) = preprocess(&img, ImageSource::PhysicalScan)
.expect("Preprocess failed");
let (preprocessed, diagnostics) =
preprocess(&img, ImageSource::PhysicalScan).expect("Preprocess failed");
// Should have border padding
assert_eq!(preprocessed.width(), img.width() + 20);
assert_eq!(preprocessed.height(), img.height() + 20);
// Diagnostics should not have errors
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgUnsupportedFormat));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgUnsupportedFormat));
}
#[test]
fn test_preprocess_digital_origin() {
let img = create_horizontal_lines_image();
let (preprocessed, diagnostics) = preprocess(&img, ImageSource::DigitalOrigin)
.expect("Preprocess failed");
let (preprocessed, diagnostics) =
preprocess(&img, ImageSource::DigitalOrigin).expect("Preprocess failed");
// Should have border padding
assert_eq!(preprocessed.width(), img.width() + 20);
assert_eq!(preprocessed.height(), img.height() + 20);
// Diagnostics should not have errors
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgUnsupportedFormat));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgUnsupportedFormat));
}
#[test]
fn test_preprocess_jbig2() {
let img = create_horizontal_lines_image();
let (preprocessed, diagnostics) = preprocess(&img, ImageSource::Jbig2)
.expect("Preprocess failed");
let (preprocessed, diagnostics) =
preprocess(&img, ImageSource::Jbig2).expect("Preprocess failed");
// Should have border padding
assert_eq!(preprocessed.width(), img.width() + 20);
assert_eq!(preprocessed.height(), img.height() + 20);
// Diagnostics should not have errors
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgUnsupportedFormat));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgUnsupportedFormat));
}
#[test]
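
Note on the hunk above: the salt-and-pepper test relies on a median filter pulling isolated outliers back toward their neighbourhood. A minimal sketch of a 3x3 median over a flat grayscale buffer is shown below. The window size, the edge clamping, and the name `median3x3` are assumptions, since the body of `denoise_median` is not part of this diff.

```rust
/// Replace every pixel with the median of its 3x3 neighbourhood,
/// clamping coordinates at the image edges.
fn median3x3(pixels: &[u8], width: usize, height: usize) -> Vec<u8> {
    let mut out = vec![0u8; pixels.len()];
    for y in 0..height {
        for x in 0..width {
            let mut window = [0u8; 9];
            let mut n = 0;
            for dy in -1i64..=1 {
                for dx in -1i64..=1 {
                    // Clamp so border pixels reuse their nearest valid neighbour.
                    let sx = (x as i64 + dx).clamp(0, width as i64 - 1) as usize;
                    let sy = (y as i64 + dy).clamp(0, height as i64 - 1) as usize;
                    window[n] = pixels[sy * width + sx];
                    n += 1;
                }
            }
            window.sort_unstable();
            // The fifth of nine sorted values is the median.
            out[y * width + x] = window[4];
        }
    }
    out
}
```

With the same 2x2 noise pattern as the test (two black and two white pixels on a mid-gray background), each window still holds five mid-gray values, so the median stays at the background level.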
@ -1067,18 +1110,21 @@ mod tests {
/// Helper to load a fixture image.
fn load_fixture(path: &str) -> GrayImage {
image::io::Reader::with_format(std::io::Cursor::new(std::fs::read(path).unwrap()), image::ImageFormat::Png)
.decode()
.unwrap()
.to_luma8()
image::io::Reader::with_format(
std::io::Cursor::new(std::fs::read(path).unwrap()),
image::ImageFormat::Png,
)
.decode()
.unwrap()
.to_luma8()
}
#[test]
fn test_preprocess_skewed_2deg_deskews() {
// Acceptance criterion: 2-deg skewed fixture deskewed within 0.1 deg
let source = load_fixture("tests/fixtures/preprocess/skewed_2deg/source.png");
let (preprocessed, diagnostics) = preprocess(&source, ImageSource::PhysicalScan)
.expect("Preprocess failed");
let (preprocessed, diagnostics) =
preprocess(&source, ImageSource::PhysicalScan).expect("Preprocess failed");
// Should have border padding
assert_eq!(preprocessed.width(), source.width() + 20);
@@ -1092,21 +1138,28 @@ mod tests {
BORDER_PADDING,
preprocessed.width() - 2 * BORDER_PADDING,
preprocessed.height() - 2 * BORDER_PADDING,
).to_image();
)
.to_image();
let (_, second_angle, _) = deskew(&cropped).expect("Second deskew failed");
assert!(second_angle.abs() < 0.1, "Second pass should detect near-zero skew, got {}", second_angle);
assert!(
second_angle.abs() < 0.1,
"Second pass should detect near-zero skew, got {}",
second_angle
);
// No errors in diagnostics
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgUnsupportedFormat));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgUnsupportedFormat));
}
#[test]
fn test_preprocess_uneven_lighting_binarizes() {
// Acceptance criterion: uneven-lighting binarized correctly
let source = load_fixture("tests/fixtures/preprocess/uneven_lighting/source.png");
let (preprocessed, diagnostics) = preprocess(&source, ImageSource::PhysicalScan)
.expect("Preprocess failed");
let (preprocessed, diagnostics) =
preprocess(&source, ImageSource::PhysicalScan).expect("Preprocess failed");
// Should have border padding
assert_eq!(preprocessed.width(), source.width() + 20);
@@ -1116,20 +1169,26 @@ mod tests {
for y in BORDER_PADDING..preprocessed.height() - BORDER_PADDING {
for x in BORDER_PADDING..preprocessed.width() - BORDER_PADDING {
let pixel = preprocessed.get_pixel(x, y)[0];
assert!(pixel == 0 || pixel == 255, "Pixel should be binary (0 or 255), got {}", pixel);
assert!(
pixel == 0 || pixel == 255,
"Pixel should be binary (0 or 255), got {}",
pixel
);
}
}
// No errors in diagnostics
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgUnsupportedFormat));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgUnsupportedFormat));
}
#[test]
fn test_preprocess_clean_digital_binarizes() {
// Acceptance criterion: clean digital origin binarized with Otsu
let source = load_fixture("tests/fixtures/preprocess/clean_digital/source.png");
let (preprocessed, diagnostics) = preprocess(&source, ImageSource::DigitalOrigin)
.expect("Preprocess failed");
let (preprocessed, diagnostics) =
preprocess(&source, ImageSource::DigitalOrigin).expect("Preprocess failed");
// Should have border padding
assert_eq!(preprocessed.width(), source.width() + 20);
@@ -1139,20 +1198,26 @@ mod tests {
for y in BORDER_PADDING..preprocessed.height() - BORDER_PADDING {
for x in BORDER_PADDING..preprocessed.width() - BORDER_PADDING {
let pixel = preprocessed.get_pixel(x, y)[0];
assert!(pixel == 0 || pixel == 255, "Pixel should be binary (0 or 255), got {}", pixel);
assert!(
pixel == 0 || pixel == 255,
"Pixel should be binary (0 or 255), got {}",
pixel
);
}
}
// No errors in diagnostics
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgUnsupportedFormat));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgUnsupportedFormat));
}
#[test]
fn test_preprocess_jbig2_only_pads() {
// Acceptance criterion: JBIG2 untouched except for border padding
let source = load_fixture("tests/fixtures/preprocess/jbig2_scan/source.png");
let (preprocessed, diagnostics) = preprocess(&source, ImageSource::Jbig2)
.expect("Preprocess failed");
let (preprocessed, diagnostics) =
preprocess(&source, ImageSource::Jbig2).expect("Preprocess failed");
// Should have border padding
assert_eq!(preprocessed.width(), source.width() + 20);
@@ -1163,12 +1228,18 @@ mod tests {
for x in 0..source.width() {
let orig = source.get_pixel(x, y)[0];
let pad = preprocessed.get_pixel(x + BORDER_PADDING, y + BORDER_PADDING)[0];
assert_eq!(orig, pad, "JBIG2 inner pixel at ({}, {}) should match original", x, y);
assert_eq!(
orig, pad,
"JBIG2 inner pixel at ({}, {}) should match original",
x, y
);
}
}
// No errors in diagnostics
assert!(!diagnostics.iter().any(|d| d.code == DiagCode::ImgUnsupportedFormat));
assert!(!diagnostics
.iter()
.any(|d| d.code == DiagCode::ImgUnsupportedFormat));
}
#[test]
@@ -1176,10 +1247,10 @@ mod tests {
// Acceptance criterion: same input -> bit-identical output
let source = load_fixture("tests/fixtures/preprocess/clean_digital/source.png");
let (result1, _) = preprocess(&source, ImageSource::DigitalOrigin)
.expect("First preprocess failed");
let (result2, _) = preprocess(&source, ImageSource::DigitalOrigin)
.expect("Second preprocess failed");
let (result1, _) =
preprocess(&source, ImageSource::DigitalOrigin).expect("First preprocess failed");
let (result2, _) =
preprocess(&source, ImageSource::DigitalOrigin).expect("Second preprocess failed");
// Compare pixel-by-pixel
assert_eq!(result1.dimensions(), result2.dimensions());
@@ -1196,34 +1267,50 @@ mod tests {
fn test_preprocess_border_padding_pixel_perfect() {
// Acceptance criterion: padding adds exactly 10px on each side
let source = load_fixture("tests/fixtures/preprocess/clean_digital/source.png");
let (preprocessed, _) = preprocess(&source, ImageSource::DigitalOrigin)
.expect("Preprocess failed");
let (preprocessed, _) =
preprocess(&source, ImageSource::DigitalOrigin).expect("Preprocess failed");
// Check top border is white
for x in 0..preprocessed.width() {
for y in 0..BORDER_PADDING {
assert_eq!(preprocessed.get_pixel(x, y)[0], 255, "Top border should be white");
assert_eq!(
preprocessed.get_pixel(x, y)[0],
255,
"Top border should be white"
);
}
}
// Check bottom border is white
for x in 0..preprocessed.width() {
for y in preprocessed.height() - BORDER_PADDING..preprocessed.height() {
assert_eq!(preprocessed.get_pixel(x, y)[0], 255, "Bottom border should be white");
assert_eq!(
preprocessed.get_pixel(x, y)[0],
255,
"Bottom border should be white"
);
}
}
// Check left border is white
for y in 0..preprocessed.height() {
for x in 0..BORDER_PADDING {
assert_eq!(preprocessed.get_pixel(x, y)[0], 255, "Left border should be white");
assert_eq!(
preprocessed.get_pixel(x, y)[0],
255,
"Left border should be white"
);
}
}
// Check right border is white
for y in 0..preprocessed.height() {
for x in preprocessed.width() - BORDER_PADDING..preprocessed.width() {
assert_eq!(preprocessed.get_pixel(x, y)[0], 255, "Right border should be white");
assert_eq!(
preprocessed.get_pixel(x, y)[0],
255,
"Right border should be white"
);
}
}
}
@@ -1267,8 +1354,8 @@ mod benches {
let img = create_a4_test_image();
let start = Instant::now();
let (result, diagnostics) = preprocess(&img, ImageSource::PhysicalScan)
.expect("Preprocess failed");
let (result, diagnostics) =
preprocess(&img, ImageSource::PhysicalScan).expect("Preprocess failed");
let elapsed = start.elapsed();
println!("A4 (2480x3508) PhysicalScan preprocess time: {:?}", elapsed);
@@ -1292,11 +1379,13 @@ mod benches {
let img = create_a4_test_image();
let start = Instant::now();
let (result, _) = preprocess(&img, ImageSource::DigitalOrigin)
.expect("Preprocess failed");
let (result, _) = preprocess(&img, ImageSource::DigitalOrigin).expect("Preprocess failed");
let elapsed = start.elapsed();
println!("A4 (2480x3508) DigitalOrigin preprocess time: {:?}", elapsed);
println!(
"A4 (2480x3508) DigitalOrigin preprocess time: {:?}",
elapsed
);
assert_eq!(result.width(), A4_WIDTH + 20);
assert_eq!(result.height(), A4_HEIGHT + 20);
@@ -1313,8 +1402,7 @@ mod benches {
let img = create_a4_test_image();
let start = Instant::now();
let (result, _) = preprocess(&img, ImageSource::Jbig2)
.expect("Preprocess failed");
let (result, _) = preprocess(&img, ImageSource::Jbig2).expect("Preprocess failed");
let elapsed = start.elapsed();
println!("A4 (2480x3508) Jbig2 preprocess time: {:?}", elapsed);


@@ -67,7 +67,8 @@ mod tests {
fn test_lite_size_benchmark() {
// Benchmark: verify receipt sizes are reasonable
// In a real document, all receipts share the same pdf_fingerprint
let pdf_fingerprint = "pdftract-v1:a7f3b8c4d2e1f6a9b5c3d8e7f4a2b1c9d6e3f8a7b4c2d9e6f3a8b7c4d1e9f6a3b8";
let pdf_fingerprint =
"pdftract-v1:a7f3b8c4d2e1f6a9b5c3d8e7f4a2b1c9d6e3f8a7b4c2d9e6f3a8b7c4d1e9f6a3b8";
let mut total_size = 0;
for i in 0..100 {

Some files were not shown because too many files have changed in this diff