The indent trigger was using .abs(), which fired on both increased indent (non-indented → indented) AND decreased indent (indented → non-indented). This caused drop-cap style paragraphs (indented first line, flush-left continuation) to incorrectly split into two blocks.

Per plan Phase 4.4 heuristic #2, indent change should only trigger when the current line is MORE indented (to the right, larger x0) than the block average, i.e., a new paragraph starting after non-indented text. It should NOT trigger for decreased indent (first line indented, rest flush-left).

Fix: Remove .abs() and only check if line_x0 - block_avg_x0 > threshold.

Tests:
- test_indented_first_line_new_block: PASS (non-indented → indented splits)
- test_indented_first_line_of_paragraph_not_split: PASS (drop cap stays together)
- All 179 line module tests: PASS
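The change described above can be sketched roughly as follows. This is an illustration only, not the actual code from the line module: the function names, the f32 type, and the sample coordinates in the usage note are assumptions. Only line_x0, block_avg_x0, and threshold come from the description.

```rust
/// Old behaviour as described: fires on an indent change in either direction.
/// (Hypothetical name; the real function is not shown in this commit message.)
fn indent_trigger_abs(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    (line_x0 - block_avg_x0).abs() > threshold
}

/// Fixed behaviour: fires only when the line starts further to the right
/// (larger x0) than the block average by more than `threshold`.
fn indent_trigger_signed(line_x0: f32, block_avg_x0: f32, threshold: f32) -> bool {
    line_x0 - block_avg_x0 > threshold
}
```

With an assumed threshold of 5.0, a line at x0 = 90.0 following a block averaging 72.0 triggers a split under both versions. A flush-left line at 72.0 following an indented first line at 90.0 triggers only under the .abs() version, which is the split that was removed.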
21 lines
914 B
Rust
// Quick test to understand serialization format
use pdftract_core::fingerprint::canonicalize::{serialize_dict_canonical, serialize_object_canonical};
use pdftract_core::types::objects::{PdfDict, PdfObject};
use std::sync::Arc;

fn main() {
    let mut dict = PdfDict::new();
    dict.insert(Arc::from("/Z"), PdfObject::Integer(3));
    dict.insert(Arc::from("/A"), PdfObject::Integer(1));
    dict.insert(Arc::from("/M"), PdfObject::Integer(2));

    let bytes = serialize_dict_canonical(&dict);
    println!("serialize_dict_canonical output: {}", String::from_utf8_lossy(&bytes));
    println!("bytes: {:?}", bytes);

    println!("\n--- serialize_object_canonical ---");
    let mut result = Vec::new();
    serialize_object_canonical(&mut result, &PdfObject::Dict(Box::new(dict)));
    println!("serialize_object_canonical output: {}", String::from_utf8_lossy(&result));
    println!("bytes: {:?}", result);
}