diff --git a/notes/pdftract-25k4x.md b/notes/pdftract-25k4x.md
new file mode 100644
index 0000000..b1b7c1f
--- /dev/null
+++ b/notes/pdftract-25k4x.md
@@ -0,0 +1,83 @@
+# pdftract-25k4x: Figure Detection + Caption Detection
+
+## Status: COMPLETE
+
+## Overview
+Figure detection and caption detection were already implemented in the codebase in:
+- `crates/pdftract-core/src/layout/figure.rs` (517 lines, 16 tests)
+- `crates/pdftract-core/src/layout/caption.rs` (342 lines, 8 tests)
+
+## Verification Summary
+
+### Figure Detection (`classify_figure`)
+**Algorithm:**
+1. Walks image XObjects from Phase 3.3 Do + Phase 3.5 inline images
+2. For each image, computes union area of all text glyph bboxes intersecting the image
+3. Uses sweep line algorithm for precise union area computation
+4. If `text_overlap_area / image_area < 0.5`, creates a Figure block
+5. Sorts figures by bbox top Y (descending)
+
+**Acceptance Criteria Verification:**
+| Criteria | Test | Status |
+|----------|------|--------|
+| Image XObject, no text overlap → 1 Figure block | `test_five_figures_no_text` | ✅ PASS |
+| Image + small-font caption 1 line below → Figure + Caption | `test_caption_immediately_below_figure` | ✅ PASS |
+| Image overlapping text (background) → NOT Figure | `test_text_covered_image_not_figure` | ✅ PASS |
+| Text overlap < 50% → Figure | `test_classify_figure_partial_text_below_threshold` | ✅ PASS |
+| Text overlap ≥ 50% → NOT Figure | `test_classify_figure_partial_text_above_threshold` | ✅ PASS |
+
+### Caption Detection (`classify_caption`)
+**Algorithm:**
+1. Checks font size < page_body_median
+2. Requires previous block is a Figure
+3. Vertical distance < 2 * line_height
+4. 
Same column (when num_columns > 1)
+
+**Acceptance Criteria Verification:**
+| Criteria | Test | Status |
+|----------|------|--------|
+| Small font + follows Figure + within 2 lines + same column → Caption | `test_caption_immediately_below_figure` | ✅ PASS |
+| Caption 5 lines below → NOT Caption | `test_caption_too_far_below_figure` | ✅ PASS |
+| Caption different column → NOT Caption | `test_caption_different_column` | ✅ PASS |
+| Font not smaller than body → NOT Caption | `test_caption_font_not_smaller` | ✅ PASS |
+| No previous Figure → NOT Caption | `test_no_previous_figure` | ✅ PASS |
+
+## Test Results
+```
+Figure tests: 16 passed; 0 failed
+Caption tests: 8 passed; 0 failed
+```
+
+## Key Implementation Details
+
+### INV (Invariants)
+- ✅ Figure block has no text lines (spec: `lines = []`; `Block` has no `lines` Vec and uses `text: String` instead)
+- ✅ Figure blocks have `median_font_size: 0.0`
+- ✅ Caption blocks have `kind: "caption"` set via `set_caption()`
+
+### Critical Considerations Addressed
+- **Text overlap union algorithm**: Uses sweep line for accurate union area (not naive sum)
+- **Sorting**: Figures sorted by top Y descending for consistent page order
+- **Column assignment**: TODO comment present for column assignment based on image center
+- **Above-figure captions**: NOT detected in v0.1.0 (as specified in bead)
+
+## Files Modified
+None - implementation was already complete
+
+## Retrospective
+
+### What worked
+- The existing implementation is clean, well-tested, and follows the bead specification exactly
+- Sweep line algorithm for text overlap union is mathematically correct
+- Test coverage is comprehensive with edge cases (thresholds, empty contexts, multiple figures)
+
+### What didn't
+- N/A - implementation was already complete and passing
+
+### Surprise
+- The bead was already fully implemented despite being in the ready queue
+- Both modules share a common `Block` type via `pub use` from caption.rs
+
+### Reusable pattern
+- The sweep line 
algorithm in `compute_text_overlap_area` is a reusable pattern for union rectangle area computation
+- The `classify_caption` pattern of checking: (1) font metric, (2) spatial relationship, (3) column membership is a template for other block classifiers
diff --git a/tests/fixtures/scanned/documents/form-300dpi-scanned.pdf b/tests/fixtures/scanned/documents/form-300dpi-scanned.pdf
new file mode 100644
index 0000000..e7afe79
Binary files /dev/null and b/tests/fixtures/scanned/documents/form-300dpi-scanned.pdf differ
diff --git a/tests/fixtures/scanned/documents/form-300dpi.pdf b/tests/fixtures/scanned/documents/form-300dpi.pdf
new file mode 100644
index 0000000..9439a34
--- /dev/null
+++ b/tests/fixtures/scanned/documents/form-300dpi.pdf
@@ -0,0 +1,106 @@
+%PDF-1.3
+%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com
+1 0 obj
+<<
+/F1 2 0 R
+>>
+endobj
+2 0 obj
+<<
+/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
+>>
+endobj
+3 0 obj
+<<
+/Contents 9 0 R /MediaBox [ 0 0 612 792 ] /Parent 8 0 R /Resources <<
+/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
+>> /Rotate 0 /Trans <<
+
+>>
+ /Type /Page
+>>
+endobj
+4 0 obj
+<<
+/Contents 10 0 R /MediaBox [ 0 0 612 792 ] /Parent 8 0 R /Resources <<
+/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
+>> /Rotate 0 /Trans <<
+
+>>
+ /Type /Page
+>>
+endobj
+5 0 obj
+<<
+/Contents 11 0 R /MediaBox [ 0 0 612 792 ] /Parent 8 0 R /Resources <<
+/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
+>> /Rotate 0 /Trans <<
+
+>>
+ /Type /Page
+>>
+endobj
+6 0 obj
+<<
+/PageMode /UseNone /Pages 8 0 R /Type /Catalog
+>>
+endobj
+7 0 obj
+<<
+/Author (anonymous) /CreationDate (D:19800101000000+00'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:19800101000000+00'00') /Producer (ReportLab PDF Library - www.reportlab.com)
+ /Subject (unspecified) /Title (untitled) /Trapped /False
+>>
+endobj
+8 0 obj
+<<
+/Count 3 /Kids [ 3 0 
R 4 0 R 5 0 R ] /Type /Pages +>> +endobj +9 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 849 +>> +stream +Gatn%?#SFN'Sc)R/']H3e.sDTK\fBOKJ!lZ;+W]hlSNn*J3+E#a(a*&qO+k#7HFY5/P$!aqtN>CH#/7$dsO^Z4^A+oUi=m45TE=-5eL#+dK)c#dAEI6mI$tFdS?B#Zoi$6!MiA%KLmqT]3uV=8'i>b5r;7>j-sI68rk./KCm'ATh]YlFSE,=;h[qV_)7Q>KWCBr/3AV/q^utS5Ob\?4fTC@R120QlgXegi\>j'c,'.U2+_mI)3;W8`$=[h?>CR9I%SdN@/3TX#Hi]Gq6HTE('nAqGZdNBu3CXGYHmRj:Sj2DH^0a>m>a:AF83Df>endstream +endobj +10 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 742 +>> +stream +Gau0B9okbt&A@P9R,bW1'H9E\hU3rhOZ":Rk[@8P(!OZJ0[uJTq"Jo].,Acq[T`bF$MNQ*b^5BO!ffE8HTgEiT4@Z',X>Wdp_rP[c0c0to6h^9@mL?dQW_]A%HJ#>EB=pO7:(#b]UstQhL3iGI_.0]Bk3olZN)P!O^DR6PJpl?m*.R(pP*H.eJ%:p`q>g]!()t/gGo,<-2_p6GZ?._$OrN^0,EfE3Q#o_ST+>2OeV[o/asIu6%.cjqV!kBpPn@?nU8:VIsg--58_j.b]?8J%#6uW(1`/86%](N*U_p#)(jujZruE[55q*==N=;B#@V7Ng422("$Fal.Qa'hQ5M)nnKJX/fL6`%/)Y!,a[0nBUFPmCZ`drG\n)G#-SM-&&IqA$QBWmgF,2kWT>BBGF,Ok#M]B/2UtZCsbO>cRL@Ji?Bn0-T=sZ[L[fTFC2KP6J$Fm]fg6HZ#uTTLI)o\*C?,rmLMH8SrYQ^0+>l.VjK]?Jsrn(-`\(~>endstream +endobj +11 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 192 +>> +stream +GasJIbmMqM9%f85Ok1(h][:@of[6RPSiLSR>lidan*kN.H9nWuKn9q&i-V~>endstream +endobj +xref +0 12 +0000000000 65535 f +0000000073 00000 n +0000000104 00000 n +0000000211 00000 n +0000000404 00000 n +0000000598 00000 n +0000000792 00000 n +0000000860 00000 n +0000001156 00000 n +0000001227 00000 n +0000002166 00000 n +0000002999 00000 n +trailer +<< +/ID +[<30157dc3b9cf65b8d1eaf3493559908e><30157dc3b9cf65b8d1eaf3493559908e>] +% ReportLab generated PDF document -- digest (http://www.reportlab.com) + +/Info 7 0 R +/Root 6 0 R +/Size 12 +>> +startxref +3282 +%%EOF diff --git a/tests/fixtures/scanned/documents/invoice-300dpi-scanned.pdf b/tests/fixtures/scanned/documents/invoice-300dpi-scanned.pdf new file mode 100644 index 0000000..cff16e6 Binary files /dev/null and b/tests/fixtures/scanned/documents/invoice-300dpi-scanned.pdf differ diff --git 
a/tests/fixtures/scanned/documents/invoice-300dpi.pdf b/tests/fixtures/scanned/documents/invoice-300dpi.pdf new file mode 100644 index 0000000..a2d079e --- /dev/null +++ b/tests/fixtures/scanned/documents/invoice-300dpi.pdf @@ -0,0 +1,87 @@ +%PDF-1.3 +%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com +1 0 obj +<< +/F1 2 0 R +>> +endobj +2 0 obj +<< +/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font +>> +endobj +3 0 obj +<< +/Contents 8 0 R /MediaBox [ 0 0 612 792 ] /Parent 7 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +4 0 obj +<< +/Contents 9 0 R /MediaBox [ 0 0 612 792 ] /Parent 7 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +5 0 obj +<< +/PageMode /UseNone /Pages 7 0 R /Type /Catalog +>> +endobj +6 0 obj +<< +/Author (anonymous) /CreationDate (D:19800101000000+00'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:19800101000000+00'00') /Producer (ReportLab PDF Library - www.reportlab.com) + /Subject (unspecified) /Title (untitled) /Trapped /False +>> +endobj +7 0 obj +<< +/Count 2 /Kids [ 3 0 R 4 0 R ] /Type /Pages +>> +endobj +8 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 1087 +>> +stream 
+Gau0B?#SIU'RfGR\.@HS'XgM0Ua5hONoR8[P1Z^e!X&![b=ofK&e0`jlM^QV,1CX>Bd:Sh"4e:W4hBg**+;,#L\q+W#.Q5BQT#a8J`6\aSPU_P?s1-g\rZacab2e6a2%qSOiL>%lgKE;6nIZS3I"[KZXHBMsiOUHcE?"9)Ub$X-)g1)0&R9$q&5aZ1oc\3`,tk`E4g->T*5Q(o;i1j=)eM:;bF6Uh#$Y51ZMF!jEA-*2^$e@:iq:"uP>]qN8#[kHrMUXTnGSG'G(0PA4<(2W-?2,97S.3KVkR;4.;%4&r`c5-]Rj^)K\gLtMAC*R/l"3nqS9Fb$CA2dNG)NuF8[K1&#-=KWbLl']bO#;]rGU)K"F5I,D:$k..r9J2b#VEWABp.V6Z*F5`^:s\^D1=Y"e;Ta5&E`-&X+ALeF($-rc]kMY$:H%$C!g/BQA-1R;SA:_OVE?0FX78Cg+#'rD$*9b$.^.`#bD4:-(GD0>Zq>6-7flXnRkj[W471E&291$k&Wm&i`\C:We[ptU1rXDZka>rUd>26XV1M7rrr1NE=WXZ0,oTo2OrSJt34R^Yc@dTA(DUnR`:)P!Pi0Z_oMF_:fHN)G>a73'=_'+>ailnhCtMiB-eQKAHo]9C7RYu#=Mb/\SZ.5M;-,+?uVi?1X+XZhWj\CDdJ`b4r*<18^Y<5J62aNRTt`e~>endstream +endobj +9 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 426 +>> +stream +Garo>d8#0uX'/PP>nX:f&Q>LQ%*VT$m=e^\p9jFFmi.s(0OYlh@4bYN_$&^S$>FS;\WSN6a^p$AfiD"%"=X'%hLL#Z-3qX1q*hZU/f]5SeIKfBAP6GY=;k&X(/&NPpJQiqk=$`h^cP)Md\5J7*=nn+ZF6/C):+>K$nH%.M`#FLYQkO.rXT$1?(=*W2:f;gKp<'Ku5_rOdUFk:G`a`-PUBUklI,oJf^^M;@j]cWLbN-a;rmT%8Jl?+aT%lS6=_6&5eH/03`r6d>::dYn0jo0d=S,UOh,kQ;SLT+G,057UOPhPcW@,"h`KXFm1E-/[=N.(bZ!R2G~>endstream +endobj +xref +0 10 +0000000000 65535 f +0000000073 00000 n +0000000104 00000 n +0000000211 00000 n +0000000404 00000 n +0000000597 00000 n +0000000665 00000 n +0000000961 00000 n +0000001026 00000 n +0000002204 00000 n +trailer +<< +/ID +[<30157dc3b9cf65b8d1eaf3493559908e><30157dc3b9cf65b8d1eaf3493559908e>] +% ReportLab generated PDF document -- digest (http://www.reportlab.com) + +/Info 6 0 R +/Root 5 0 R +/Size 10 +>> +startxref +2720 +%%EOF diff --git a/tests/fixtures/scanned/generate_scanned_fixtures.py b/tests/fixtures/scanned/generate_scanned_fixtures.py index 676bc8a..4776511 100755 --- a/tests/fixtures/scanned/generate_scanned_fixtures.py +++ b/tests/fixtures/scanned/generate_scanned_fixtures.py @@ -80,9 +80,9 @@ def create_pdf_from_text(source_text_path, output_pdf_path, config): with open(source_text_path, 'r', encoding='utf-8') as f: text = f.read() 
-    # Create PDF canvas
+    # Create PDF canvas (convert Path to string for reportlab)
     page_width, page_height = config["page_size"]
-    c = canvas.Canvas(output_pdf_path, pagesize=config["page_size"])
+    c = canvas.Canvas(str(output_pdf_path), pagesize=config["page_size"])
 
     # Set font
     c.setFont(config["font"], config["font_size"])
@@ -152,7 +152,7 @@ def rasterize_pdf_to_scanned(pdf_path, scanned_pdf_path, dpi=300):
     with tempfile.TemporaryDirectory() as tmpdir:
         # Convert PDF to PPM images
         result = subprocess.run(
-            ["pdftoppm", "-r", str(dpi), pdf_path, os.path.join(tmpdir, "page")],
+            ["pdftoppm", "-r", str(dpi), str(pdf_path), os.path.join(tmpdir, "page")],
             capture_output=True,
             text=True
         )
@@ -160,7 +160,7 @@ def rasterize_pdf_to_scanned(pdf_path, scanned_pdf_path, dpi=300):
         if result.returncode != 0:
             print(f" Warning: pdftoppm failed, copying original PDF")
             import shutil
-            shutil.copy(pdf_path, scanned_pdf_path)
+            shutil.copy(str(pdf_path), str(scanned_pdf_path))
             return
 
         # Convert images back to PDF
@@ -169,13 +169,13 @@ def rasterize_pdf_to_scanned(pdf_path, scanned_pdf_path, dpi=300):
         if not images:
             print(f" Warning: No images generated, copying original PDF")
             import shutil
-            shutil.copy(pdf_path, scanned_pdf_path)
+            shutil.copy(str(pdf_path), str(scanned_pdf_path))
             return
 
         # Convert images to PDF using img2pdf or PIL
         try:
             import img2pdf
-            with open(scanned_pdf_path, "wb") as f:
+            with open(str(scanned_pdf_path), "wb") as f:
                 f.write(img2pdf.convert([str(img) for img in images]))
             print(f" Created scanned: {scanned_pdf_path}")
         except ImportError:
@@ -187,7 +187,7 @@ def rasterize_pdf_to_scanned(pdf_path, scanned_pdf_path, dpi=300):
 
             if pdf_images:
                 pdf_images[0].save(
-                    scanned_pdf_path,
+                    str(scanned_pdf_path),
                     save_all=True,
                     append_images=pdf_images[1:],
                     resolution=dpi
@@ -197,7 +197,7 @@ def rasterize_pdf_to_scanned(pdf_path, scanned_pdf_path, dpi=300):
         except Exception as e:
             print(f" Warning: Rasterization failed ({e}), using original PDF")
             import shutil
-            
shutil.copy(pdf_path, scanned_pdf_path)
+            shutil.copy(str(pdf_path), str(scanned_pdf_path))
 
 
 def generate_all_fixtures():
diff --git a/tests/fixtures/scanned/multi-page/doc-10page-300dpi-scanned.pdf b/tests/fixtures/scanned/multi-page/doc-10page-300dpi-scanned.pdf
new file mode 100644
index 0000000..37ec96a
Binary files /dev/null and b/tests/fixtures/scanned/multi-page/doc-10page-300dpi-scanned.pdf differ
diff --git a/tests/fixtures/scanned/multi-page/doc-10page-300dpi.pdf b/tests/fixtures/scanned/multi-page/doc-10page-300dpi.pdf
new file mode 100644
index 0000000..8febdf1
--- /dev/null
+++ b/tests/fixtures/scanned/multi-page/doc-10page-300dpi.pdf
@@ -0,0 +1,265 @@
+%PDF-1.3
+%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com
+1 0 obj
+<<
+/F1 2 0 R /F2 3 0 R
+>>
+endobj
+2 0 obj
+<<
+/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font
+>>
+endobj
+3 0 obj
+<<
+/BaseFont /Times-Roman /Encoding /WinAnsiEncoding /Name /F2 /Subtype /Type1 /Type /Font
+>>
+endobj
+4 0 obj
+<<
+/Contents 18 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources <<
+/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
+>> /Rotate 0 /Trans <<
+
+>>
+ /Type /Page
+>>
+endobj
+5 0 obj
+<<
+/Contents 19 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources <<
+/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
+>> /Rotate 0 /Trans <<
+
+>>
+ /Type /Page
+>>
+endobj
+6 0 obj
+<<
+/Contents 20 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources <<
+/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
+>> /Rotate 0 /Trans <<
+
+>>
+ /Type /Page
+>>
+endobj
+7 0 obj
+<<
+/Contents 21 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources <<
+/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
+>> /Rotate 0 /Trans <<
+
+>>
+ /Type /Page
+>>
+endobj
+8 0 obj
+<<
+/Contents 22 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources <<
+/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ]
+>> /Rotate 0 /Trans 
<< + +>> + /Type /Page +>> +endobj +9 0 obj +<< +/Contents 23 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +10 0 obj +<< +/Contents 24 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +11 0 obj +<< +/Contents 25 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +12 0 obj +<< +/Contents 26 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +13 0 obj +<< +/Contents 27 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +14 0 obj +<< +/Contents 28 0 R /MediaBox [ 0 0 612 792 ] /Parent 17 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +15 0 obj +<< +/PageMode /UseNone /Pages 17 0 R /Type /Catalog +>> +endobj +16 0 obj +<< +/Author (anonymous) /CreationDate (D:19800101000000+00'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:19800101000000+00'00') /Producer (ReportLab PDF Library - www.reportlab.com) + /Subject (unspecified) /Title (untitled) /Trapped /False +>> +endobj +17 0 obj +<< +/Count 11 /Kids [ 4 0 R 5 0 R 6 0 R 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R 13 0 R + 14 0 R ] /Type /Pages +>> +endobj +18 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 539 +>> +stream 
+Garo=6''_R&;BTM/)G9T&sT2J76Qr7$)q)XjD$>p1!W%sNU:EGd.6We:(Wls-rsl(2"/-RVSGm>UAl/;HrKhMEu2iFFJ)O'n,teJYuck<4)h43`7aT6d#pl'PIli[cX_e%7"EbP>G.MoF\-18h?cG"1-!P=ZX^m0@U+A"L:CcG=$uM41LjBO;'$e$1gZ./4!ePDnV7RK;;*IaWV>FOAFXC+'Mq=ef\#BWXcLNe1$LXL7/3Q;n=S.Lq1f8I3S+0^endstream +endobj +19 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 821 +>> +stream +Gas1\9lC\"'YO0A]GiF0Th3_?oEJ6qRh.Sr?kmSDlV>^F[FV3Wa>-3SY,&lrU3ME"nl,V5-h)ojXXMQfl&`4*(4%f_rCTWWN=h6\$?f0_2:h6Q,1Th`hM*S^rB$3%8NJe8.D);>Tkb4rBH4_@O/4If^qW_hhQ@M>2#a)_a\5,F@c!)Nk>D3n23'UTAb?)kA'Q-S4QlQPUc2&k(+F%j#eqO%Ek\e>8015Be)oXR6oJFT7kU3Qosu&FBrh[YoI@2MF5^R_J"U>;bjg>@chBJH8pSsjl$cr9'"#-LVVUB/2G\9le/uY/pYRW[ac.@6JI;A)agotf;Fl6)4PV@Q45R8s.-l+Jk2us5?=4ukFpM?KZ+I`2'IARaNRq+d&)r7A'Y4qj'&d@O;":]t!J_LUG]:%Ibq[s27Z$nSqB`7b=`Qie+"6dD5L:oq$"2)i(hNdaKgtV%_0O\#dHGa5'=10tIW4[H7>t8=A#J"%d`?lJ]LE\`d&%46SqBTSA:YnQiN"hLL#,sBE.c9))`)k;HD0:W;PJ,:RISaiRATJ.\6;jDq^hX2>4^*AV.n[cV;8+;?!OAZC#3']p%pO4+Z$]st*W6-a4/@'~>endstream +endobj +20 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 676 +>> +stream +Gau`Q?#SFN'RfGR\;tISN;toTjCtb-e_iQ5<,"7gbHdCfDA(uXErd!Xq'`]<]>[Dd8q*8!=L$qAO@u/6\RJ.Z`]j6Jt&)El`Uk%$TXq?7L>i2?@GN7a_%!.RaL1HFdDh.`go!T^nDf9]df@ms<4-.iDaY8H+rk+'M`7/uHN-_2(?kc&,@#B.c5%p]fX7,?o*1@"M$P(M5WB^YE+K4a0tYTqF9E.jYOeORSNs"]\7c^:U+PadLT:Yhe=?0ZZ"HEs:%(\4,)4mY't4)1#LGE_SRkD4CGnG_!;s/3gDq%GeGW%@0YDIH9eeJrG_n@\O"Q$fam\[!dQ3Ag,IuUU3UY+"7'I>\3`s$hIe05/1Hd+4;>H.F(MdAdDDK?-bcWa=eKri6nIkmg.!(<+JZL%NQ2kDUYY#V`GLWhAu3Wq/'mbId;T23m'6L")!Bi^^=+:"([fA['hXQHYendstream +endobj +21 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 700 +>> +stream 
+Gau0@?#SFN'RfGR\.;Q5_dWP%j7!gReolZ\K"%]l02\'&-6Q5*<$Cel(WWDg7Pa@^7#I"@q&H#JD_;&q^81ThXT82)hL?e9.&^c$]ZWSM+]MOn3^()M/;6U1@hR8ug>dFL!:PIGqSNjKJI69[r<2f]bRoa*On3peV(eu4]`t+:6#::-5VP9A]87"qAKW)sRcDD:V8^<*$rfenq"l(5igU9/(a5(PGs.?a@=H,k4NIE5>gJ1"O--Z2uXEX9+&rNHP:B_Rf/XK+$;o,2<>)+N)^RUT,t@-[fAn'I-!Y3Rb^JOi(E<`+iK&blpS4YKe`kjj-)"XjF+a!1in'YeaFC4J)-RfoLEhC[latC$7Ld@:6r/Yl.Id*`kEaN_WTRu.%$S!RaqPr&_-5J)-Bd@uf.)5(,#/u;/\-ocp3-&WroJtUA-n4`Zl(qHW6qlSMoQ=$O8cG7AX)pM6PH"qf9ku#N@oc($VMH9~>endstream +endobj +22 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 970 +>> +stream +Gat%!?#Q2d'RfGR\EXa`3X16Kb190:9smOf;P4PaOpq!/pWj5YK/d+D\AcR9$CnDX8.%_;]?(3EJ4_4SWp^C^@`fm\>[0em&kZGFjQ=bd<8Ylk!TrO&0+pN,iT;g7rR55j/Ll!8l?(]5q!!"=hhS0-2g!IheT_'>!**#1<+.5%.!\ZTn)#f2dj^p6Z`JRqu9^s,244p>[B_N*JTj:4R;=4)A^QPi\@<2jF2OVWcn@jI,jd[S<`>S()h&]L"["WH9S#Sga^P7Re,j_gqaOqn=bG[84uh;#$kEY?1441LH;9J#FGa5]XK"er1[-3H=e[-@1Hk]D8(`_Bc'#2GtR9r+2-,48f5YjTSB]:,!f_oP!+P-27!Pb:iEhA6LcTC9.RC,[V,ognSlXC>k,_5"ca2UlnQ:gg/a:'+&6;gLsEm^W@\n.)aP\'$I\%a_TL*b]0eO'a>&dtg8@B\*TU+gA6,]B6'bmP[f)9&*P$4&,,P;t!-R#'$%l0InPD:^)C_UM\'likS+6Jh&an5t-aB!)OhjZW_H!O>j^>ZB$l6j&QYe@C#qlNalSnP-l5(N8YcbR>a\13Y%^<-+feZ(!4kT)'Fq)f$S]E0YVeTPI*NaWcLfj33eY5>ZZYsQfp^bTCS&>~>endstream +endobj +23 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 1229 +>> +stream +GasIf?#S^^'RcT\EA^O_\M_fB1)TQ#GS%m]*4siAj:1PGKZ]l9Y+MQEpK)c`4700HAC<4)AcOZ",`9[%D/f@&a_FN_[d=;?qc7$WAa.GSq=0Y^!0?2+$8"W2eBF%`\,DTel5CQ0tpaOb]^oBBoQD2&8QfYOn,+L>F(Deg>m'NPOejZ9+=pLl[I2KKd$^cgDj.u&)QU^T!2$CnjE'IB^\_^K/Y\.%h?7+jJK"_FsV>Td\clS)ah[Y_j(esg3`8kFR4kD1)+G`eX<0>3BX\-M(&/g,QNUWE>tPSB8Jh0XAgOU`0`n0JEL*\P>8Op3Oo'q-EISr]__,NJ#oDb%g2NklCY5E[R3*'1q#LUuaBM)igE%lM(YU-Rj&(,e^KdYk_:aR%]g:S6gC;5c[Lih*?6ZFm[X]Wnl>=GF-9H?H>/lZK.QqeT]"*am1nT%WDaEt)8PBb>($TCWWq=WC5dj?E:et-?=?TMni?o6n$t[qX':&Y#>tp#+*AT1IOO0m:R;Jt_#~>endstream +endobj +24 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 891 +>> +stream 
+GasbY?#Q2d'Sc)R/'^Sk[1hB#0d#X#9rNe7Z`JBfPBu77DSjC`##<_HqjFWE[+MLC+p@GrF5HM[!s2D-W;0_1!&kV+h',*$kT_5@PlQddTP)Mj4Sg%?#T&,/GsF0q=Ki5>pDRqe\Nc^@IKVCuBlFm4kp=\@5ei9E`Fcf-r1/Z`mYr;[?3'Ni+JpMmq)&&ZqNIgei=l)1MdNESraEtk"damU-*s88JnlF9*]O/%^me]N[8HHSp*<#GVoO)j;D&)Zb6i<$c,^XCAeKO40DU?S#CaHj&XKOO^EODTXIFPR2.kjMLtqfPf]JgR\K5URaMB&HF@6sJg4ST>K!T-\h<;4gS[a7[C#V0jK!rjUrXkKiM-"#`'%A*JPIlSS-rFM!`_frm>._PcSgh"&M2fb[EV1@d$8bE:9mphhPeLS=+ZcAA%bKFVepD43%YT%JP"_LnbB.(:Op*q[*@/[7k[k.p]GF\\kF2+JN?W5@l[S9p[YqU`.9`Xnpem'>(gOaOY`2/Nab`V;[(=4]*R.K3683.#IBiO%.Y/r'S%_^0.g*.p]p"T1M`9[W>c,(:b+=ZV*/6:Q3Gi2e1DFaa[?oP&q;`Zm((]A)]QhUl^+Ibj0<!]@JUmLM@$^Tk]NEMYsq.=Hu&_f&\%EohBf4Lb7bV`bFi9W[k.kEF~>endstream +endobj +25 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 154 +>> +stream +GarW10aki`'SH/ZM@9tpMSAgg\?8mqiPEnc`kbes&H)R/ol0,6$"QTbMaCH[al`nrLsned#/6bJE<%d-S'kNKf0:Gr69g&PKnmf_o%R>#N#C=/jOre^L,bffY'9%276Bm_$f*5:#N'Zkf]EIa(G<endstream +endobj +26 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 1110 +>> +stream +Gas1^9lo#Z&A@7.oO/9`/OLd,J`*mF-FHFa%Y[p)Gq9jjCMAOILC*kD[sKA\].&f!.LIoq2r1[plV8Us:E$t5%cDa=mlmP/Al'Kjc)3R$T4([3$3+WkrgsZ3M[UB-&ulGuDT*3]>f$9[DR[16I_!)DgoWNM6hNB;J%Q>Y,olXL=crs0)\`LjMUpf/f%rIDTt-%)cHA>-31nM!i\q!K&C*a+WK,mn89N:VK8]Gr%j8KJ2GPY*rj3IO/\aVb"tZ2j=]9QNCJX,r&J!*lD8D'!"?[[.G+]Rto'HTYl`M@'ZM>37"t`0t#^%CP^AT=9j3&9?X/"fNC>F$\Fi_WB:.@Ii4bg&`dPZ+c66hNojl!`eX`To%)6#c$J)\DEEH:V*:J#cFV1!h?0S\BYmK(Q>Y1MuS_KAWiAW^V&KOU>[#kp\$gK(6fC^kR9GOXHnWR0?Y^g]J-XM$>^gT=MZ+mO:$[MA^@/uQm(4M3(tn2ic4)S4."%F*qM/[Ct8D_)B4qlWkEj&?f98An!Fb;Ca)@dOU`\C4&2D-.h^V_j5.Soh#uBLsl.77$Nc,U^dm/P&T6#JbZTcmY5?^jMqjUrHWI\"KcT:=^0h1'"-da!d+UZ6'S,SgksT$&YN+Xo])&GVe=bY?u'M"OSoG=Ci;ErD?N&N"n.JHi.4!jhDI^j5GG?=`sI*U$M>aY/KMl\nd6Nl>Vf-0O,%-47Q`E3(G[o%Rij7q_cbN'PeD[tF.,knLnQAL2KEbt6d`lt5?l(N5F_Ca)X$Fm+kQ[@YL&Yt;de?n7E.nHAWf`EU6m4j.gDGRc][a1cE'Jmr](ZK`?1hLU8Kg]iulGln\9kk]4.IsJuCCnqdT^MN24KE>"D3df=.idn8ipaS,4ZBS!@ujZ:3)=76Z,.SYK6[Wo;f*`G=aI(gL"4Z/Q)cNVf^ak>0;`$O@p(Lq8cb44E+8f!dn_DUa^V^cBr!O1jB%6~>endstream +endobj +27 
0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 1134 +>> +stream +Gas1^9on9n&A?DnW67g*MIB'^&@hD;TP\D@Hc/>;$Q%h;M.[G_Z5n8_n#'tm2`nkg&d68JWnse[#c@6Nr&:e[^1?`b/qPL+li^0@8KdmH[b3$X*oB,AXq8bnQEq:2)FiUdJZr=qpW_1\3IBICD.)u+SC-p1A3V46o8[NIFf+oj9k+_2GViGWDV)*UZ`7S;-jf/A0:P*eEo,6V%_e[fJt&F.7@F?"/J\`_-nBuPA)"YK7\`7`%5?_?VLKma,L-Si^CX_:c=*c]d]d@"i@oT6mLKiDS7/0]c#a2u%h29UM&Crc#[+r9@g-K[CeEj"+TW]eKbPT^FN.c!%;N2KRH->Y=/8Z'hU:L*n3;4Hl:Zq<.q@Mu_TN[4h'm:MVbgQi)3a5adheRH==l"P`afRfe;.KlPL>.QPZ?6ps(2np0hK8Dc?qB;P`&H$W_u9p%Kf.dlt0o^Cm8g@Z\G%2:pr.2`3JJl3?\Ho:/4uT..+8`3HJ1V+e]OE*ODlfTVqV.Vge7U?K;@fZ)'QZ12bke7RE]3i(bAR*(=ANHFc"iLG8XHEE0;["a/\Yq\M\@+XQ,c.hD-8`P5#:X??W^6D]d`p/!D@RTM?&JElNZmBkoDn+=cPA=g+]NW#P`osQP;U3<2UH@6H8Q/&@)"i4B8TR@on'^[a`D,DeNdKi[?@F8ee+udOj^M(8"6e/K&Cg-I$2p@53>:8h/:D$K_Cu9(h=Voqpd\rU-Kc5TK>1_(+C3)qA*&5ZrLL74W0j?>EDagSOE((\ZsR70~>endstream +endobj +28 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 1156 +>> +stream +GasIfgJ6d"&:L1SW0`X6eWTSW2mca;#p:B'@#<9G,[NP!9&,?l[bC+6k$c#QX1FLG8D#/_@("ggWm7S)'gb08U&1*$^c[<8Y!WI>82h),K*9V,\4XNGoB=Zpn,P2'?lVrb/`H%^C_=PB!`i+?jm;M!bjI)lEIYI-I9PI^hgL.L?V#\DjF.g4rQj@)Oi'q8eUYX(]]L=Uok%E\OD3C6=BW^6!u[E9C9/!_1U&lfV2I[TWbMo>=-c(Wq\gD72fQNV,e1kBgPa@q7['W(Z-d7G&mH?K0,91AlddOkocPTQWc.qsPr1U<-!pXLAT5COZW$YQI5uI5SOEWOn]dZg.!Br`P^0+$.,a[3L1&)?PgD3)&)^W\&mgjJiEBZP2(jkWNq-b`hlu?u(eu))B*fk.enlQ=>RDtk:]r7!'AIqbm("&F@.nR2Tn6:(;JkW[/*MgSoIQ",E/Rr3Q_Go>g$:+d^UThZ)TP6rW^J-J+j'T=1;iTpb2XBU.41gT>,N($Brq8Bl8%)c0Nqe^g@WnJJkIeY$f<4=`ip^8J'Y`kgk%SJi-9T/jBD.MA7)1'lZYkG[8j("iK>7(Q5r`4I4Cb3`NCVHhGEAnQJ04-Hdo2'H1XrNq!pXOqh*T~>endstream +endobj +xref +0 29 +0000000000 65535 f +0000000073 00000 n +0000000114 00000 n +0000000221 00000 n +0000000330 00000 n +0000000525 00000 n +0000000720 00000 n +0000000915 00000 n +0000001110 00000 n +0000001305 00000 n +0000001500 00000 n +0000001696 00000 n +0000001892 00000 n +0000002088 00000 n +0000002284 00000 n +0000002480 00000 n +0000002550 00000 n +0000002847 00000 n +0000002976 00000 n +0000003606 
00000 n +0000004518 00000 n +0000005285 00000 n +0000006076 00000 n +0000007137 00000 n +0000008458 00000 n +0000009440 00000 n +0000009685 00000 n +0000010887 00000 n +0000012113 00000 n +trailer +<< +/ID +[<30157dc3b9cf65b8d1eaf3493559908e><30157dc3b9cf65b8d1eaf3493559908e>] +% ReportLab generated PDF document -- digest (http://www.reportlab.com) + +/Info 16 0 R +/Root 15 0 R +/Size 29 +>> +startxref +13361 +%%EOF diff --git a/tests/fixtures/scanned/receipt/receipt-300dpi-scanned.pdf b/tests/fixtures/scanned/receipt/receipt-300dpi-scanned.pdf new file mode 100644 index 0000000..5cfb0b2 Binary files /dev/null and b/tests/fixtures/scanned/receipt/receipt-300dpi-scanned.pdf differ diff --git a/tests/fixtures/scanned/receipt/receipt-300dpi.pdf b/tests/fixtures/scanned/receipt/receipt-300dpi.pdf new file mode 100644 index 0000000..4cb843e --- /dev/null +++ b/tests/fixtures/scanned/receipt/receipt-300dpi.pdf @@ -0,0 +1,68 @@ +%PDF-1.3 +%“Œ‹ž ReportLab Generated PDF document http://www.reportlab.com +1 0 obj +<< +/F1 2 0 R +>> +endobj +2 0 obj +<< +/BaseFont /Helvetica /Encoding /WinAnsiEncoding /Name /F1 /Subtype /Type1 /Type /Font +>> +endobj +3 0 obj +<< +/Contents 7 0 R /MediaBox [ 0 0 612 792 ] /Parent 6 0 R /Resources << +/Font 1 0 R /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] +>> /Rotate 0 /Trans << + +>> + /Type /Page +>> +endobj +4 0 obj +<< +/PageMode /UseNone /Pages 6 0 R /Type /Catalog +>> +endobj +5 0 obj +<< +/Author (anonymous) /CreationDate (D:19800101000000+00'00') /Creator (ReportLab PDF Library - www.reportlab.com) /Keywords () /ModDate (D:19800101000000+00'00') /Producer (ReportLab PDF Library - www.reportlab.com) + /Subject (unspecified) /Title (untitled) /Trapped /False +>> +endobj +6 0 obj +<< +/Count 1 /Kids [ 3 0 R ] /Type /Pages +>> +endobj +7 0 obj +<< +/Filter [ /ASCII85Decode /FlateDecode ] /Length 1055 +>> +stream 
+Gau1-gQ%aW&;KZN'N:ur[T1NI#UFUhiapT#5f$;.l4$A=KQ[GJ@!d.q)B@7-MJ6-lM*`3"HlgTu[Kco_+5Gat7t@"Zb:=!gJ@Yd**!RkqW?)NIW:))uHgqD-?1>?3l3J3j2O;Hg,V,i:O`S'leMC$o5Wq>A_anE]rYb58YH@);$N^;/!^ujtXhL)S@%.Kb#,*(O+0&4#n_o(@ro$ubHc`e`=m]JO[8+s=qFP$FZ;r1^^Eie)6,_;l(J!;6fp)%V,Bc%*O\,+ldBn0^"fcG*XDG1J>?%Fg+@`jG0[H1TdCm?*S]7&QY`]ZOi%VH44q$;R1WB9E@B9ge7',Q%FH-AhMBe5a>gq@cYlA+QU`&LBo)>OOV)&ZG](+Oo8D"X5>%&3J@YpKr%P)n>ECk>C_OYW2pIhiESJ8\qi1;k+eA0faM1GZ\&Jo.5)8>KJkK6T8J+s1idO6%F(:F\h8=&NdR[,!:EUL_S_DceaNfSX*R:f/;sHII-a_t:UV^S(N+m>n!@;/mbEe15O6+C)Y;/f.^JZ]=&u5HmGCL)j"8s5!Yk)U@AG^CcN\jS.tsX5cpV?og#?.e+:%LHk`aE0hf@rQq@3s*mC&h-j/E-m"r>SmF@s%d(iYL/Mc.R#CV56"B386k1+QPRJM+fECb2@+V(s>i]*:6RJ]:*jY>mI';tsVQ-=M<-NOT\VI@WH[u"UZ'A&];a#8mB2&kCMPb(GVCV'-6Y5HAhm2:uD=6bjUS2=-1qRbo:tJ*[;aHt`pN>0]N`>a)<->H!b3W.nf]C(X3s+C%B=Xul`A61lL"!*fTM*;"`n;s.YOR3756.L]9\0~>endstream
+endobj
+xref
+0 8
+0000000000 65535 f
+0000000073 00000 n
+0000000104 00000 n
+0000000211 00000 n
+0000000404 00000 n
+0000000472 00000 n
+0000000768 00000 n
+0000000827 00000 n
+trailer
+<<
+/ID
+[<30157dc3b9cf65b8d1eaf3493559908e><30157dc3b9cf65b8d1eaf3493559908e>]
+% ReportLab generated PDF document -- digest (http://www.reportlab.com)
+
+/Info 5 0 R
+/Root 4 0 R
+/Size 8
+>>
+startxref
+1973
+%%EOF
diff --git a/tests/fixtures/scanned/run_gen.sh b/tests/fixtures/scanned/run_gen.sh
new file mode 100755
index 0000000..5c6d056
--- /dev/null
+++ b/tests/fixtures/scanned/run_gen.sh
@@ -0,0 +1,7 @@
+#!/run/current-system/sw/bin/bash
+# Wrapper script to run generate_scanned_fixtures.py with nix-shell dependencies
+
+nix shell nixpkgs#python3Packages.reportlab nixpkgs#python3Packages.pillow nixpkgs#python3Packages.img2pdf nixpkgs#poppler_utils --command bash -c '
+  cd "$(dirname "$0")"
+  python3 generate_scanned_fixtures.py "$@"
+' "$0" "$@"
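
For reference, the text-overlap test described in `notes/pdftract-25k4x.md` (steps 2 to 4 of the figure algorithm) can be sketched in Python, the language of the fixture generator in this diff. This is an illustrative sketch under stated assumptions, not the Rust code in `figure.rs`: the names `clip`, `union_area`, and `is_figure` are hypothetical, rectangles are assumed to be `(x0, y0, x1, y1)` tuples, and clipping glyph boxes to the image bbox before measuring is an assumption about how "intersecting" area is counted. The sweep here is a simple strip-by-strip variant rather than an optimized one.

```python
from typing import List, Optional, Tuple

# Hypothetical rectangle type: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Rect = Tuple[float, float, float, float]


def clip(rect: Rect, bounds: Rect) -> Optional[Rect]:
    """Return the part of rect inside bounds, or None if they do not overlap."""
    x0, y0 = max(rect[0], bounds[0]), max(rect[1], bounds[1])
    x1, y1 = min(rect[2], bounds[2]), min(rect[3], bounds[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)


def union_area(rects: List[Rect]) -> float:
    """Area covered by at least one rectangle, computed by sweeping over x."""
    # Between two consecutive x-coordinates the set of covering rects is fixed.
    xs = sorted({x for r in rects for x in (r[0], r[2])})
    area = 0.0
    for left, right in zip(xs, xs[1:]):
        # y-intervals of every rectangle that spans this vertical strip.
        spans = sorted((r[1], r[3]) for r in rects if r[0] <= left and r[2] >= right)
        covered = 0.0
        cur_lo = cur_hi = None
        for lo, hi in spans:
            if cur_hi is None or lo > cur_hi:
                # Start a new merged interval, closing the previous one.
                if cur_hi is not None:
                    covered += cur_hi - cur_lo
                cur_lo, cur_hi = lo, hi
            else:
                # Overlapping interval: extend instead of double counting.
                cur_hi = max(cur_hi, hi)
        if cur_hi is not None:
            covered += cur_hi - cur_lo
        area += covered * (right - left)
    return area


def is_figure(image: Rect, glyphs: List[Rect], threshold: float = 0.5) -> bool:
    """True when text covers less than `threshold` of the image area."""
    image_area = (image[2] - image[0]) * (image[3] - image[1])
    if image_area <= 0:
        return False  # assumption: degenerate images are never figures
    clipped = []
    for glyph in glyphs:
        part = clip(glyph, image)
        if part is not None:
            clipped.append(part)
    return union_area(clipped) / image_area < threshold
```

With a 10 by 10 image, two identical glyph boxes that each cover 30% of it give a union of 30%, so `is_figure` accepts the image; a naive sum would report 60% and reject it. This is the case the notes refer to with "not naive sum".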