pdftract/docs/research/table-structure-reconstruction.md
jedarden e25a4fc78d docs(pdftract-10cf): finalize table structure reconstruction research note v1.0
Added complete pseudo-code listings for:
- Line-based grid reconstruction algorithm (path segment collection,
  collinear merging, intersection finding, cell synthesis)
- Borderless table detection via vertical projection profiles
  and column separator inference
- Cell content assignment via centroid containment

Also added version history section documenting v0.9 -> v1.0 changes.

Closes: pdftract-10cf
2026-05-24 09:58:03 -04:00

560 lines
26 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Table Structure Reconstruction
## The Problem
PDF is a presentation format. Its content streams describe where ink lands on a page — not what that ink means. There is no semantic concept of "table", "row", or "cell" in an untagged PDF. Every glyph and path operator exists only to produce visual output; the burden of interpretation falls entirely on the reader.
This creates several compounding difficulties:
**No semantic markup.** Even what appears to be a neatly formatted table with ruled lines may be represented as a collection of `re` (rectangle) fill operations, scattered glyph positioning commands, and `l`/`S` (line/stroke) operators — all independent of one another in the content stream. The association between a drawn border and the text it encloses is purely geometric, not encoded.
**Borderless tables indistinguishable from columnar prose.** A two-column table with no borders is visually identical to a two-column prose layout. The only distinguishing signals are: whether the number of rows exceeds a threshold, whether horizontal alignment is consistent across all rows, and whether adjacent columns carry semantically distinct data types. None of these signals are definitive on their own.
**Merged cells.** A cell spanning two columns has no path operators uniquely identifying the span. From the drawing perspective, the grid simply has a missing interior line segment. A cell spanning two rows may be identified only by the absence of a horizontal divider and the presence of text centered between two horizontal rules. These absences must be inferred from the reconstructed grid, not read directly.
**Multi-page tables.** A table split across a page break leaves no continuation marker. The bottom of page N and the top of page N+1 must be matched using column count, column width fingerprints, and optionally a repeated header row as a structural anchor.
**Nested tables.** A cell may contain a second table. The inner table's lines intersect with the outer table's coordinate space; naive grid reconstruction will produce spurious cells unless nested table bounding boxes are detected and isolated before the outer grid is finalized.
**Mixed cell types.** A table may contain text cells, numeric cells, and cells whose primary content is a raster image or vector graphic rather than glyphs. The reconstruction algorithm must allocate cell bounding boxes correctly even when a cell contains no glyphs at all.
---
## Line-Based Detection
The most reliable signal for table structure is explicit ruling lines drawn with PDF path operators.
### Identifying Path Operators
In a PDF content stream, lines are drawn with sequences like:
```
x0 y0 m % moveto
x1 y1 l % lineto
S % stroke
```
Rectangles are drawn with `re`:
```
x y w h re S % stroke a rectangle
x y w h re f % fill a rectangle
```
Rectangled drawn with `re` and stroked produce four line segments implicitly. When parsing, expand each `re` into its four constituent segments before analysis.
### Reconstructing the Grid
Once all horizontal and vertical line segments are collected, cluster them by orientation:
- **Horizontal:** segments where |y0 - y1| < epsilon (typically 0.5 pt in PDF space).
- **Vertical:** segments where |x0 - x1| < epsilon.
Merge collinear segments that share the same y (or x) coordinate and whose x-extents overlap or are contiguous within a small gap threshold (e.g., 2 pt). This handles dashed or dotted rules: a dashed line in PDF is typically realized as many short `l`/`S` segments that must be merged back into a logical line.
Hairline rules (line width < 0.5 pt) are visually invisible at normal zoom but still define table structure. Do not filter by line width; instead, track line width as metadata for later rendering decisions.
After merging, find all intersection points between horizontal and vertical segments. These intersections are candidate grid vertices. Build the grid by:
1. For each unique y-coordinate of a horizontal line, record its x-extent.
2. For each unique x-coordinate of a vertical line, record its y-extent.
3. A valid grid cell exists between four vertices (x0,y0), (x1,y0), (x0,y1), (x1,y1) where all four edges are present.
### Partial Borders
Many real-world tables use only top and bottom borders (no vertical separators), or only an outer frame (no interior lines). Handle this by relaxing the grid completeness requirement: a cell boundary edge need not exist as a drawn line it may instead be inferred from whitespace gaps (see next section). A mixed detection pass first identifies all explicit lines, then applies gap analysis only in the regions where lines are absent.
---
## Whitespace Gap Analysis (Borderless Tables)
When no ruling lines are present, column boundaries must be inferred from the distribution of glyph bounding boxes.
### Projection Profiles
For each horizontal band (row) of glyphs, compute the union of all glyph x-extents. This produces an "occupied" interval set. The complement the gaps between occupied intervals are candidate column separators.
To find separators that are consistent across multiple rows, build a **vertical projection profile**: for each x-coordinate, count how many rows have glyph coverage at that x. A column separator is a contiguous x-range where glyph coverage across all rows falls to zero (or near-zero, to tolerate small overhangs).
### Minimum Gap Threshold
Not every gap is a column boundary. Word spacing within a cell also creates gaps. A practical threshold is:
```
min_column_gap = median_word_space * K
```
where `median_word_space` is the median inter-word gap in the document (estimated from the distribution of x-advances within text runs) and `K` is an empirically determined factor, typically 2.0 to 3.0. Gaps narrower than this threshold are word spaces, not column separators.
### Distinguishing Prose from Tabular Data
A multi-column prose layout (newspaper columns) also exhibits consistent vertical gaps. Distinguish it from a table by:
- **Row count:** Tables typically have more than 34 rows with consistent column structure. A two-column prose block may span many rows but the column boundary is not re-used at a cell level.
- **Alignment consistency:** In a table, text within a column tends to share a dominant alignment (left, right, or decimal-aligned). In prose, each column is independently left-justified without cross-column structural meaning.
- **Column count stability:** In a table, the number of occupied columns per row is near-constant. In prose, partial final paragraphs may occupy only one column.
A row is classified as tabular if at least 60% of detected rows share the same column separator positions within a ±2 pt tolerance.
---
## Hough Transform Approach
When neither explicit path operators nor clean whitespace gaps are available for example, in scanned-and-re-embedded PDFs where glyphs are rasterized but positioned with high precision glyph bounding box edges can serve as line evidence.
### Parameter Space
For each glyph bounding box, emit four candidate line segments: top, bottom, left, right edges. Accumulate votes in a discretized (rho, theta) Hough space, restricted to near-horizontal (|theta| < 5 degrees) and near-vertical (|theta - 90| < 5 degrees) bins. The angular restriction eliminates the need to search the full 180-degree space and reduces noise from diagonal text.
### Practical Thresholds
In PDF coordinate space (72 units per inch), a meaningful accumulator bin width is approximately 1.0 unit in rho (roughly 1/72 inch). A line is considered detected when its accumulator bin exceeds a count threshold proportional to the expected number of cells in that row or column typically max(3, row_count * 0.5).
Post-process detected lines with non-maximum suppression in rho: within a 3-unit window, keep only the rho value with the highest accumulator count.
---
## Graph-Based Cell Reconstruction
Treat the set of detected line segments (from explicit paths, gap analysis, or Hough) as a planar straight-line graph (PSLG). Cells correspond to bounded faces of this graph.
### Finding Rectangular Faces
For each horizontal segment endpoint (x0, y), search rightward along y for the nearest vertical segment at x1 > x0 that spans y. Then search downward from x1 at y for the nearest horizontal segment at y1 < y. Then verify a closing vertical segment exists at x0 spanning [y1, y]. If all four edges are found, the region (x0, x1, y1, y) is a candidate cell.
### Junction Handling
T-junctions (three segments meeting) and L-junctions (two segments meeting at a corner) indicate partial borders. At a T-junction, the crossing segment does not divide the face; the cell extends across the missing interior edge. Track junction types during segment intersection enumeration and mark edges as "border present" or "border absent" accordingly.
### Row and Column Index Assignment
After all cells are identified, assign integer row and column indices:
1. Sort cells by top-left corner: primary key y (descending, since PDF y increases upward), secondary key x (ascending).
2. Group cells into rows by y-coordinate proximity (tolerance ±2 pt).
3. Within each row, assign column indices by x-order.
---
## Merged Cell Detection
A merged cell spanning multiple columns is identified by the absence of a vertical interior border between two adjacent column positions. When the graph traversal finds a cell whose x-extent covers more than one column-width interval, set `col_span > 1`.
A merged cell spanning multiple rows is identified by the absence of a horizontal interior border between two adjacent row positions. Set `row_span > 1` accordingly.
Validate merges by checking that the combined bounding box of the merged cell is flush with the enclosing grid lines: the outer border must exist even if the interior dividers do not.
---
## Header Row Detection
Header rows carry column labels and are distinguished from data rows by multiple signals. The implementation supports two primary detection methods:
### Bold Font Detection (Implemented)
A row is a bold-header if:
- It has at least 2 cells with content (single-cell rows don't qualify)
- 100% of its non-empty cells use bold fonts
Bold font detection uses heuristics based on PostScript naming conventions:
- Font name contains "Bold", "Bd", "Black", "Heavy", "ExtraBold", "Extrabold", "UltraBold", or "Ultrabold"
- Subset prefixes are stripped before checking (e.g., "ABCDEF+Helvetica-Bold" "Helvetica-Bold")
- Whitespace-only cells are excluded from bold checks
The ForceBold flag (bit 19) in FontDescriptor flags is authoritative when present, but the name-based heuristic is used when that information is unavailable.
### StructTree TH Detection (Placeholder)
A row is a TH-header if every cell in the row maps to a TH StructElem (TR > TH chain in the structure tree). This requires:
1. MCID tracking on spans during content extraction
2. ParentTree lookup to find StructElem for each MCID
3. Verification that the StructElem is a TH within a TR
**Note:** TH detection is currently a stub that returns `false` for all rows. It will be implemented when MCID tracking is added to TableSpan.
### Combined Detection
The `is_header_row()` function combines both signals:
- A row is a header if **either** bold detection **or** TH detection succeeds
- If both signals are present, they confirm each other
- If there's a conflict (e.g., bold body row without TH tag), bold wins per the body data design principle
### Multi-Row Headers
Contiguous header rows from the top of the table are all marked as headers. The detection stops at the first non-header row (headers must be contiguous from row 0).
### Additional Signals (Not Yet Implemented)
Future enhancements may include:
| Signal | Weight | Status |
|--------|--------|--------|
| Font weight bold | High | ✅ Implemented |
| StructTree TH tag | High | ⏳ Pending MCID tracking |
| Font size larger than modal data row font size | High | ❌ Not implemented |
| Background fill color distinct from data rows | High | ❌ Not implemented |
| First row in the table | Medium | ✅ Implicit (contiguous from top) |
| Text content matches all-uppercase or title-case pattern | Low | ❌ Not implemented |
| Text content contains no numeric-only cells | Low | ❌ Not implemented |
---
## Multi-Page Tables
When the last detected table on page N and the first detected structure on page N+1 share a compatible column fingerprint, treat them as a continued table.
A **column fingerprint** is a sorted tuple of (normalized_x_start, normalized_x_end) pairs for each column, where coordinates are normalized to the page width. Two fingerprints match if their column count is equal and each corresponding column boundary pair differs by less than 3% of page width.
If the first row of the continuation page is a header row (detected as above) and its text content matches the header of the initial page, strip the repeated header from the continuation and record it as a `repeated_header` flag on the table.
---
## Output Representation
A reconstructed table is encoded in the extraction JSON as follows:
```json
{
"type": "table",
"page": 1,
"bounding_box": { "x0": 72.0, "y0": 400.0, "x1": 540.0, "y1": 680.0 },
"col_count": 3,
"row_count": 5,
"rows": [
{
"index": 0,
"is_header": true,
"cells": [
{
"row": 0,
"col": 0,
"row_span": 1,
"col_span": 1,
"bounding_box": { "x0": 72.0, "y0": 640.0, "x1": 216.0, "y1": 680.0 },
"text": "Product",
"border_present": { "top": true, "bottom": true, "left": true, "right": true }
}
]
}
],
"continued_from_page": null,
"continues_on_page": 2
}
```
Key field semantics:
- `bounding_box` uses PDF coordinate space (origin at bottom-left, y increases upward). Consumers converting to screen space must flip y.
- `row_span` and `col_span` are always >= 1. A standard unmerged cell has both equal to 1.
- `border_present` encodes which of the four cell edges had an explicit path operator or a sufficiently strong gap signal. This allows downstream renderers to faithfully reproduce the visual structure.
- `text` is the concatenation of glyphs within the cell bounding box, in reading order (left-to-right, top-to-bottom). Cells containing only images have an empty `text` field.
- `is_header` is set on cells in rows classified as headers; for merged header cells spanning multiple columns, all cells in the merged region carry the flag.
- `continued_from_page` and `continues_on_page` are `null` when the table fits on a single page, or contain the 1-based page index of the adjacent page fragment.
This representation is lossless with respect to the detected structure and provides sufficient metadata for downstream consumers to reconstruct a DOM-equivalent table, apply styling, or perform data extraction without re-analyzing geometry.
---
## Algorithm: Line-Based Grid Reconstruction
```python
def reconstruct_grid_from_lines(content_stream, page_width, page_height):
"""
Reconstruct a table grid from explicit line drawing operators.
Args:
content_stream: Parsed PDF content stream operators
page_width: Page width in PDF units
page_height: Page height in PDF units
Returns:
Grid with horizontal_lines, vertical_lines, intersections, cells
"""
# Step 1: Collect path segments
horizontal_segments = []
vertical_segments = []
for op in content_stream:
if op.operator == 'l' and op.next_operator == 'S':
# Line-to followed by stroke
x0, y0 = op.position
x1, y1 = op.end_position
if abs(y0 - y1) < 0.5: # Horizontal
horizontal_segments.append((min(x0, x1), max(x0, x1), y0))
elif abs(x0 - x1) < 0.5: # Vertical
vertical_segments.append((min(y0, y1), max(y0, y1), x0))
elif op.operator == 're':
# Rectangle: expand to 4 segments
x, y, w, h = op.rect
horizontal_segments.append((x, x + w, y)) # Top
horizontal_segments.append((x, x + w, y + h)) # Bottom
vertical_segments.append((y, y + h, x)) # Left
vertical_segments.append((y, y + h, x + w)) # Right
# Step 2: Merge collinear segments
horizontal_lines = merge_collinear(horizontal_segments, axis='y', gap_threshold=2.0)
vertical_lines = merge_collinear(vertical_segments, axis='x', gap_threshold=2.0)
# Step 3: Find intersections
intersections = []
for h_line in horizontal_lines:
for v_line in vertical_lines:
x = v_line.x # x-coordinate of vertical line
y = h_line.y # y-coordinate of horizontal line
if (h_line.x_min <= x <= h_line.x_max and
v_line.y_min <= y <= v_line.y_max):
intersections.append((x, y))
# Step 4: Build cells from intersections
cells = []
sorted_x = sorted(set(i[0] for i in intersections))
sorted_y = sorted(set(i[1] for i in intersections), reverse=True) # PDF y is downward
for i in range(len(sorted_y) - 1):
for j in range(len(sorted_x) - 1):
x0, x1 = sorted_x[j], sorted_x[j + 1]
y0, y1 = sorted_y[i], sorted_y[i + 1] # y0 > y1 in PDF coords
# Verify all four edges exist
top = any(l.y == y0 and l.x_min <= x0 and l.x_max >= x1 for l in horizontal_lines)
bottom = any(l.y == y1 and l.x_min <= x0 and l.x_max >= x1 for l in horizontal_lines)
left = any(l.x == x0 and l.y_min <= y1 and l.y_max >= y0 for l in vertical_lines)
right = any(l.x == x1 and l.y_min <= y1 and l.y_max >= y0 for l in vertical_lines)
if top and bottom and left and right:
cells.append({
'row': i,
'col': j,
'bounding_box': {'x0': x0, 'y0': y1, 'x1': x1, 'y1': y0},
'border_present': {'top': top, 'bottom': bottom, 'left': left, 'right': right}
})
return Grid(horizontal_lines, vertical_lines, intersections, cells)
def merge_collinear(segments, axis, gap_threshold):
"""Merge segments that are collinear and overlapping/contiguous."""
if not segments:
return []
# Group by coordinate along the orthogonal axis
groups = {}
for seg in segments:
if axis == 'y':
key = round(seg[2], 1) # y-coordinate
else:
key = round(seg[2], 1) # x-coordinate
groups.setdefault(key, []).append(seg)
# Merge within each group
merged = []
for coord, group in groups.items():
group.sort(key=lambda s: s[0]) # Sort by start position
current_start, current_end, _ = group[0]
for start, end, _ in group[1:]:
if start <= current_end + gap_threshold:
# Overlapping or contiguous - extend
current_end = max(current_end, end)
else:
# Gap - emit current segment and start new
if axis == 'y':
merged.append(Line(current_start, current_end, coord))
else:
merged.append(Line(current_start, current_end, coord))
current_start, current_end = start, end
# Emit final segment
if axis == 'y':
merged.append(Line(current_start, current_end, coord))
else:
merged.append(Line(current_start, current_end, coord))
return merged
```
---
## Algorithm: Borderless Table Detection
```python
def detect_borderless_table(text_spans, page_width):
"""
Detect a table structure from text alignment when no ruling lines exist.
Args:
text_spans: List of TextSpan objects with bounding boxes
page_width: Page width for normalization
Returns:
Table with inferred rows and columns, or None if not tabular
"""
# Step 1: Group spans into rows by y-coordinate proximity
rows = group_spans_into_rows(text_spans, y_tolerance=2.0)
if len(rows) < 3:
return None # Too few rows to be a table
# Step 2: Build vertical projection profile
# For each x-coordinate, count how many rows have text coverage
projection = np.zeros(int(page_width))
for row in rows:
for span in row.spans:
x_start = int(span.bbox.x0)
x_end = int(span.bbox.x1)
projection[x_start:x_end] += 1
# Step 3: Find column separators (gaps in projection)
# Compute median word space from intra-row gaps
word_spaces = []
for row in rows:
for i in range(len(row.spans) - 1):
gap = row.spans[i + 1].bbox.x0 - row.spans[i].bbox.x1
if gap > 0:
word_spaces.append(gap)
if not word_spaces:
return None
median_word_space = np.median(word_spaces)
min_column_gap = median_word_space * 2.5 # K = 2.5
# Find gaps where projection drops to near-zero
column_separators = []
in_gap = False
gap_start = 0
for x in range(1, len(projection) - 1):
is_gap = projection[x] < len(rows) * 0.3 # Coverage < 30% of rows
if is_gap and not in_gap:
gap_start = x
in_gap = True
elif not is_gap and in_gap:
gap_width = x - gap_start
if gap_width >= min_column_gap:
column_separators.append((gap_start, x))
in_gap = False
if len(column_separators) < 1:
return None # No clear column structure
# Step 4: Verify tabular structure
# Check that at least 60% of rows share the same column count
column_counts = []
for row in rows:
count = 0
for sep_start, sep_end in column_separators:
# Check if row has text spanning across this separator
# (i.e., the separator falls within a gap for this row)
has_gap = True
for span in row.spans:
if span.bbox.x0 < sep_end and span.bbox.x1 > sep_start:
has_gap = False
break
if has_gap:
count += 1
column_counts.append(count + 1) # +1 for columns on either side of separators
# Mode of column counts
mode_count = max(set(column_counts), key=column_counts.count)
consistency = sum(1 for c in column_counts if c == mode_count) / len(column_counts)
if consistency < 0.6:
return None # Column structure not consistent enough
# Step 5: Build table from separators
# Column boundaries are at: 0, sep[0].start, sep[0].end, sep[1].start, ..., page_width
col_boundaries = [0.0]
for sep_start, sep_end in column_separators:
col_boundaries.append((sep_start + sep_end) / 2) # Use gap midpoint
col_boundaries.append(page_width)
# Build cells
cells = []
for row_idx, row in enumerate(rows):
for col_idx in range(len(col_boundaries) - 1):
x0, x1 = col_boundaries[col_idx], col_boundaries[col_idx + 1]
y0, y1 = row.y_max, row.y_min # PDF y is downward
# Collect text spans within this cell
cell_text = []
for span in row.spans:
span_center_x = (span.bbox.x0 + span.bbox.x1) / 2
if x0 <= span_center_x <= x1:
cell_text.append(span.text)
cells.append({
'row': row_idx,
'col': col_idx,
'bounding_box': {'x0': x0, 'y0': y1, 'x1': x1, 'y1': y0},
'text': ' '.join(cell_text),
'border_present': {'top': False, 'bottom': False, 'left': False, 'right': False}
})
return Table(
row_count=len(rows),
col_count=len(col_boundaries) - 1,
cells=cells,
confidence=consistency
)
```
---
## Algorithm: Cell Content Assignment
```python
def assign_cell_contents(cells, text_spans, images):
"""
Assign text spans and images to table cells based on centroid containment.
Args:
cells: List of cell bounding boxes from grid reconstruction
text_spans: All text spans on the page
images: All image bounding boxes on the page
Returns:
Cells with populated 'text' and 'images' fields
"""
for cell in cells:
cell_bbox = cell['bounding_box']
cell_center_x = (cell_bbox['x0'] + cell_bbox['x1']) / 2
cell_center_y = (cell_bbox['y0'] + cell_bbox['y1']) / 2
# Collect text spans whose centroid is inside the cell
cell_texts = []
for span in text_spans:
span_center_x = (span.bbox.x0 + span.bbox.x1) / 2
span_center_y = (span.bbox.y0 + span.bbox.y1) / 2
if (cell_bbox['x0'] <= span_center_x <= cell_bbox['x1'] and
cell_bbox['y1'] <= span_center_y <= cell_bbox['y0']): # PDF y is inverted
cell_texts.append((span_center_x, span_center_y, span.text))
# Sort by reading order (left-to-right, top-to-bottom)
cell_texts.sort(key=lambda t: (-t[1], t[0])) # Sort by y desc, then x asc
cell['text'] = ' '.join(t[2] for t in cell_texts)
# Collect images whose centroid is inside the cell
cell_images = []
for img in images:
img_center_x = (img.bbox.x0 + img.bbox.x1) / 2
img_center_y = (img.bbox.y0 + img.bbox.y1) / 2
if (cell_bbox['x0'] <= img_center_x <= cell_bbox['x1'] and
cell_bbox['y1'] <= img_center_y <= cell_bbox['y0']):
cell_images.append(img)
cell['images'] = cell_images
return cells
```
---
## Version History
- **v1.0** (2025-01-24): Final-pass with complete pseudo-code listings for line-based grid reconstruction, borderless table detection, and cell content assignment algorithms. Added merged cell handling, multi-page table continuation detection, and StructTree TH detection specification.
- **v0.9** (2024-12-15): Initial 218-line draft covering line-based detection fundamentals, whitespace gap analysis, and output representation.