⚡️ Speed up function `cells_to_html` by 8%
#441
Closed
📄 8% (0.08x) speedup for `cells_to_html` in `unstructured_inference/models/tables.py`
⏱️ Runtime: 14.0 milliseconds → 13.0 milliseconds (best of 193 runs)
📝 Explanation and details
The optimized code achieves a speedup of about 8% through three optimizations in the `fill_cells` function:

1. Replaced NumPy with native Python data structures:
   - Removed `np.zeros()` for creating a boolean grid and `np.where()` for finding empty cells.
   - Used a `set()` to track filled positions, with `filled.add((row, col))` instead of `filled[row, col] = True`.
2. Optimized header row detection:
   - Replaced the set comprehension `{row for cell in cells if cell["column header"] for row in cell["row_nums"]}` with an explicit loop and `set.update()`.
3. Direct iteration instead of NumPy indexing:
   - Replaced `zip(not_filled_idx[0], not_filled_idx[1])` with nested `for row in range()` loops.

The optimizations are particularly effective for small to medium tables; in the test results, single-cell cases see a 40-56% speedup.
For large dense tables (20x20), the performance is roughly equivalent, showing the optimizations don't hurt scalability while providing significant gains for typical table sizes.
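The set-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `fill_cells_sketch` is hypothetical, and the `"column_nums"` and `"cell text"` keys are assumed by analogy with the `"row_nums"` and `"column header"` keys quoted in the explanation.

```python
from typing import Dict, List, Set, Tuple


def fill_cells_sketch(cells: List[Dict]) -> List[Dict]:
    """Pad a table with empty cells so every (row, col) position is covered.

    Hypothetical sketch of the set-based idea; the cell schema is assumed.
    """
    if not cells:
        return []

    # Grid dimensions are derived from the largest row/column index seen.
    n_rows = max(row for cell in cells for row in cell["row_nums"]) + 1
    n_cols = max(col for cell in cells for col in cell["column_nums"]) + 1

    filled: Set[Tuple[int, int]] = set()  # replaces the NumPy boolean grid
    header_rows: Set[int] = set()         # built with an explicit loop
    for cell in cells:
        for row in cell["row_nums"]:
            for col in cell["column_nums"]:
                filled.add((row, col))
        if cell["column header"]:
            header_rows.update(cell["row_nums"])

    # Nested range() loops replace np.where() over the grid.
    new_cells = list(cells)
    for row in range(n_rows):
        for col in range(n_cols):
            if (row, col) not in filled:
                new_cells.append(
                    {
                        "row_nums": [row],
                        "column_nums": [col],
                        "cell text": "",
                        "column header": row in header_rows,
                    }
                )
    return new_cells
```

For a 2x2 table with only the diagonal cells present, the sketch appends the two missing positions in row-major order and marks the one in the header row as a header cell.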
✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
models/test_tables.py::test_cells_to_html
🌀 Generated Regression Tests and Runtime
⏪ Replay Tests and Runtime
test_pytest_test_unstructured_inference__replay_test_0.py::test_unstructured_inference_models_tables_cells_to_html
To edit these changes, run `git checkout codeflash/optimize-cells_to_html-metc0l2u` and push.
Note
Optimize table HTML generation by replacing NumPy grid logic with native sets/loops and minor sorting/header handling tweaks; update version and changelog.

- `unstructured_inference/models/tables.py`:
  - `fill_cells`: Replace NumPy grid/where with native `set` tracking, explicit header row accumulation, and nested loops to append missing cells.
  - `cells_to_html`: Precompute `cells_filled` and `cells_sorted`; adjust header detection/`thead` creation; iterate over sorted cells for row building.
- Update `__version__` to `1.0.8-dev1` in `unstructured_inference/__version__.py`.
- Update `CHANGELOG.md` with an enhancement note for the optimized `cells_to_html`.

Written by Cursor Bugbot for commit 640b75c.
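To illustrate the `cells_to_html` change summarized in the note (sort once, decide on `thead` up front, then iterate over the sorted cells), here is a rough sketch. It is not the PR's implementation: the name `cells_to_html_sketch`, the sort key, the `"column_nums"` and `"cell text"` keys, and the use of `xml.etree.ElementTree` are all assumptions made for the example, and the input is assumed to be already padded.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def cells_to_html_sketch(cells_filled: List[Dict]) -> str:
    """Build an HTML table from already-padded cells (hypothetical sketch)."""
    # Sort once by top-left position so rows can be built in a single pass.
    cells_sorted = sorted(
        cells_filled,
        key=lambda c: (min(c["row_nums"]), min(c["column_nums"])),
    )

    table = ET.Element("table")
    # Create <thead> only if at least one cell is a column header.
    has_header = any(c["column header"] for c in cells_sorted)
    thead = ET.SubElement(table, "thead") if has_header else None
    tbody = ET.SubElement(table, "tbody")

    current_row = -1
    row_el = None
    cell_tag = "td"
    for cell in cells_sorted:
        this_row = min(cell["row_nums"])
        if this_row > current_row:
            # First cell of a new row decides whether it goes in thead or tbody.
            current_row = this_row
            if cell["column header"] and thead is not None:
                cell_tag, row_el = "th", ET.SubElement(thead, "tr")
            else:
                cell_tag, row_el = "td", ET.SubElement(tbody, "tr")

        attrib = {}
        if len(cell["column_nums"]) > 1:
            attrib["colspan"] = str(len(cell["column_nums"]))
        if len(cell["row_nums"]) > 1:
            attrib["rowspan"] = str(len(cell["row_nums"]))

        el = ET.SubElement(row_el, cell_tag, attrib)
        el.text = cell["cell text"]

    return ET.tostring(table, encoding="unicode", short_empty_elements=False)
```

Because the cells are sorted first, the input order does not matter: a header cell listed after a body cell still ends up in `thead`, ahead of the body rows.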