⚡️ Speed up method HEDEdgeDetector.run by 13%
#143
📄 13% (0.13x) speedup for HEDEdgeDetector.run in invokeai/backend/image_util/hed.py
⏱️ Runtime: 84.5 milliseconds → 74.8 milliseconds (best of 58 runs)
📝 Explanation and details
The optimized code achieves a 13% speedup through several targeted NumPy optimizations that reduce memory allocations and leverage in-place operations:
Key Optimizations Applied
1. Memory-Efficient Array Operations (42.6% → 38.4% of runtime)
The most significant bottleneck, the sigmoid computation 1 / (1 + np.exp(-np.mean(edges, axis=2))), was optimized using in-place NumPy operations:
- np.exp(-edge_map, out=edge_map) computes exp in-place
- np.add(edge_map, 1, out=edge_map) adds 1 in-place
- np.reciprocal(edge_map, out=edge_map) computes 1/x in-place
This eliminates temporary array allocations during the most compute-intensive operation.
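The in-place sequence above can be sketched as follows. This is a minimal illustration rather than the code in hed.py: the function names are hypothetical, and it assumes edges is a floating-point array of shape (H, W, N). Note that -edge_map as written in the description still allocates one temporary, so the sketch uses np.negative with out= to avoid it.

```python
import numpy as np


def sigmoid_mean_reference(edges: np.ndarray) -> np.ndarray:
    """Baseline expression: each step allocates a new temporary array."""
    return 1 / (1 + np.exp(-np.mean(edges, axis=2)))


def sigmoid_mean_inplace(edges: np.ndarray) -> np.ndarray:
    """Same result, reusing the buffer returned by np.mean (hypothetical helper)."""
    edge_map = np.mean(edges, axis=2)       # the only new allocation
    np.negative(edge_map, out=edge_map)     # edge_map = -edge_map
    np.exp(edge_map, out=edge_map)          # edge_map = exp(-mean)
    np.add(edge_map, 1, out=edge_map)       # edge_map = 1 + exp(-mean)
    np.reciprocal(edge_map, out=edge_map)   # edge_map = 1 / (1 + exp(-mean))
    return edge_map
```

Both functions should agree to within floating-point tolerance. np.reciprocal is only meaningful here because edge_map is a float array; on integer arrays it performs integer division.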
2. Reduced Memory Copies
- np.asarray() instead of np.array() in pil_to_np() avoids unnecessary copying when PIL data is already a compatible array
- Dropped the .copy() call inside torch.from_numpy() since the tensor conversion doesn't require ownership
3. Pre-allocated Edge Processing
Replaced list comprehensions and np.stack() with a pre-allocated array and direct assignment.
4. Optimized NMS Function
- Filters stored once in a shared constant (_NMS_FILTERS) to avoid repeated allocation
- np.putmask() for thresholding instead of boolean indexing, reducing temporary array creation
5. Vectorized Scribble Processing
Replaced two separate boolean indexing operations with a single np.where() call, eliminating intermediate array creation.
Performance Impact by Test Case
The optimizations show improvements of roughly 3-18% across all test scenarios, with larger images benefiting most (up to 17.7% for 512x512 images). The gains are particularly notable for:
- Runs with the scribble=True flag (10-11% improvement)
These optimizations are especially valuable in image processing pipelines where edge detection may be called repeatedly on large images.
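Items 3 and 5 above can be illustrated with a short sketch. The helper names, the threshold of 4, and the two-pass baseline are assumptions made for illustration; they are not taken from the PR diff.

```python
import numpy as np


def stack_with_list(maps):
    """Baseline: build a list, then let np.stack allocate and copy."""
    return np.stack([m for m in maps], axis=2)


def stack_preallocated(maps, height, width):
    """Allocate the output once and assign each map into its slice."""
    out = np.empty((height, width, len(maps)), dtype=np.float32)
    for i, m in enumerate(maps):
        out[:, :, i] = m
    return out


def binarize_two_pass(edge, threshold=4):
    """Assumed baseline: two boolean-index assignments, two masks."""
    edge = edge.copy()
    edge[edge > threshold] = 255
    edge[edge < 255] = 0
    return edge


def binarize_where(edge, threshold=4):
    """Single np.where call: one mask, no intermediate assignment."""
    return np.where(edge > threshold, 255, 0).astype(edge.dtype)
```

The two binarize variants agree for uint8 input because any value above the threshold becomes 255 and everything else, being below 255, becomes 0.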
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
import itertools

# Function to test
import cv2
import numpy as np
# imports
import pytest
import torch
from einops import rearrange
from invokeai.backend.image_util.hed import HEDEdgeDetector
from PIL import Image


# Minimal stub for ControlNetHED_Apache2 to allow testing
class ControlNetHED_Apache2(torch.nn.Module):
    """Stub model that mimics HED behavior for testing."""

from invokeai.backend.image_util.hed import HEDEdgeDetector


# Helper to create images
def make_image(width, height, color=(128, 128, 128)):
    """Create a solid color RGB PIL image of given size."""
    arr = np.full((height, width, 3), color, dtype=np.uint8)
    return Image.fromarray(arr)

def make_gradient_image(width, height):
    """Create a horizontal gradient RGB PIL image."""
    arr = np.zeros((height, width, 3), dtype=np.uint8)
    for x in range(width):
        arr[:, x, :] = int(255 * x / (width - 1))
    return Image.fromarray(arr)

def make_noise_image(width, height, seed=42):
    """Create a random noise RGB PIL image."""
    rng = np.random.RandomState(seed)
    arr = rng.randint(0, 256, (height, width, 3), dtype=np.uint8)
    return Image.fromarray(arr)

# Basic Test Cases

def test_run_on_solid_color_image():
    """Test edge detection on a solid color image (should be mostly flat)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(32, 32, color=(128, 128, 128))
    codeflash_output = detector.run(img); out = codeflash_output # 374μs -> 339μs (10.3% faster)
    arr = np.array(out)

def test_run_on_gradient_image():
    """Test edge detection on a horizontal gradient image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img); out = codeflash_output # 337μs -> 315μs (7.13% faster)
    arr = np.array(out)

def test_run_on_noise_image():
    """Test edge detection on a random noise image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_noise_image(32, 32)
    codeflash_output = detector.run(img); out = codeflash_output # 331μs -> 315μs (4.83% faster)
    arr = np.array(out)

def test_run_safe_flag_changes_output():
    """Test that safe=True changes the output."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img, safe=False); out1 = codeflash_output # 332μs -> 314μs (5.86% faster)
    codeflash_output = detector.run(img, safe=True); out2 = codeflash_output # 225μs -> 204μs (9.86% faster)
    arr1 = np.array(out1)
    arr2 = np.array(out2)

def test_run_scribble_flag_changes_output():
    """Test that scribble=True changes the output."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img, scribble=False); out1 = codeflash_output # 328μs -> 309μs (6.14% faster)
    codeflash_output = detector.run(img, scribble=True); out2 = codeflash_output # 429μs -> 387μs (10.9% faster)
    arr1 = np.array(out1)
    arr2 = np.array(out2)

def test_run_safe_and_scribble_flags_together():
    """Test that both flags together work and change output."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img, safe=True, scribble=True); out1 = codeflash_output # 543μs -> 516μs (5.17% faster)
    codeflash_output = detector.run(img, safe=False, scribble=False); out2 = codeflash_output # 294μs -> 271μs (8.57% faster)
    arr1 = np.array(out1)
    arr2 = np.array(out2)

# Edge Test Cases

def test_run_on_minimal_image():
    """Test on 1x1 image (edge case: smallest possible)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(1, 1, color=(0, 0, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 302μs -> 283μs (6.64% faster)
    arr = np.array(out)

def test_run_on_non_square_image():
    """Test on a non-square image (e.g., 16x32)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(16, 32, color=(255, 0, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 312μs -> 302μs (3.33% faster)
    arr = np.array(out)

def test_run_on_max_channel_value_image():
    """Test on image with max channel values (255,255,255)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(32, 32, color=(255, 255, 255))
    codeflash_output = detector.run(img); out = codeflash_output # 330μs -> 309μs (6.94% faster)
    arr = np.array(out)

def test_run_on_min_channel_value_image():
    """Test on image with min channel values (0,0,0)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(32, 32, color=(0, 0, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 329μs -> 311μs (5.81% faster)
    arr = np.array(out)

def test_run_on_alpha_channel_image():
    """Test on RGBA image (should ignore alpha)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((32, 32, 4), 128, dtype=np.uint8)
    img = Image.fromarray(arr, mode="RGBA")
    # Convert to RGB for testing (since original expects RGB)
    img = img.convert("RGB")
    codeflash_output = detector.run(img); out = codeflash_output # 327μs -> 311μs (5.14% faster)
    arr_out = np.array(out)

def test_run_on_large_but_small_memory_image():
    """Test on image near 100MB limit (e.g., 512x512x3)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(512, 512, color=(64, 128, 192))
    codeflash_output = detector.run(img); out = codeflash_output # 7.20ms -> 6.17ms (16.7% faster)
    arr = np.array(out)
    # Should not crash or exceed memory

def test_run_scribble_flag_output_is_binary():
    """Test that scribble output is binary (0 or 255)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img, scribble=True); out = codeflash_output # 566μs -> 514μs (10.0% faster)
    arr = np.array(out)

def test_run_on_large_noise_image():
    """Test on a large random noise image (512x512)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_noise_image(512, 512)
    codeflash_output = detector.run(img); out = codeflash_output # 7.15ms -> 6.13ms (16.6% faster)
    arr = np.array(out)

def test_run_on_multiple_images():
    """Test running on multiple images in succession (stress test)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    for i in range(10):
        img = make_noise_image(128, 128, seed=i)
        codeflash_output = detector.run(img); out = codeflash_output # 6.55ms -> 5.52ms (18.5% faster)
        arr = np.array(out)

def test_run_on_maximum_allowed_size():
    """Test on image of maximum allowed size (e.g., 1024x32)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(1024, 32, color=(0, 255, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 1.14ms -> 1.00ms (14.1% faster)
    arr = np.array(out)
    # Should not crash

def test_run_on_minimum_allowed_size():
    """Test on image of minimum allowed size (1x1)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(1, 1, color=(255, 0, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 320μs -> 290μs (10.6% faster)
    arr = np.array(out)

def test_run_on_various_sizes():
    """Test on a variety of sizes (16x16, 64x64, 256x256, 512x512)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    for sz in [16, 64, 256, 512]:
        img = make_gradient_image(sz, sz)
        codeflash_output = detector.run(img); out = codeflash_output # 9.65ms -> 8.20ms (17.7% faster)
        arr = np.array(out)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import itertools

# Import HEDEdgeDetector from the provided code
import cv2
import numpy as np
# imports
import pytest # used for our unit tests
import torch
from einops import rearrange
from invokeai.backend.image_util.hed import HEDEdgeDetector
from PIL import Image


# Minimal stub for ControlNetHED_Apache2 to simulate model behavior
class ControlNetHED_Apache2(torch.nn.Module):
    def forward(self, x):
        # Simulate output: return 3 tensors, each with shape [1, 1, h, w]
        # Output values are deterministic for testing
        b, c, h, w = x.shape
        # Each edge map is a constant array for simplicity
        edge1 = torch.ones((1, 1, h, w), dtype=torch.float32) * 0.5
        edge2 = torch.ones((1, 1, h, w), dtype=torch.float32) * 1.0
        edge3 = torch.ones((1, 1, h, w), dtype=torch.float32) * 2.0
        return [edge1, edge2, edge3]

from invokeai.backend.image_util.hed import HEDEdgeDetector

# -------------------- UNIT TESTS --------------------

# Helper for comparing PIL images
def images_equal(img1, img2):
    arr1 = np.array(img1)
    arr2 = np.array(img2)
    return np.array_equal(arr1, arr2)

# Basic test cases

def test_basic_grayscale_image():
    """Test with a simple grayscale image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((32, 32, 3), 128, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 338μs -> 305μs (10.8% faster)

def test_basic_color_image():
    """Test with a simple color image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.zeros((16, 16, 3), dtype=np.uint8)
    arr[:8, :8] = [255, 0, 0] # Red block
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 291μs -> 267μs (9.07% faster)
    # Output array should have values in [0, 255]
    out_arr = np.array(out)

def test_basic_safe_flag():
    """Test with safe=True."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((10, 10, 3), 200, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img, safe=True); out_safe = codeflash_output # 285μs -> 268μs (6.10% faster)
    codeflash_output = detector.run(img, safe=False); out_normal = codeflash_output # 174μs -> 153μs (13.7% faster)

def test_basic_scribble_flag():
    """Test with scribble=True."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((12, 12, 3), 50, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img, scribble=True); out_scribble = codeflash_output # 404μs -> 380μs (6.13% faster)
    codeflash_output = detector.run(img, scribble=False); out_normal = codeflash_output # 241μs -> 221μs (9.25% faster)

def test_basic_safe_and_scribble():
    """Test with both safe and scribble True."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((8, 8, 3), 100, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img, safe=True, scribble=True); out = codeflash_output # 383μs -> 361μs (6.08% faster)

# Edge test cases

def test_edge_minimal_image():
    """Test with minimal 1x1 image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.array([[[0, 0, 0]]], dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 273μs -> 260μs (4.97% faster)
    out_arr = np.array(out)

def test_edge_max_value_image():
    """Test with all pixels at max value."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((5, 5, 3), 255, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 269μs -> 256μs (5.10% faster)
    out_arr = np.array(out)

def test_edge_min_value_image():
    """Test with all pixels at min value."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.zeros((5, 5, 3), dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 273μs -> 263μs (4.03% faster)
    out_arr = np.array(out)

def test_edge_non_square_image():
    """Test with non-square image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((10, 20, 3), 123, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 278μs -> 259μs (7.28% faster)

def test_edge_invalid_dtype():
    """Test with float32 image array."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((10, 10, 3), 0.5, dtype=np.float32)
    img = Image.fromarray(arr.astype(np.uint8))
    codeflash_output = detector.run(img); out = codeflash_output # 295μs -> 278μs (6.04% faster)

def test_large_image():
    """Test with a large image (max 1000x1000, <100MB)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((512, 512, 3), 128, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 6.91ms -> 5.87ms (17.7% faster)
    out_arr = np.array(out)

def test_large_image_safe_and_scribble():
    """Test with large image and both flags True."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((256, 512, 3), 200, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img, safe=True, scribble=True); out = codeflash_output # 5.43ms -> 5.20ms (4.42% faster)

def test_large_random_image():
    """Test with a large random image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    rng = np.random.default_rng(42)
    arr = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 3.23ms -> 2.77ms (16.7% faster)
    out_arr = np.array(out)

def test_large_edge_map_values():
    """Test that output edge map values are within expected bounds for large image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((999, 999, 3), 180, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 28.1ms -> 25.3ms (11.0% faster)
    out_arr = np.array(out)

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
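The equivalence check described above might look roughly like the sketch below. The helper outputs_match and the two stand-in functions are hypothetical; how Codeflash actually performs the comparison is not shown in this PR.

```python
import numpy as np


def outputs_match(original_fn, optimized_fn, inputs) -> bool:
    """Return True if both callables produce identical arrays for every input."""
    return all(
        np.array_equal(np.asarray(original_fn(x)), np.asarray(optimized_fn(x)))
        for x in inputs
    )


# Stand-ins for an "original" and an "optimized" implementation.
def double_by_multiply(arr):
    return arr * 2


def double_by_add(arr):
    return arr + arr


samples = [np.arange(6).reshape(2, 3), np.zeros((4, 4), dtype=np.int64)]
same = outputs_match(double_by_multiply, double_by_add, samples)
```

For floating-point pipelines, np.allclose with an explicit tolerance may be a better fit than exact equality, since in-place rewrites can change rounding slightly.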
To edit these changes, git checkout codeflash/optimize-HEDEdgeDetector.run-mhvo3ag6 and push.