⚡️ Speed up method `HEDEdgeDetector.run` by 13% #143

codeflash-ai · 2025-11-12T07:17:13Z

📄 13% (0.13x) speedup for `HEDEdgeDetector.run` in `invokeai/backend/image_util/hed.py`

⏱️ Runtime : 84.5 milliseconds → 74.8 milliseconds (best of 58 runs)
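The headline figure appears to be measured relative to the optimized runtime, not the original one. The check below is a quick sketch of that reading; the formula is an assumption inferred from the two reported runtimes, not something the PR states.

```python
# Runtimes reported above, in milliseconds.
original_ms = 84.5
optimized_ms = 74.8

# Assumed definition: time saved divided by the optimized runtime.
speedup_percent = round((original_ms - optimized_ms) / optimized_ms * 100)

# For comparison: time saved as a share of the original runtime.
time_saved_percent = round((original_ms - optimized_ms) / original_ms * 100)

print(speedup_percent, time_saved_percent)
```

Under this reading the 13% headline is reproduced, while the share of the original runtime that was saved is a little lower.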

📝 Explanation and details

The optimized code achieves a 13% speedup through several targeted NumPy optimizations that reduce memory allocations and leverage in-place operations:

Key Optimizations Applied

1. Memory-Efficient Array Operations (42.6% → 38.4% of runtime)
The most significant bottleneck, the sigmoid computation `1 / (1 + np.exp(-np.mean(edges, axis=2)))`, was rewritten with in-place NumPy operations:

- `np.exp(-edge_map, out=edge_map)` computes the exponential in place
- `np.add(edge_map, 1, out=edge_map)` adds 1 in place
- `np.reciprocal(edge_map, out=edge_map)` computes 1/x in place

This removes most of the temporary arrays allocated during the most compute-intensive operation; the negation `-edge_map` still creates one.
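The in-place sigmoid can be sketched as follows. This is a minimal illustration, not the PR's exact code: the function names are hypothetical, and it uses `np.negative(..., out=...)` for the negation step, which goes one step further than the `-edge_map` expression quoted above.

```python
import numpy as np

def sigmoid_naive(edges: np.ndarray) -> np.ndarray:
    # Original expression: each arithmetic step allocates a new temporary.
    return 1 / (1 + np.exp(-np.mean(edges, axis=2)))

def sigmoid_inplace(edges: np.ndarray) -> np.ndarray:
    # np.mean allocates the (H, W) result once; every later step reuses it.
    edge_map = np.mean(edges, axis=2)
    np.negative(edge_map, out=edge_map)    # -x
    np.exp(edge_map, out=edge_map)         # exp(-x)
    np.add(edge_map, 1, out=edge_map)      # 1 + exp(-x)
    np.reciprocal(edge_map, out=edge_map)  # 1 / (1 + exp(-x))
    return edge_map

rng = np.random.default_rng(0)
edges = rng.standard_normal((64, 48, 5)).astype(np.float32)
sigmoid_matches = np.allclose(sigmoid_naive(edges), sigmoid_inplace(edges), atol=1e-6)
```

Both versions compute the same values; the second only changes where the intermediate results are stored.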

2. Reduced Memory Copies

- `np.asarray()` instead of `np.array()` in `pil_to_np()` avoids an extra copy when the input can already be used as an array
- Removed the `.copy()` on the array passed to `torch.from_numpy()`, since the resulting tensor does not need to own its memory
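The difference between the two NumPy constructors can be seen on a plain array. This sketch uses an `ndarray` as input rather than a PIL image, since how much copying PIL conversion involves depends on the Pillow version; the variable names are illustrative only.

```python
import numpy as np

source = np.zeros((4, 4, 3), dtype=np.uint8)

copied = np.array(source)    # np.array copies its input by default
viewed = np.asarray(source)  # np.asarray returns the input itself when no conversion is needed

shares_with_copy = np.shares_memory(source, copied)
shares_with_view = np.shares_memory(source, viewed)
```

The same trade-off applies to `torch.from_numpy()`, which shares memory with the array it is given: dropping `.copy()` saves an allocation, but the tensor and the array then alias each other.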

3. Pre-allocated Edge Processing
Replaced list comprehensions and `np.stack()` with a pre-allocated array and direct assignment:

resized_edges = np.empty((height, width, n_edges), dtype=np.float32)
for idx, e in enumerate(edges_out):
    # Direct assignment to pre-allocated array
    resized_edges[:, :, idx] = processed_edge
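A self-contained version of this pattern might look like the sketch below. It leaves out the per-edge resizing that the real method performs, and the function names are hypothetical.

```python
import numpy as np

def stack_edges(edge_maps):
    # Baseline: build a list of converted arrays, then let np.stack copy them all.
    return np.stack([e.astype(np.float32) for e in edge_maps], axis=2)

def fill_preallocated(edge_maps, height, width):
    # Allocate the output once and write each edge map into its own channel.
    out = np.empty((height, width, len(edge_maps)), dtype=np.float32)
    for idx, e in enumerate(edge_maps):
        out[:, :, idx] = e  # assignment casts to float32 on the fly
    return out

rng = np.random.default_rng(0)
maps = [rng.random((8, 6)) for _ in range(5)]
same_result = np.array_equal(stack_edges(maps), fill_preallocated(maps, 8, 6))
```

The pre-allocated version avoids the intermediate list of converted arrays and the extra copy made by `np.stack()`.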

4. Optimized NMS Function

- Cached static filter arrays as module-level constants (`_NMS_FILTERS`) to avoid repeated allocation
- Used `np.putmask()` for thresholding instead of boolean indexing, reducing temporary array creation

5. Vectorized Scribble Processing
Replaced two separate boolean indexing operations with a single `np.where()` call, eliminating intermediate array creation.
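The equivalence of the two approaches can be checked on a small example. The threshold value and function names here are assumptions for illustration; the PR text does not show the actual scribble code.

```python
import numpy as np

def binarize_two_pass(edge_map, threshold=4):
    # Two boolean-index assignments, each building its own mask.
    out = edge_map.copy()
    out[out > threshold] = 255
    out[out < 255] = 0
    return out

def binarize_where(edge_map, threshold=4):
    # One mask, one pass: 255 above the threshold, 0 elsewhere.
    return np.where(edge_map > threshold, 255, 0).astype(edge_map.dtype)

rng = np.random.default_rng(2)
edge_map = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
same_binary = np.array_equal(binarize_two_pass(edge_map), binarize_where(edge_map))
```

The two functions agree as long as the threshold is below 255, because every value that is not raised to 255 in the first pass is zeroed in the second.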

Performance Impact by Test Case

The generated tests show improvements of roughly 3% to 18.5% across all scenarios, with larger images benefiting most (up to 17.7% for 512x512 images). The gains are particularly notable for:

- Large images, where memory allocation overhead is significant
- Calls with only the `scribble=True` flag set (about 6-11% improvement)
- Repeated calls on multiple images in succession (18.5% improvement)

These optimizations are especially valuable in image processing pipelines where edge detection may be called repeatedly on large images.

✅ Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 83 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime

import itertools

# Function to test

import cv2
import numpy as np

# imports

import pytest
import torch
from einops import rearrange
from invokeai.backend.image_util.hed import HEDEdgeDetector
from PIL import Image

# Minimal stub for ControlNetHED_Apache2 to allow testing

class ControlNetHED_Apache2(torch.nn.Module):
    """Stub model that mimics HED behavior for testing."""

    def forward(self, x):
        # x shape: (1, 3, H, W)
        # Return three edge maps with shape (1, 1, H, W)
        # For deterministic results, use simple patterns
        b, c, h, w = x.shape
        # Edge map 1: vertical gradient
        edge1 = torch.linspace(0, 1, w).repeat(h, 1).unsqueeze(0).unsqueeze(0)
        # Edge map 2: horizontal gradient
        edge2 = torch.linspace(0, 1, h).repeat(w, 1).t().unsqueeze(0).unsqueeze(0)
        # Edge map 3: constant
        edge3 = torch.ones(1, 1, h, w) * 0.5
        return [edge1, edge2, edge3]

from invokeai.backend.image_util.hed import HEDEdgeDetector

# Helper to create images

def make_image(width, height, color=(128, 128, 128)):
    """Create a solid color RGB PIL image of given size."""
    arr = np.full((height, width, 3), color, dtype=np.uint8)
    return Image.fromarray(arr)

def make_gradient_image(width, height):
    """Create a horizontal gradient RGB PIL image."""
    arr = np.zeros((height, width, 3), dtype=np.uint8)
    for x in range(width):
        arr[:, x, :] = int(255 * x / (width - 1))
    return Image.fromarray(arr)

def make_noise_image(width, height, seed=42):
    """Create a random noise RGB PIL image."""
    rng = np.random.RandomState(seed)
    arr = rng.randint(0, 256, (height, width, 3), dtype=np.uint8)
    return Image.fromarray(arr)

# Basic Test Cases

def test_run_on_solid_color_image():
    """Test edge detection on a solid color image (should be mostly flat)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(32, 32, color=(128, 128, 128))
    codeflash_output = detector.run(img); out = codeflash_output # 374μs -> 339μs (10.3% faster)
    arr = np.array(out)

def test_run_on_gradient_image():
    """Test edge detection on a horizontal gradient image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img); out = codeflash_output # 337μs -> 315μs (7.13% faster)
    arr = np.array(out)

def test_run_on_noise_image():
    """Test edge detection on a random noise image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_noise_image(32, 32)
    codeflash_output = detector.run(img); out = codeflash_output # 331μs -> 315μs (4.83% faster)
    arr = np.array(out)

def test_run_safe_flag_changes_output():
    """Test that safe=True changes the output."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img, safe=False); out1 = codeflash_output # 332μs -> 314μs (5.86% faster)
    codeflash_output = detector.run(img, safe=True); out2 = codeflash_output # 225μs -> 204μs (9.86% faster)
    arr1 = np.array(out1)
    arr2 = np.array(out2)

def test_run_scribble_flag_changes_output():
    """Test that scribble=True changes the output."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img, scribble=False); out1 = codeflash_output # 328μs -> 309μs (6.14% faster)
    codeflash_output = detector.run(img, scribble=True); out2 = codeflash_output # 429μs -> 387μs (10.9% faster)
    arr1 = np.array(out1)
    arr2 = np.array(out2)

def test_run_safe_and_scribble_flags_together():
    """Test that both flags together work and change output."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img, safe=True, scribble=True); out1 = codeflash_output # 543μs -> 516μs (5.17% faster)
    codeflash_output = detector.run(img, safe=False, scribble=False); out2 = codeflash_output # 294μs -> 271μs (8.57% faster)
    arr1 = np.array(out1)
    arr2 = np.array(out2)

# Edge Test Cases

def test_run_on_minimal_image():
    """Test on 1x1 image (edge case: smallest possible)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(1, 1, color=(0, 0, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 302μs -> 283μs (6.64% faster)
    arr = np.array(out)

def test_run_on_non_square_image():
    """Test on a non-square image (e.g., 16x32)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(16, 32, color=(255, 0, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 312μs -> 302μs (3.33% faster)
    arr = np.array(out)

def test_run_on_max_channel_value_image():
    """Test on image with max channel values (255,255,255)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(32, 32, color=(255, 255, 255))
    codeflash_output = detector.run(img); out = codeflash_output # 330μs -> 309μs (6.94% faster)
    arr = np.array(out)

def test_run_on_min_channel_value_image():
    """Test on image with min channel values (0,0,0)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(32, 32, color=(0, 0, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 329μs -> 311μs (5.81% faster)
    arr = np.array(out)

def test_run_on_alpha_channel_image():
    """Test on RGBA image (should ignore alpha)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((32, 32, 4), 128, dtype=np.uint8)
    img = Image.fromarray(arr, mode="RGBA")
    # Convert to RGB for testing (since original expects RGB)
    img = img.convert("RGB")
    codeflash_output = detector.run(img); out = codeflash_output # 327μs -> 311μs (5.14% faster)
    arr_out = np.array(out)

def test_run_on_large_but_small_memory_image():
    """Test on a larger image that stays well under a 100MB limit (512x512x3)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(512, 512, color=(64, 128, 192))
    codeflash_output = detector.run(img); out = codeflash_output # 7.20ms -> 6.17ms (16.7% faster)
    arr = np.array(out)
    # Should not crash or exceed memory

def test_run_scribble_flag_output_is_binary():
    """Test that scribble output is binary (0 or 255)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_gradient_image(32, 32)
    codeflash_output = detector.run(img, scribble=True); out = codeflash_output # 566μs -> 514μs (10.0% faster)
    arr = np.array(out)

def test_run_on_large_noise_image():
    """Test on a large random noise image (512x512)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_noise_image(512, 512)
    codeflash_output = detector.run(img); out = codeflash_output # 7.15ms -> 6.13ms (16.6% faster)
    arr = np.array(out)

def test_run_on_multiple_images():
    """Test running on multiple images in succession (stress test)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    for i in range(10):
        img = make_noise_image(128, 128, seed=i)
        codeflash_output = detector.run(img); out = codeflash_output # 6.55ms -> 5.52ms (18.5% faster)
        arr = np.array(out)

def test_run_on_maximum_allowed_size():
    """Test on image of maximum allowed size (e.g., 1024x32)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(1024, 32, color=(0, 255, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 1.14ms -> 1.00ms (14.1% faster)
    arr = np.array(out)
    # Should not crash

def test_run_on_minimum_allowed_size():
    """Test on image of minimum allowed size (1x1)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    img = make_image(1, 1, color=(255, 0, 0))
    codeflash_output = detector.run(img); out = codeflash_output # 320μs -> 290μs (10.6% faster)
    arr = np.array(out)

def test_run_on_various_sizes():
    """Test on a variety of sizes (16x16, 64x64, 256x256, 512x512)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    for sz in [16, 64, 256, 512]:
        img = make_gradient_image(sz, sz)
        codeflash_output = detector.run(img); out = codeflash_output # 9.65ms -> 8.20ms (17.7% faster)
        arr = np.array(out)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

#------------------------------------------------
import itertools

# Import HEDEdgeDetector from the provided code

import cv2
import numpy as np

# imports

import pytest  # used for our unit tests
import torch
from einops import rearrange
from invokeai.backend.image_util.hed import HEDEdgeDetector
from PIL import Image

# Minimal stub for ControlNetHED_Apache2 to simulate model behavior

class ControlNetHED_Apache2(torch.nn.Module):
    def forward(self, x):
        # Simulate output: return 3 tensors, each with shape [1, 1, h, w]
        # Output values are deterministic for testing
        b, c, h, w = x.shape
        # Each edge map is a constant array for simplicity
        edge1 = torch.ones((1, 1, h, w), dtype=torch.float32) * 0.5
        edge2 = torch.ones((1, 1, h, w), dtype=torch.float32) * 1.0
        edge3 = torch.ones((1, 1, h, w), dtype=torch.float32) * 2.0
        return [edge1, edge2, edge3]

from invokeai.backend.image_util.hed import HEDEdgeDetector

# -------------------- UNIT TESTS --------------------

# Helper for comparing PIL images

def images_equal(img1, img2):
    arr1 = np.array(img1)
    arr2 = np.array(img2)
    return np.array_equal(arr1, arr2)

# Basic test cases

def test_basic_grayscale_image():
    """Test with a simple grayscale image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((32, 32, 3), 128, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 338μs -> 305μs (10.8% faster)

def test_basic_color_image():
    """Test with a simple color image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.zeros((16, 16, 3), dtype=np.uint8)
    arr[:8, :8] = [255, 0, 0] # Red block
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 291μs -> 267μs (9.07% faster)
    # Output array should have values in [0, 255]
    out_arr = np.array(out)

def test_basic_safe_flag():
    """Test with safe=True."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((10, 10, 3), 200, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img, safe=True); out_safe = codeflash_output # 285μs -> 268μs (6.10% faster)
    codeflash_output = detector.run(img, safe=False); out_normal = codeflash_output # 174μs -> 153μs (13.7% faster)

def test_basic_scribble_flag():
    """Test with scribble=True."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((12, 12, 3), 50, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img, scribble=True); out_scribble = codeflash_output # 404μs -> 380μs (6.13% faster)
    codeflash_output = detector.run(img, scribble=False); out_normal = codeflash_output # 241μs -> 221μs (9.25% faster)

def test_basic_safe_and_scribble():
    """Test with both safe and scribble True."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((8, 8, 3), 100, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img, safe=True, scribble=True); out = codeflash_output # 383μs -> 361μs (6.08% faster)

# Edge test cases

def test_edge_minimal_image():
    """Test with minimal 1x1 image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.array([[[0, 0, 0]]], dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 273μs -> 260μs (4.97% faster)
    out_arr = np.array(out)

def test_edge_max_value_image():
    """Test with all pixels at max value."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((5, 5, 3), 255, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 269μs -> 256μs (5.10% faster)
    out_arr = np.array(out)

def test_edge_min_value_image():
    """Test with all pixels at min value."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.zeros((5, 5, 3), dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 273μs -> 263μs (4.03% faster)
    out_arr = np.array(out)

def test_edge_non_square_image():
    """Test with non-square image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((10, 20, 3), 123, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 278μs -> 259μs (7.28% faster)

def test_edge_invalid_dtype():
    """Test with float32 image array."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((10, 10, 3), 0.5, dtype=np.float32)
    img = Image.fromarray(arr.astype(np.uint8))
    codeflash_output = detector.run(img); out = codeflash_output # 295μs -> 278μs (6.04% faster)

def test_large_image():
    """Test with a large image (max 1000x1000, <100MB)."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((512, 512, 3), 128, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 6.91ms -> 5.87ms (17.7% faster)
    out_arr = np.array(out)

def test_large_image_safe_and_scribble():
    """Test with large image and both flags True."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((256, 512, 3), 200, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img, safe=True, scribble=True); out = codeflash_output # 5.43ms -> 5.20ms (4.42% faster)

def test_large_random_image():
    """Test with a large random image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    rng = np.random.default_rng(42)
    arr = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 3.23ms -> 2.77ms (16.7% faster)
    out_arr = np.array(out)

def test_large_edge_map_values():
    """Test that output edge map values are within expected bounds for large image."""
    model = ControlNetHED_Apache2()
    detector = HEDEdgeDetector(model)
    arr = np.full((999, 999, 3), 180, dtype=np.uint8)
    img = Image.fromarray(arr)
    codeflash_output = detector.run(img); out = codeflash_output # 28.1ms -> 25.3ms (11.0% faster)
    out_arr = np.array(out)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-HEDEdgeDetector.run-mhvo3ag6` and push.


codeflash-ai bot requested a review from mashraf-222 on November 12, 2025 07:17

codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) on Nov 12, 2025
