⚡️ Speed up method CogView4DenoiseInvocation._prepare_cfg_scale by 27%
#141
📄 27% (0.27x) speedup for `CogView4DenoiseInvocation._prepare_cfg_scale` in `invokeai/app/invocations/cogview4_denoise.py`
⏱️ Runtime: 39.1 microseconds → 30.8 microseconds (best of 157 runs)
📝 Explanation and details
The optimized code achieves a 27% speedup through three micro-optimizations:
1. Reduced Attribute Lookups: The original code reads `self.cfg_scale` multiple times (up to 3x in some code paths). The optimization stores it once as `cfg_scale = self.cfg_scale`, eliminating redundant attribute access overhead. This helps because attribute lookups in Python involve dictionary operations.
2. Faster Type Checking: Replaced `isinstance(self.cfg_scale, float)` and `isinstance(self.cfg_scale, list)` with `type(cfg_scale) is float` and `type(cfg_scale) is list`. The `type() is` pattern is faster because it performs exact type matching without inheritance checks, avoiding the more complex `isinstance()` machinery. Since this code only needs to distinguish between exactly `float` and `list`, the change is safe as long as no subclass instances (for example, a `float` subclass) reach the method; such values would pass `isinstance` but fail the exact match.
3. Early Returns: Changed the control flow to return directly from each branch instead of assigning to a variable and returning at the end. This avoids the intermediate variable assignment and the shared final return statement.
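The three changes can be sketched side by side. The method bodies below are reconstructed from the description above rather than copied from `cogview4_denoise.py`; `CfgHolder`, the method names, and the error message text are hypothetical, so treat this as an illustration under those assumptions.

```python
class CfgHolder:
    """Hypothetical stand-in for the invocation class; only cfg_scale matters here."""

    def __init__(self, cfg_scale):
        self.cfg_scale = cfg_scale

    def prepare_original(self, num_timesteps: int) -> list:
        # Assumed original shape: repeated attribute reads, isinstance, single exit.
        if isinstance(self.cfg_scale, float):
            cfg_scale = [self.cfg_scale] * num_timesteps
        elif isinstance(self.cfg_scale, list):
            assert len(self.cfg_scale) == num_timesteps
            cfg_scale = self.cfg_scale
        else:
            raise ValueError(f"Invalid CFG scale type: {type(self.cfg_scale)}")
        return cfg_scale

    def prepare_optimized(self, num_timesteps: int) -> list:
        cfg_scale = self.cfg_scale  # single attribute lookup
        if type(cfg_scale) is float:  # exact type match, no inheritance check
            return [cfg_scale] * num_timesteps  # early return
        if type(cfg_scale) is list:
            assert len(cfg_scale) == num_timesteps
            return cfg_scale
        raise ValueError(f"Invalid CFG scale type: {type(cfg_scale)}")
```

In this sketch the two methods agree for plain `float` and `list` inputs. They differ only for subclass instances: a `float` subclass is accepted by `prepare_original` but rejected by `prepare_optimized`.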
Performance Impact by Test Case:
The optimizations show improvements across all annotated test scenarios, with per-test gains ranging from 8.36% to 55.6% in the generated regression tests below.
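A rough way to observe the per-call effect locally is a `timeit` micro-benchmark. This is a minimal sketch, not the harness that produced the numbers in this report; `Holder`, `check_isinstance`, `check_exact_type`, and `best_ns_per_call` are hypothetical names, and absolute timings will vary with the machine and Python version.

```python
import timeit


class Holder:
    """Hypothetical object exposing a cfg_scale attribute."""

    def __init__(self, cfg_scale):
        self.cfg_scale = cfg_scale


def check_isinstance(holder) -> bool:
    # Up to two attribute reads and two isinstance() calls.
    return isinstance(holder.cfg_scale, float) or isinstance(holder.cfg_scale, list)


def check_exact_type(holder) -> bool:
    # One attribute read, one type() call, then identity comparisons.
    exact = type(holder.cfg_scale)
    return exact is float or exact is list


def best_ns_per_call(func, arg, repeat: int = 5, number: int = 100_000) -> float:
    """Best-of-`repeat` time per call, in nanoseconds."""
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e9


if __name__ == "__main__":
    holder = Holder([1.0, 2.0, 3.0])
    for func in (check_isinstance, check_exact_type):
        print(f"{func.__name__}: {best_ns_per_call(func, holder):.1f} ns per call")
```

Taking the minimum over several repeats mirrors the "best of N runs" style of reporting and reduces noise from other processes.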
Why This Matters:
These micro-optimizations are particularly valuable in image generation pipelines where `_prepare_cfg_scale` could be called frequently during the denoising process. The consistent speedups across all test cases indicate the optimizations don't introduce performance regressions in any scenario while providing meaningful gains in a potentially hot code path.
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
import pytest
from invokeai.app.invocations.cogview4_denoise import CogView4DenoiseInvocation

# unit tests

# --- BASIC TEST CASES ---


def test_float_cfg_scale_basic():
    # Test with float cfg_scale and small num_timesteps
    invocation = CogView4DenoiseInvocation(cfg_scale=2.5)
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.23μs -> 914ns (34.8% faster)


def test_list_cfg_scale_basic():
    # Test with list cfg_scale of correct length
    invocation = CogView4DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0])
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.11μs -> 759ns (45.7% faster)


def test_float_cfg_scale_one_step():
    # Test with float cfg_scale and one timestep
    invocation = CogView4DenoiseInvocation(cfg_scale=4.2)
    codeflash_output = invocation._prepare_cfg_scale(1); result = codeflash_output # 949ns -> 783ns (21.2% faster)


def test_list_cfg_scale_one_step():
    # Test with list cfg_scale and one timestep
    invocation = CogView4DenoiseInvocation(cfg_scale=[7.7])
    codeflash_output = invocation._prepare_cfg_scale(1); result = codeflash_output # 1.07μs -> 723ns (47.4% faster)

# --- EDGE TEST CASES ---


def test_list_cfg_scale_wrong_length_raises():
    # Test with list cfg_scale of wrong length (should raise AssertionError)
    invocation = CogView4DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(3) # 1.54μs -> 1.35μs (14.1% faster)


def test_cfg_scale_zero_timesteps_float():
    # Test with zero timesteps and float cfg_scale
    invocation = CogView4DenoiseInvocation(cfg_scale=1.5)
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 1.02μs -> 808ns (25.9% faster)


def test_cfg_scale_zero_timesteps_list():
    # Test with zero timesteps and empty list cfg_scale
    invocation = CogView4DenoiseInvocation(cfg_scale=[])
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 1.05μs -> 732ns (44.1% faster)


def test_cfg_scale_zero_timesteps_list_nonempty():
    # Test with zero timesteps and non-empty list cfg_scale (should raise AssertionError)
    invocation = CogView4DenoiseInvocation(cfg_scale=[1.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(0) # 1.52μs -> 1.24μs (22.7% faster)


def test_cfg_scale_invalid_type_int():
    # Test with invalid cfg_scale type (int)
    invocation = CogView4DenoiseInvocation(cfg_scale=5)
    with pytest.raises(ValueError):
        invocation._prepare_cfg_scale(2)


def test_float_cfg_scale_large_timesteps():
    # Test with float cfg_scale and large num_timesteps
    large_steps = 999
    invocation = CogView4DenoiseInvocation(cfg_scale=2.0)
    codeflash_output = invocation._prepare_cfg_scale(large_steps); result = codeflash_output # 1.55μs -> 1.26μs (22.6% faster)


def test_list_cfg_scale_large_timesteps():
    # Test with large list cfg_scale
    large_steps = 999
    cfg_list = [float(i) for i in range(large_steps)]
    invocation = CogView4DenoiseInvocation(cfg_scale=cfg_list)
    codeflash_output = invocation._prepare_cfg_scale(large_steps); result = codeflash_output # 1.22μs -> 870ns (40.5% faster)


def test_cfg_scale_list_length_one_large_timesteps():
    # Test with list of length one and large timesteps (should raise AssertionError)
    invocation = CogView4DenoiseInvocation(cfg_scale=[1.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(999) # 1.56μs -> 1.38μs (13.2% faster)


def test_cfg_scale_empty_list_large_timesteps():
    # Test with empty list and large timesteps (should raise AssertionError)
    invocation = CogView4DenoiseInvocation(cfg_scale=[])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(999) # 1.52μs -> 1.36μs (12.2% faster)

# --- MISCELLANEOUS CASES ---


def test_cfg_scale_list_with_boolean_element():
    # Test with list containing a boolean element
    invocation = CogView4DenoiseInvocation(cfg_scale=[2.0, True, 4.0])
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.30μs -> 882ns (47.6% faster)


def test_cfg_scale_list_with_negative_elements():
    # Test with list containing negative floats
    invocation = CogView4DenoiseInvocation(cfg_scale=[-1.0, -2.5, -3.3])
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.08μs -> 770ns (39.9% faster)


def test_cfg_scale_float_negative():
    # Test with negative float cfg_scale
    invocation = CogView4DenoiseInvocation(cfg_scale=-5.5)
    codeflash_output = invocation._prepare_cfg_scale(2); result = codeflash_output # 981ns -> 815ns (20.4% faster)


def test_cfg_scale_float_zero():
    # Test with float cfg_scale set to zero
    invocation = CogView4DenoiseInvocation(cfg_scale=0.0)
    codeflash_output = invocation._prepare_cfg_scale(5); result = codeflash_output # 956ns -> 764ns (25.1% faster)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from typing import Optional

# imports
import pytest
from invokeai.app.invocations.cogview4_denoise import CogView4DenoiseInvocation

# unit tests

# ---------- BASIC TEST CASES ----------


def test_float_cfg_scale_basic():
    """Test: cfg_scale is a float, num_timesteps is a typical positive integer."""
    invocation = CogView4DenoiseInvocation(cfg_scale=2.5)
    codeflash_output = invocation._prepare_cfg_scale(5); result = codeflash_output # 1.40μs -> 897ns (55.6% faster)


def test_list_cfg_scale_basic():
    """Test: cfg_scale is a list of floats, length matches num_timesteps."""
    invocation = CogView4DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0, 4.0])
    codeflash_output = invocation._prepare_cfg_scale(4); result = codeflash_output # 1.14μs -> 747ns (52.1% faster)


def test_float_cfg_scale_one_timestep():
    """Test: cfg_scale is a float, num_timesteps is 1."""
    invocation = CogView4DenoiseInvocation(cfg_scale=7.0)
    codeflash_output = invocation._prepare_cfg_scale(1); result = codeflash_output # 1.00μs -> 777ns (28.8% faster)


def test_list_cfg_scale_one_timestep():
    """Test: cfg_scale is a list with one element, num_timesteps is 1."""
    invocation = CogView4DenoiseInvocation(cfg_scale=[8.5])
    codeflash_output = invocation._prepare_cfg_scale(1); result = codeflash_output # 1.02μs -> 763ns (33.7% faster)

# ---------- EDGE TEST CASES ----------


def test_list_cfg_scale_length_mismatch_shorter():
    """Test: cfg_scale is a list shorter than num_timesteps (should raise AssertionError)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(3) # 1.50μs -> 1.39μs (8.36% faster)


def test_list_cfg_scale_length_mismatch_longer():
    """Test: cfg_scale is a list longer than num_timesteps (should raise AssertionError)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=[1.0, 2.0, 3.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(2) # 1.42μs -> 1.23μs (15.7% faster)


def test_cfg_scale_invalid_type_int():
    """Test: cfg_scale is an int (invalid type, should raise ValueError)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=5)
    with pytest.raises(ValueError):
        invocation._prepare_cfg_scale(3)


def test_zero_timesteps_with_float_cfg_scale():
    """Test: num_timesteps is zero, cfg_scale is float (should return empty list)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=2.5)
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 1.20μs -> 914ns (31.7% faster)


def test_zero_timesteps_with_list_cfg_scale():
    """Test: num_timesteps is zero, cfg_scale is an empty list."""
    invocation = CogView4DenoiseInvocation(cfg_scale=[])
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 1.13μs -> 765ns (47.8% faster)


def test_negative_timesteps_float_cfg_scale():
    """Test: num_timesteps is negative, cfg_scale is float (should return list of negative length, which is empty)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=1.0)
    codeflash_output = invocation._prepare_cfg_scale(-2); result = codeflash_output # 996ns -> 882ns (12.9% faster)


def test_negative_timesteps_list_cfg_scale():
    """Test: num_timesteps is negative, cfg_scale is a list (should raise AssertionError)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=[1.0, 2.0])
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(-2) # 1.57μs -> 1.33μs (18.2% faster)

# ---------- LARGE SCALE TEST CASES ----------


def test_large_float_cfg_scale():
    """Test: cfg_scale is float, num_timesteps is large (1000)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=3.3)
    codeflash_output = invocation._prepare_cfg_scale(1000); result = codeflash_output # 1.45μs -> 1.24μs (17.7% faster)


def test_large_list_cfg_scale():
    """Test: cfg_scale is a large list (1000 elements), num_timesteps matches."""
    large_list = [float(i) for i in range(1000)]
    invocation = CogView4DenoiseInvocation(cfg_scale=large_list)
    codeflash_output = invocation._prepare_cfg_scale(1000); result = codeflash_output # 1.10μs -> 811ns (36.3% faster)


def test_large_list_cfg_scale_length_mismatch():
    """Test: cfg_scale is a large list (1000 elements), num_timesteps is less (999), should raise AssertionError."""
    large_list = [float(i) for i in range(1000)]
    invocation = CogView4DenoiseInvocation(cfg_scale=large_list)
    with pytest.raises(AssertionError):
        invocation._prepare_cfg_scale(999) # 1.59μs -> 1.38μs (15.5% faster)


def test_large_empty_list_cfg_scale():
    """Test: cfg_scale is an empty list, num_timesteps is 0."""
    invocation = CogView4DenoiseInvocation(cfg_scale=[])
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 1.04μs -> 823ns (26.7% faster)


def test_large_float_cfg_scale_zero_timesteps():
    """Test: cfg_scale is float, num_timesteps is 0 (should return empty list)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=5.5)
    codeflash_output = invocation._prepare_cfg_scale(0); result = codeflash_output # 898ns -> 751ns (19.6% faster)

# ---------- ADDITIONAL EDGE CASES ----------


def test_cfg_scale_list_of_ints():
    """Test: cfg_scale is a list of ints (should not raise, but returns list as-is)."""
    invocation = CogView4DenoiseInvocation(cfg_scale=[1, 2, 3])
    codeflash_output = invocation._prepare_cfg_scale(3); result = codeflash_output # 1.01μs -> 709ns (42.3% faster)


def test_cfg_scale_tuple_type():
    """Test: cfg_scale is a tuple, should raise ValueError."""
    invocation = CogView4DenoiseInvocation(cfg_scale=(1.0, 2.0))
    with pytest.raises(ValueError):
        invocation._prepare_cfg_scale(2)

# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
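
The note about `codeflash_output` describes comparing the original and optimized outputs. A comparable check can be approximated by hand, as sketched below. This is not codeflash's mechanism: `outcome` and `same_behavior` are hypothetical helpers, and `original` and `optimized` are simplified stand-in functions assumed from the explanation, not the real method.

```python
from typing import Any, Callable, Iterable, Tuple


def outcome(func: Callable[..., Any], *args: Any) -> Tuple[str, Any]:
    """Run func and record its return value, or the type of exception it raised."""
    try:
        return ("ok", func(*args))
    except Exception as exc:
        return ("raised", type(exc))


def same_behavior(a: Callable[..., Any], b: Callable[..., Any], cases: Iterable[tuple]) -> bool:
    """True if both callables return equal values or raise the same exception type on every case."""
    return all(outcome(a, *case) == outcome(b, *case) for case in cases)


def original(cfg_scale, num_timesteps):
    # Assumed isinstance-based variant.
    if isinstance(cfg_scale, float):
        return [cfg_scale] * num_timesteps
    if isinstance(cfg_scale, list):
        assert len(cfg_scale) == num_timesteps
        return cfg_scale
    raise ValueError(f"Invalid CFG scale type: {type(cfg_scale)}")


def optimized(cfg_scale, num_timesteps):
    # Assumed exact-type variant.
    if type(cfg_scale) is float:
        return [cfg_scale] * num_timesteps
    if type(cfg_scale) is list:
        assert len(cfg_scale) == num_timesteps
        return cfg_scale
    raise ValueError(f"Invalid CFG scale type: {type(cfg_scale)}")


# Inputs modeled on the generated tests: valid values, length mismatches, wrong types.
CASES = [(2.5, 3), ([1.0, 2.0], 2), ([1.0], 3), ([], 0), (1.0, -2), (5, 2), ((1.0, 2.0), 2)]

if __name__ == "__main__":
    print(same_behavior(original, optimized, CASES))
```

Comparing exception types as well as return values matters here, because several of the generated tests exercise the `AssertionError` and `ValueError` paths.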
To edit these changes, run `git checkout codeflash/optimize-CogView4DenoiseInvocation._prepare_cfg_scale-mhvlzzz9` and push.