@codeflash-ai codeflash-ai bot commented Oct 21, 2025

📄 670% (6.70x) speedup for convert_node_to_data_point in cognee/modules/graph/utils/convert_node_to_data_point.py

⏱️ Runtime: 209 microseconds → 27.2 microseconds (best of 63 runs)

📝 Explanation and details

The optimization introduces a module-level cache (_SUBCLASS_CACHE) that eliminates the expensive repeated traversal of the class hierarchy.

Key Performance Problem: The original code called get_all_subclasses(cls) on every invocation of find_subclass_by_name, which recursively walks the entire subclass tree. The line profiler shows this accounts for 78.8% of the execution time (1,155 hits calling get_all_subclasses).
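For illustration, the uncached pattern described above might look roughly like the following. This is a minimal sketch rather than the cognee source: the bodies of `get_all_subclasses` and `find_subclass_by_name` are assumed from the description, and `Base`, `Child`, and `GrandChild` are hypothetical classes.

```python
def get_all_subclasses(cls):
    """Recursively collect every direct and indirect subclass of cls."""
    subclasses = []
    for subclass in cls.__subclasses__():
        subclasses.append(subclass)
        subclasses.extend(get_all_subclasses(subclass))
    return subclasses


def find_subclass_by_name(cls, name):
    """Linear search that re-walks the whole hierarchy on every call."""
    for subclass in get_all_subclasses(cls):
        if subclass.__name__ == name:
            return subclass
    return None


# Hypothetical hierarchy used only for this example.
class Base:
    pass


class Child(Base):
    pass


class GrandChild(Child):
    pass


found = find_subclass_by_name(Base, "GrandChild")
missing = find_subclass_by_name(Base, "NoSuchClass")
```

A lookup that misses is the worst case here, because every subclass is visited before `None` is returned.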

Optimization Strategy:

  • Lazy caching: Build a dictionary mapping class names to subclasses only once per base class
  • O(1) lookup: Replace linear search through subclasses with dictionary lookup using cache.get(name, None)

Performance Impact:

  • Original: 1,155 calls to iterate through subclasses (78.8% of time)
  • Optimized: Only 1 call to get_all_subclasses to build the cache, then O(1) dictionary lookups

Test Case Performance: The optimization is most effective for unknown or invalid type names (653-800% speedup in the generated tests below), where the original code traverses the entire subclass hierarchy before returning None. Valid type lookups should also benefit from the O(1) dictionary access pattern.

This caching approach scales particularly well when the same base class is used repeatedly, as subsequent calls avoid the expensive recursive traversal entirely.
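The scaling claim, and the trade-off that comes with it, can be made concrete by counting traversals. The sketch below is illustrative only: `find_uncached`, `find_cached`, the `calls` counter, and the classes are hypothetical names, and the loop count of 1000 is arbitrary.

```python
from collections import Counter

calls = Counter()  # counts top-level hierarchy traversals per strategy
_cache = {}


def get_all_subclasses(cls):
    result = []
    for sub in cls.__subclasses__():
        result.append(sub)
        result.extend(get_all_subclasses(sub))
    return result


def find_uncached(cls, name):
    calls["uncached"] += 1  # one full traversal per lookup
    for sub in get_all_subclasses(cls):
        if sub.__name__ == name:
            return sub
    return None


def find_cached(cls, name):
    if cls not in _cache:
        calls["cached"] += 1  # traversal happens only when the cache is built
        _cache[cls] = {sub.__name__: sub for sub in get_all_subclasses(cls)}
    return _cache[cls].get(name)


class Base:
    pass


class A(Base):
    pass


class B(A):
    pass


for _ in range(1000):
    find_uncached(Base, "Missing")
    find_cached(Base, "Missing")

uncached_traversals = calls["uncached"]
cached_traversals = calls["cached"]


# A subclass defined after the cache was built.
class Late(Base):
    pass


late_uncached = find_uncached(Base, "Late")
late_cached = find_cached(Base, "Late")
```

After the loop the uncached path has traversed the hierarchy 1000 times and the cached path once. The final two lookups show the cost of that saving: the uncached search finds `Late`, while the cached one returns `None` because its mapping predates the class.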

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 15 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
from cognee.modules.graph.utils.convert_node_to_data_point import \
    convert_node_to_data_point

# --- Begin: Minimal DataPoint hierarchy for testing purposes ---

class DataPoint:
    """Base class for all data points."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

class NumericDataPoint(DataPoint):
    pass

class TextDataPoint(DataPoint):
    pass

class ComplexDataPoint(DataPoint):
    def __init__(self, value, meta=None, **kwargs):
        super().__init__(value=value, meta=meta, **kwargs)

class DeepDataPoint(NumericDataPoint):
    pass

# --- End: Minimal DataPoint hierarchy for testing purposes ---

# --- Begin: Unit tests ---

# ----------------------- BASIC TEST CASES -----------------------





def test_missing_type_key():
    """Test when 'type' key is missing in node_data."""
    node = {"value": 123}
    with pytest.raises(KeyError):
        convert_node_to_data_point(node) # 1.05μs -> 988ns (6.28% faster)

def test_unknown_type():
    """Test when 'type' does not map to any subclass."""
    node = {"type": "NonExistentDataPoint", "value": 55}
    # find_subclass_by_name returns None, so subclass is None.
    # Attempting to call None(**node_data) should raise TypeError.
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 19.8μs -> 2.26μs (775% faster)

def test_type_is_base_class():
    """Test when 'type' is the base class name (should not find subclass)."""
    node = {"type": "DataPoint", "foo": "bar"}
    # DataPoint is not a subclass of itself, so find_subclass_by_name returns None.
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 14.1μs -> 1.74μs (712% faster)


def test_missing_required_init_arg():
    """Test node_data missing required __init__ argument for subclass."""
    node = {"type": "WeirdDataPoint", "value": 999}
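    # WeirdDataPoint (expecting a 'strange' argument) is not defined in this listing;
    # as written, the lookup returns None and calling None(**node_data) raises TypeError.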
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 20.1μs -> 2.40μs (739% faster)

def test_type_case_sensitivity():
    """Test that type matching is case-sensitive."""
    node = {"type": "numericdatapoint", "value": 123}
    # Should not match NumericDataPoint due to case sensitivity.
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 14.4μs -> 1.88μs (667% faster)

def test_type_is_empty_string():
    """Test when 'type' is an empty string."""
    node = {"type": "", "value": 0}
    # No subclass named ''.
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 13.8μs -> 1.70μs (712% faster)

def test_type_is_none():
    """Test when 'type' is None."""
    node = {"type": None, "value": 0}
    # No subclass named None.
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 13.7μs -> 1.82μs (653% faster)

def test_type_is_int():
    """Test when 'type' is an integer (should fail)."""
    node = {"type": 123, "value": 0}
    # No subclass named 123.
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 13.7μs -> 1.79μs (667% faster)






#------------------------------------------------
import pytest
from cognee.modules.graph.utils.convert_node_to_data_point import \
    convert_node_to_data_point

# --- Function to test and minimal dependencies ---

# Minimal DataPoint class hierarchy for testing
class DataPoint:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

class TemperatureDataPoint(DataPoint):
    pass

class HumidityDataPoint(DataPoint):
    pass

class NestedDataPoint(DataPoint):
    pass

class LargeDataPoint(DataPoint):
    pass

# --- Unit tests ---

# --------------------- BASIC TEST CASES ---------------------





def test_edge_type_case_sensitivity():
    # Test that type name is case-sensitive
    node = {"type": "temperaturedatapoint", "value": 23}
    with pytest.raises(TypeError):
        # NoneType is not callable, so this should fail
        convert_node_to_data_point(node) # 20.5μs -> 2.31μs (788% faster)

def test_edge_unknown_type():
    # Test with a type that does not exist
    node = {"type": "NonExistentDataPoint", "value": 1}
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 14.2μs -> 1.85μs (668% faster)

def test_edge_type_is_none():
    # Test with type set to None
    node = {"type": None, "value": 1}
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 13.9μs -> 1.77μs (684% faster)

def test_edge_type_is_empty_string():
    # Test with type as empty string
    node = {"type": "", "value": 1}
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 13.6μs -> 1.73μs (685% faster)

def test_edge_missing_type_key():
    # Test with missing 'type' key
    node = {"value": 1}
    with pytest.raises(KeyError):
        convert_node_to_data_point(node) # 902ns -> 817ns (10.4% faster)

def test_edge_type_is_not_string():
    # Test with type as a non-string (e.g., int)
    node = {"type": 123, "value": 1}
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 15.2μs -> 1.88μs (707% faster)








def test_large_string_type_field():
    # Test with a very large string in the 'type' field (should fail)
    node = {"type": "T" * 1000, "value": 1}
    with pytest.raises(TypeError):
        convert_node_to_data_point(node) # 20.1μs -> 2.23μs (800% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
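
Most of the tests above expect `TypeError` for an unknown type and `KeyError` for a missing `type` key. A minimal sketch of why, assuming the function under test is essentially a name lookup followed by a constructor call; `registry`, `Point`, `convert`, and `error_name` are hypothetical stand-ins, not cognee code:

```python
class Point:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


registry = {"Point": Point}  # stand-in for the subclass lookup


def convert(node_data):
    subclass = registry.get(node_data["type"])  # KeyError if "type" is absent
    return subclass(**node_data)                # TypeError if subclass is None


def error_name(node_data):
    """Return the name of the exception raised by convert, or None."""
    try:
        convert(node_data)
    except Exception as exc:
        return type(exc).__name__
    return None


missing_key = error_name({"value": 1})
unknown_type = error_name({"type": "Nope", "value": 1})
valid_type = error_name({"type": "Point", "value": 1})
```

Here `missing_key` is `"KeyError"`, `unknown_type` is `"TypeError"` (calling `None`), and `valid_type` is `None` because construction succeeds.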

To edit these changes, run git checkout codeflash/optimize-convert_node_to_data_point-mh14i7e1, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 21, 2025 22:15
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 21, 2025