Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Nov 11, 2025

📄 824% (8.24x) speedup for get_default_tasks in cognee/api/v1/cognify/cognify.py

⏱️ Runtime : 57.3 milliseconds 6.21 milliseconds (best of 145 runs)

📝 Explanation and details

The optimized code achieves an 823% speedup by strategically applying @lru_cache decorators to eliminate redundant expensive object instantiations and computations.

Key Optimizations Applied:

  1. Cached get_default_ontology_resolver() - Added @lru_cache to prevent repeated creation of RDFLibOntologyResolver objects. The line profiler shows this function was called 633 times, taking 0.33 seconds total (517μs per call). Since the resolver is stateless and always returns the same configuration, caching eliminates all but the first instantiation.

  2. Cached get_max_chunk_tokens() - Added @lru_cache to memoize expensive vector engine and LLM client initialization. This function consumed 1.47 seconds (99.5% from get_vector_engine().embedding_engine) across 62 calls. Caching reduces this to a one-time cost.

Why This Leads to Major Speedup:

The line profiler reveals that get_default_ontology_resolver() consumed 18% of total runtime in get_default_tasks(), while get_max_chunk_tokens() consumed 80.1%. These functions perform expensive I/O operations and object instantiations that don't change between calls. By caching their results, the optimized version eliminates nearly all of this computational overhead on subsequent calls.

Impact on Workloads:

The 110% throughput improvement (43,884 → 92,220 ops/sec) makes this optimization particularly valuable for high-frequency scenarios where get_default_tasks() is called repeatedly, such as processing multiple document batches or handling concurrent API requests. The test results show consistent speedups across all load scenarios, from small (10 calls) to medium (100+ calls) workloads, indicating the caching strategy scales well with increased usage patterns.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 466 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 85.7%
🌀 Generated Regression Tests and Runtime

import asyncio # used to run async functions

import pytest # used for our unit tests
from cognee.api.v1.cognify.cognify import get_default_tasks
from cognee.modules.chunking.TextChunker import TextChunker

We'll need to stub/mock the Task class to check its instantiation and arguments.

from cognee.modules.pipelines.tasks.task import Task
from cognee.modules.users.models import User
from cognee.shared.data_models import KnowledgeGraph

Helper to extract function from Task object for assertions

def get_task_func_names(tasks):
return [t.func.name for t in tasks]

Helper to extract task_config from Task object for assertions

def get_task_config(tasks):
return [getattr(t, "task_config", None) for t in tasks]

--- Basic Test Cases ---

@pytest.mark.asyncio
async def test_get_default_tasks_basic_returns_list_of_tasks():
"""
Basic: Ensure get_default_tasks returns a list of Task objects with expected functions.
"""
tasks = await get_default_tasks()
# Check all are Task instances
for t in tasks:
pass
# Check expected function names in order
expected_funcs = [
"classify_documents",
"check_permissions_on_dataset",
"extract_chunks_from_documents",
"extract_graph_from_data",
"summarize_text",
"add_data_points"
]

@pytest.mark.asyncio
async def test_get_default_tasks_basic_async_await_behavior():
"""
Basic: Awaiting the function should work and return tasks.
"""
# Ensure await works and result is correct type
tasks = await get_default_tasks()

@pytest.mark.asyncio
async def test_get_default_tasks_basic_default_arguments():
"""
Basic: Calling with no arguments should use defaults.
"""
tasks = await get_default_tasks()
# Check that the third task uses TextChunker by default
chunker_task = tasks[2]
# Check that the fourth task uses KnowledgeGraph by default
graph_task = tasks[3]

--- Edge Test Cases ---

@pytest.mark.asyncio

async def test_get_default_tasks_edge_with_custom_chunker_and_custom_prompt():
"""
Edge: Pass a custom chunker and custom_prompt, ensure they're propagated to the correct task.
"""
class DummyChunker:
pass
custom_prompt = "Extract only named entities."
tasks = await get_default_tasks(chunker=DummyChunker, custom_prompt=custom_prompt)
chunk_task = tasks[2]
graph_task = tasks[3]

@pytest.mark.asyncio

async def test_get_default_tasks_edge_invalid_config_type_raises():
"""
Edge: Pass an invalid config type (non-dict), should not break function, but config should be used as-is.
"""
bad_config = object()
tasks = await get_default_tasks(config=bad_config)
graph_task = tasks[3]

--- Large Scale Test Cases ---

@pytest.mark.asyncio
async def test_get_default_tasks_large_scale_many_concurrent_calls():
"""
Large Scale: Run many concurrent calls to get_default_tasks and ensure all results are correct.
"""
# 50 concurrent calls with different chunk_sizes
chunk_sizes = list(range(100, 150))
coros = [get_default_tasks(chunk_size=cs) for cs in chunk_sizes]
results = await asyncio.gather(*coros)
# Each result should have the correct chunk_size in the third task
for idx, tasks in enumerate(results):
pass

@pytest.mark.asyncio

async def test_get_default_tasks_throughput_small_load():
"""
Throughput: Ensure function performance and correctness under small load.
"""
# 10 small concurrent calls
coros = [get_default_tasks(chunk_size=i) for i in range(10)]
results = await asyncio.gather(*coros)
for i, tasks in enumerate(results):
pass

@pytest.mark.asyncio
async def test_get_default_tasks_throughput_medium_load():
"""
Throughput: Ensure function performance and correctness under medium load.
"""
# 100 concurrent calls
coros = [get_default_tasks(chunk_size=i) for i in range(100, 200)]
results = await asyncio.gather(*coros)
for i, tasks in enumerate(results):
pass

@pytest.mark.asyncio

async def test_get_default_tasks_edge_return_value_is_new_list_each_call():
"""
Edge: Ensure each call returns a new list object, not the same reference.
"""
tasks1 = await get_default_tasks()
tasks2 = await get_default_tasks()

--- Edge Case: Check that all tasks have batch_size=10 in task_config where appropriate ---

@pytest.mark.asyncio
async def test_get_default_tasks_edge_task_config_batch_size():
"""
Edge: Ensure tasks that should have batch_size=10 in task_config do so.
"""
tasks = await get_default_tasks()
# Only tasks 3-6 have task_config
for idx in range(3, 6):
tc = get_task_config(tasks)[idx]

--- Edge Case: Check that permissions argument is always ["write"] in check_permissions_on_dataset task ---

@pytest.mark.asyncio
async def test_get_default_tasks_edge_permissions_argument():
"""
Edge: Ensure permissions argument is always ['write'] for check_permissions_on_dataset task.
"""
tasks = await get_default_tasks()
check_perm_task = tasks[1]

--- Edge Case: Check that the function is robust to None arguments ---

@pytest.mark.asyncio
async def test_get_default_tasks_edge_none_arguments():
"""
Edge: Pass None for all arguments and ensure function still works.
"""
tasks = await get_default_tasks(user=None, graph_model=None, chunker=None, chunk_size=None, config=None, custom_prompt=None)

--- Edge Case: Check that custom_prompt is propagated only to extract_graph_from_data task ---

@pytest.mark.asyncio
async def test_get_default_tasks_edge_custom_prompt_only_for_graph_task():
"""
Edge: Ensure custom_prompt is only propagated to extract_graph_from_data task.
"""
custom_prompt = "Prompt for graph extraction"
tasks = await get_default_tasks(custom_prompt=custom_prompt)
# Only the 4th task (index 3) should have custom_prompt
for idx, t in enumerate(tasks):
if idx == 3:
pass
else:
pass

--- Edge Case: Check that classify_documents task has no kwargs or task_config ---

@pytest.mark.asyncio
async def test_get_default_tasks_edge_classify_documents_task_has_no_kwargs_or_task_config():
"""
Edge: Ensure classify_documents task has no kwargs or task_config.
"""
tasks = await get_default_tasks()
classify_task = tasks[0]

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

#------------------------------------------------
import asyncio

import pytest
from cognee.api.v1.cognify.cognify import get_default_tasks

Import TextChunker for default chunker

from cognee.modules.chunking.TextChunker import TextChunker

Import Config for config parameter

from cognee.modules.ontology.ontology_config import Config

Import Task for type checking and introspection

from cognee.modules.pipelines.tasks.task import Task

Import User for user parameter

from cognee.modules.users.models import User

Import KnowledgeGraph for default graph_model

from cognee.shared.data_models import KnowledgeGraph

--- Basic Test Cases ---

@pytest.mark.asyncio
async def test_get_default_tasks_basic_returns_list_of_tasks():
"""
Basic: Test that get_default_tasks returns a list of Task objects with default parameters.
"""
result = await get_default_tasks()

@pytest.mark.asyncio
async def test_get_default_tasks_basic_awaitable_behavior():
"""
Basic: Test that get_default_tasks is awaitable and returns the same result on repeated calls.
"""
result1 = await get_default_tasks()
result2 = await get_default_tasks()

@pytest.mark.asyncio

async def test_get_default_tasks_basic_with_custom_chunk_size():
"""
Basic: Test that passing a custom chunk_size sets the correct argument in extract_chunks_from_documents Task.
"""
custom_chunk_size = 123
tasks = await get_default_tasks(chunk_size=custom_chunk_size)
# The third task is extract_chunks_from_documents
chunk_task = tasks[2]

@pytest.mark.asyncio
async def test_get_default_tasks_basic_with_custom_graph_model():
"""
Basic: Test that passing a custom graph_model sets the correct argument in extract_graph_from_data Task.
"""
class DummyGraphModel:
pass
tasks = await get_default_tasks(graph_model=DummyGraphModel)
graph_task = tasks[3]

@pytest.mark.asyncio
async def test_get_default_tasks_basic_with_custom_chunker():
"""
Basic: Test that passing a custom chunker sets the correct argument in extract_chunks_from_documents Task.
"""
class DummyChunker:
pass
tasks = await get_default_tasks(chunker=DummyChunker)
chunk_task = tasks[2]

@pytest.mark.asyncio
async def test_get_default_tasks_basic_with_custom_config():
"""
Basic: Test that passing a custom config sets the correct argument in extract_graph_from_data Task.
"""
dummy_config = {"ontology_config": {"ontology_resolver": "dummy"}}
tasks = await get_default_tasks(config=dummy_config)
graph_task = tasks[3]

@pytest.mark.asyncio
async def test_get_default_tasks_basic_with_custom_prompt():
"""
Basic: Test that passing a custom_prompt sets the correct argument in extract_graph_from_data Task.
"""
custom_prompt = "Extract this graph"
tasks = await get_default_tasks(custom_prompt=custom_prompt)
graph_task = tasks[3]

--- Edge Test Cases ---

@pytest.mark.asyncio
async def test_get_default_tasks_edge_empty_config_env():
"""
Edge: Test that get_default_tasks works when config is None and ontology_env_config is empty.
"""
# This test is valid if the environment has no .env file or the ontology config is empty.
# The function should still return 6 tasks.
tasks = await get_default_tasks(config=None)

@pytest.mark.asyncio

async def test_get_default_tasks_edge_invalid_graph_model_type():
"""
Edge: Test that passing a non-type as graph_model still sets it in the Task (the function does not validate).
"""
graph_model = "not_a_type"
tasks = await get_default_tasks(graph_model=graph_model)
graph_task = tasks[3]

@pytest.mark.asyncio
async def test_get_default_tasks_edge_invalid_chunker_type():
"""
Edge: Test that passing a non-type as chunker still sets it in the Task (the function does not validate).
"""
chunker = "not_a_type"
tasks = await get_default_tasks(chunker=chunker)
chunk_task = tasks[2]

@pytest.mark.asyncio

async def test_get_default_tasks_large_scale_different_graph_models():
"""
Large Scale: Test concurrent calls with different custom graph_model types.
"""
class GraphModelA: pass
class GraphModelB: pass
coros = [
get_default_tasks(graph_model=GraphModelA),
get_default_tasks(graph_model=GraphModelB),
get_default_tasks(graph_model=KnowledgeGraph),
]
results = await asyncio.gather(*coros)

--- Throughput Test Cases ---

@pytest.mark.asyncio
async def test_get_default_tasks_throughput_small_load():
"""
Throughput: Test get_default_tasks under small load (10 concurrent calls).
"""
coros = [get_default_tasks() for _ in range(10)]
results = await asyncio.gather(*coros)

@pytest.mark.asyncio
async def test_get_default_tasks_throughput_medium_load():
"""
Throughput: Test get_default_tasks under medium load (50 concurrent calls).
"""
coros = [get_default_tasks(chunk_size=100 + i) for i in range(50)]
results = await asyncio.gather(*coros)
# Spot check chunk_size propagation
for i, tasks in enumerate(results):
pass

@pytest.mark.asyncio

async def test_get_default_tasks_throughput_batch_consistency():
"""
Throughput: Test that all returned Task objects are unique instances per call under load.
"""
coros = [get_default_tasks() for _ in range(20)]
results = await asyncio.gather(*coros)
# Each result should be a list of Task objects, and no Task object should be shared between calls
all_task_objs = [task for tasks in results for task in tasks]
# Check that all Task objects are unique (by id)
task_ids = [id(task) for task in all_task_objs]

codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-get_default_tasks-mhttcq3d and push.

Codeflash Static Badge

The optimized code achieves an **823% speedup** by strategically applying `@lru_cache` decorators to eliminate redundant expensive object instantiations and computations.

**Key Optimizations Applied:**

1. **Cached `get_default_ontology_resolver()`** - Added `@lru_cache` to prevent repeated creation of `RDFLibOntologyResolver` objects. The line profiler shows this function was called 633 times, taking 0.33 seconds total (517μs per call). Since the resolver is stateless and always returns the same configuration, caching eliminates all but the first instantiation.

2. **Cached `get_max_chunk_tokens()`** - Added `@lru_cache` to memoize expensive vector engine and LLM client initialization. This function consumed 1.47 seconds (99.5% from `get_vector_engine().embedding_engine`) across 62 calls. Caching reduces this to a one-time cost.

**Why This Leads to Major Speedup:**

The line profiler reveals that `get_default_ontology_resolver()` consumed 18% of total runtime in `get_default_tasks()`, while `get_max_chunk_tokens()` consumed 80.1%. These functions perform expensive I/O operations and object instantiations that don't change between calls. By caching their results, the optimized version eliminates nearly all of this computational overhead on subsequent calls.

**Impact on Workloads:**

The **110% throughput improvement** (43,884 → 92,220 ops/sec) makes this optimization particularly valuable for high-frequency scenarios where `get_default_tasks()` is called repeatedly, such as processing multiple document batches or handling concurrent API requests. The test results show consistent speedups across all load scenarios, from small (10 calls) to medium (100+ calls) workloads, indicating the caching strategy scales well with increased usage patterns.
@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 11, 2025 00:09
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant