Conversation


@codeflash-ai codeflash-ai bot commented Oct 21, 2025

📄 359% (3.59x) speedup for get_source_code_files in cognee/tasks/repo_processor/get_repo_file_dependencies.py

⏱️ Runtime: 275 milliseconds → 60.0 milliseconds (best of 67 runs)

📝 Explanation and details

The optimized code achieves a 358% speedup (275ms → 60ms) and 148% throughput improvement (3645 → 9045 ops/sec) through two key optimizations:

1. Pre-computed Extension-to-Language Lookup Map

  • Replaces nested loop language detection with O(1) dictionary lookup using ext_to_lang[_ext]
  • Eliminates the expensive _get_language_from_extension() function calls that were consuming 5.7% of total runtime
  • Line profiler shows this reduces per-file language detection from ~4894ns to ~303ns per hit
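
A minimal sketch of that reverse-lookup pattern follows; the names language_config, ext_to_lang, and detect_language, and the sample extension map, are illustrative assumptions rather than the exact code in this PR:

import os

# Example language configuration; the real function accepts one as an argument.
language_config = {
    "python": [".py"],
    "javascript": [".js"],
    "typescript": [".ts"],
}

# Built once, before the repository walk begins.
ext_to_lang = {ext: lang for lang, exts in language_config.items() for ext in exts}

def detect_language(file_name: str):
    # O(1) dictionary lookup instead of scanning every language's extension list.
    _, ext = os.path.splitext(file_name)
    return ext_to_lang.get(ext)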

2. Directory-Level Exclusion Filtering

  • Moves directory exclusion checks (EXCLUDED_DIRS & root_parts and excluded_paths matching) outside the file loop
  • Now checks once per directory rather than once per file, dramatically reducing redundant Path.resolve() calls
  • The original code called Path(root).resolve() for every single file (66.1% of runtime); it is now called only once per directory
  • Processes excluded paths set once upfront instead of recreating it per file
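
Here is a rough sketch of hoisting those checks out of the per-file loop, assuming an os.walk-based traversal; iter_candidate_files is a hypothetical helper, and the excluded_paths handling is simplified relative to the real function:

import os
from pathlib import Path

EXCLUDED_DIRS = {".venv", "venv", "node_modules", "dist", "build", ".git", "tests", "test"}

def iter_candidate_files(repo_path, excluded_paths=None):
    # Resolve the excluded paths once, up front, instead of per file.
    excluded = {str(Path(p).resolve()) for p in (excluded_paths or [])}
    for root, dirs, files in os.walk(repo_path):
        resolved_root = Path(root).resolve()  # once per directory, not per file
        if EXCLUDED_DIRS & set(resolved_root.parts):
            dirs[:] = []  # prune the excluded subtree entirely
            continue
        if any(str(resolved_root).startswith(path) for path in excluded):
            dirs[:] = []
            continue
        for file_name in files:  # the per-file work is now just the path join
            yield os.path.join(str(resolved_root), file_name)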

Performance Impact by Test Type:

  • Small repositories (5-20 files): Moderate gains from reduced per-file overhead
  • Medium repositories (50-200 files): Significant gains as directory-level filtering eliminates most redundant path operations
  • Large/nested repositories: Maximum benefit as the optimizations scale linearly with file count rather than quadratically

The optimizations are particularly effective for repositories with many files per directory and multiple excluded directories, which are common in real-world codebases with build artifacts, dependencies, and test directories.

Correctness verification report:

Test | Status
⚙️ Existing Unit Tests | 🔘 None Found
🌀 Generated Regression Tests | 135 Passed
⏪ Replay Tests | 🔘 None Found
🔎 Concolic Coverage Tests | 🔘 None Found
📊 Tests Coverage | 100.0%
🌀 Generated Regression Tests and Runtime
import asyncio  # used to run async functions
# function to test
# --- BEGIN FUNCTION UNDER TEST ---
import os
import shutil
import tempfile
from pathlib import Path
from typing import List, Optional, Set

import pytest
from cognee.tasks.repo_processor.get_repo_file_dependencies import get_source_code_files

EXCLUDED_DIRS: Set[str] = {
    ".venv",
    "venv",
    "env",
    ".env",
    "site-packages",
    "node_modules",
    "dist",
    "build",
    ".git",
    "tests",
    "test",
}
from cognee.tasks.repo_processor.get_repo_file_dependencies import get_source_code_files
# --- END FUNCTION UNDER TEST ---

# ----------- UNIT TESTS START HERE -----------

@pytest.mark.asyncio
async def test_get_source_code_files_basic_python_file(tmp_path):
    # Create a simple Python file
    file = tmp_path / "main.py"
    file.write_text("print('hello')")
    # Await the async function
    result = await get_source_code_files(str(tmp_path))

@pytest.mark.asyncio
async def test_get_source_code_files_basic_multiple_languages(tmp_path):
    # Create files for multiple languages
    py_file = tmp_path / "foo.py"
    js_file = tmp_path / "bar.js"
    ts_file = tmp_path / "baz.ts"
    py_file.write_text("print(1)")
    js_file.write_text("console.log(1)")
    ts_file.write_text("let x: number = 1;")
    result = await get_source_code_files(str(tmp_path))
    # All three should be detected with correct language
    paths = {r[0]: r[1] for r in result}

@pytest.mark.asyncio
async def test_get_source_code_files_basic_no_source_files(tmp_path):
    # No source code files present
    (tmp_path / "README.md").write_text("hello")
    (tmp_path / "data.txt").write_text("data")
    result = await get_source_code_files(str(tmp_path))

@pytest.mark.asyncio
async def test_get_source_code_files_basic_empty_file(tmp_path):
    # Create an empty .py file (should be excluded)
    empty_file = tmp_path / "empty.py"
    empty_file.write_text("")
    result = await get_source_code_files(str(tmp_path))

@pytest.mark.asyncio
async def test_get_source_code_files_basic_test_file_exclusion(tmp_path):
    # Should exclude test files by name pattern
    (tmp_path / "test_utils.py").write_text("def foo(): pass")
    (tmp_path / "foo_test.py").write_text("def bar(): pass")
    (tmp_path / "foo.spec.js").write_text("console.log(1)")
    (tmp_path / "foo.test.ts").write_text("let x = 1;")
    (tmp_path / "main.py").write_text("print('main')")
    result = await get_source_code_files(str(tmp_path))

@pytest.mark.asyncio
async def test_get_source_code_files_basic_excluded_dirs(tmp_path):
    # Files in excluded dirs should not be found
    venv_dir = tmp_path / "venv"
    venv_dir.mkdir()
    (venv_dir / "foo.py").write_text("print('venv')")
    (tmp_path / "main.py").write_text("print('main')")
    result = await get_source_code_files(str(tmp_path))

@pytest.mark.asyncio
async def test_get_source_code_files_basic_custom_language_config(tmp_path):
    # Use a custom language config
    (tmp_path / "foo.bar").write_text("custom")
    config = {"customlang": [".bar"]}
    result = await get_source_code_files(str(tmp_path), language_config=config)

@pytest.mark.asyncio
async def test_get_source_code_files_basic_excluded_paths(tmp_path):
    # Exclude a specific subdirectory using excluded_paths
    subdir = tmp_path / "sub"
    subdir.mkdir()
    (subdir / "foo.py").write_text("print('sub')")
    (tmp_path / "main.py").write_text("print('main')")
    result = await get_source_code_files(str(tmp_path), excluded_paths=[str(subdir)])

@pytest.mark.asyncio
async def test_get_source_code_files_edge_nonexistent_path():
    # Non-existent path should return []
    result = await get_source_code_files("/this/path/does/not/exist")

@pytest.mark.asyncio
async def test_get_source_code_files_edge_symlinked_excluded_dir(tmp_path):
    # Symlink to an excluded dir should be excluded
    real_dir = tmp_path / "real_env"
    real_dir.mkdir()
    (real_dir / "foo.py").write_text("print('env')")
    symlink_dir = tmp_path / "venv"
    symlink_dir.symlink_to(real_dir, target_is_directory=True)
    (tmp_path / "main.py").write_text("print('main')")
    result = await get_source_code_files(str(tmp_path))

@pytest.mark.asyncio
async def test_get_source_code_files_edge_files_with_similar_names(tmp_path):
    # Should only exclude files matching the test patterns, not similar names
    (tmp_path / "testutils.py").write_text("print('not a test')")
    (tmp_path / "main.py").write_text("print('main')")
    result = await get_source_code_files(str(tmp_path))
    # Both files should be included
    files = {os.path.basename(r[0]) for r in result}

@pytest.mark.asyncio
async def test_get_source_code_files_edge_excluded_paths_partial_match(tmp_path):
    # Excluded paths must match full path, not partial
    subdir = tmp_path / "sub"
    subdir.mkdir()
    (subdir / "foo.py").write_text("print('sub')")
    (tmp_path / "main.py").write_text("print('main')")
    # Exclude a path that is NOT a parent of subdir
    result = await get_source_code_files(str(tmp_path), excluded_paths=[str(tmp_path / "other")])
    # Both files should be included
    files = {os.path.basename(r[0]) for r in result}

@pytest.mark.asyncio
async def test_get_source_code_files_edge_case_sensitive(tmp_path):
    # Extension matching should be case sensitive
    (tmp_path / "foo.PY").write_text("print('should not match')")
    (tmp_path / "bar.py").write_text("print('should match')")
    result = await get_source_code_files(str(tmp_path))
    files = {os.path.basename(r[0]) for r in result}

@pytest.mark.asyncio
async def test_get_source_code_files_edge_subdir_excluded_dir(tmp_path):
    # If a file is in a nested excluded dir, it should not be included
    test_dir = tmp_path / "tests"
    test_dir.mkdir()
    (test_dir / "foo.py").write_text("print('test')")
    (tmp_path / "main.py").write_text("print('main')")
    result = await get_source_code_files(str(tmp_path))
    files = {os.path.basename(r[0]) for r in result}

@pytest.mark.asyncio
async def test_get_source_code_files_edge_concurrent_calls(tmp_path):
    # Test concurrent execution on different dirs
    dir1 = tmp_path / "repo1"
    dir2 = tmp_path / "repo2"
    dir1.mkdir()
    dir2.mkdir()
    (dir1 / "a.py").write_text("print('a')")
    (dir2 / "b.js").write_text("console.log('b')")
    # Concurrently call the function
    results = await asyncio.gather(
        get_source_code_files(str(dir1)),
        get_source_code_files(str(dir2)),
    )
    # Each result should contain only the correct file
    files1 = {os.path.basename(r[0]) for r in results[0]}
    files2 = {os.path.basename(r[0]) for r in results[1]}

@pytest.mark.asyncio
async def test_get_source_code_files_large_scale_many_files(tmp_path):
    # Create a directory with many files (but <1000)
    n = 200
    files = []
    for i in range(n):
        f = tmp_path / f"file_{i}.py"
        f.write_text("print('x')")
        files.append(f)
    result = await get_source_code_files(str(tmp_path))
    result_files = {os.path.basename(r[0]) for r in result}
    # All should be present
    for i in range(n):
        pass

@pytest.mark.asyncio
async def test_get_source_code_files_large_scale_nested_dirs(tmp_path):
    # Create nested directories with files
    levels = 5
    base = tmp_path
    expected = set()
    for i in range(levels):
        base = base / f"level_{i}"
        base.mkdir()
        f = base / f"file_{i}.py"
        f.write_text("print('nested')")
        expected.add(str(f.resolve()))
    result = await get_source_code_files(str(tmp_path))
    found = {r[0] for r in result}

@pytest.mark.asyncio
async def test_get_source_code_files_large_scale_concurrent(tmp_path):
    # Concurrently scan the same directory multiple times
    for i in range(10):
        (tmp_path / f"file_{i}.js").write_text("console.log('x')")
    coros = [get_source_code_files(str(tmp_path)) for _ in range(5)]
    results = await asyncio.gather(*coros)
    for res in results:
        files = {os.path.basename(r[0]) for r in res}
        for i in range(10):
            pass

@pytest.mark.asyncio
async def test_get_source_code_files_throughput_small_load(tmp_path):
    # Throughput: small number of files, many concurrent calls
    for i in range(3):
        (tmp_path / f"foo_{i}.py").write_text("print('x')")
    coros = [get_source_code_files(str(tmp_path)) for _ in range(10)]
    results = await asyncio.gather(*coros)
    for res in results:
        pass

@pytest.mark.asyncio
async def test_get_source_code_files_throughput_medium_load(tmp_path):
    # Throughput: medium number of files, moderate concurrency
    for i in range(50):
        (tmp_path / f"bar_{i}.ts").write_text("let x = 1;")
    coros = [get_source_code_files(str(tmp_path)) for _ in range(20)]
    results = await asyncio.gather(*coros)
    for res in results:
        pass

@pytest.mark.asyncio
async def test_get_source_code_files_throughput_large_load(tmp_path):
    # Throughput: large number of files, high concurrency (but <1000)
    for i in range(200):
        (tmp_path / f"baz_{i}.cpp").write_text("int main() { return 0; }")
    coros = [get_source_code_files(str(tmp_path)) for _ in range(50)]
    results = await asyncio.gather(*coros)
    for res in results:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
import asyncio  # used to run async functions
import os
import shutil
import tempfile
from pathlib import Path

import pytest  # used for our unit tests
from cognee.tasks.repo_processor.get_repo_file_dependencies import get_source_code_files

# function to test
# --- Begin: cognee/tasks/repo_processor/get_repo_file_dependencies.py ---
EXCLUDED_DIRS = {
    ".venv",
    "venv",
    "env",
    ".env",
    "site-packages",
    "node_modules",
    "dist",
    "build",
    ".git",
    "tests",
    "test",
}
from cognee.tasks.repo_processor.get_repo_file_dependencies import get_source_code_files
# --- End: cognee/tasks/repo_processor/get_repo_file_dependencies.py ---

# ------------------ UNIT TESTS ------------------

@pytest.fixture
def temp_repo():
    """Create a temporary directory for the repo and clean up after."""
    repo_dir = tempfile.mkdtemp()
    yield repo_dir
    shutil.rmtree(repo_dir)

@pytest.fixture
def make_file():
    """Helper to create a file with content."""
    def _make_file(path, content="print('hello')"):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(content)
        return path
    return _make_file

# -------------- BASIC TEST CASES --------------

@pytest.mark.asyncio
async def test_get_source_code_files_basic_python(temp_repo, make_file):
    # Create a simple python file
    py_file = os.path.join(temp_repo, "main.py")
    make_file(py_file, "print('hello')")
    # Should find the file and return its absolute path and language
    result = await get_source_code_files(temp_repo)

@pytest.mark.asyncio
async def test_get_source_code_files_basic_multiple_languages(temp_repo, make_file):
    # Create files for multiple languages
    js_file = make_file(os.path.join(temp_repo, "app.js"), "console.log('hi')")
    py_file = make_file(os.path.join(temp_repo, "main.py"), "print('hello')")
    ts_file = make_file(os.path.join(temp_repo, "types.ts"), "type X = number;")
    result = await get_source_code_files(temp_repo)
    expected = sorted([
        (os.path.abspath(js_file), "javascript"),
        (os.path.abspath(py_file), "python"),
        (os.path.abspath(ts_file), "typescript"),
    ])

@pytest.mark.asyncio
async def test_get_source_code_files_basic_empty_directory(temp_repo):
    # No files in repo
    result = await get_source_code_files(temp_repo)

@pytest.mark.asyncio
async def test_get_source_code_files_basic_nonexistent_path():
    # Path does not exist
    result = await get_source_code_files("/path/does/not/exist")

@pytest.mark.asyncio
async def test_get_source_code_files_basic_excluded_test_files(temp_repo, make_file):
    # Should exclude test files by name
    test_py = make_file(os.path.join(temp_repo, "test_utils.py"), "print('test')")
    normal_py = make_file(os.path.join(temp_repo, "main.py"), "print('main')")
    spec_js = make_file(os.path.join(temp_repo, "foo.spec.js"), "console.log('spec')")
    result = await get_source_code_files(temp_repo)

@pytest.mark.asyncio
async def test_get_source_code_files_basic_excluded_empty_files(temp_repo, make_file):
    # Should exclude files of size 0
    empty_py = os.path.join(temp_repo, "empty.py")
    make_file(empty_py, "")
    normal_py = make_file(os.path.join(temp_repo, "main.py"), "print('main')")
    result = await get_source_code_files(temp_repo)

# -------------- EDGE TEST CASES --------------

@pytest.mark.asyncio
async def test_get_source_code_files_edge_excluded_dirs(temp_repo, make_file):
    # Should exclude files in EXCLUDED_DIRS
    venv_dir = os.path.join(temp_repo, "venv")
    os.makedirs(venv_dir)
    venv_file = make_file(os.path.join(venv_dir, "foo.py"), "print('venv')")
    src_file = make_file(os.path.join(temp_repo, "src.py"), "print('src')")
    result = await get_source_code_files(temp_repo)

@pytest.mark.asyncio
async def test_get_source_code_files_edge_excluded_paths(temp_repo, make_file):
    # Should exclude files in excluded_paths argument
    subdir = os.path.join(temp_repo, "ignoreme")
    os.makedirs(subdir)
    ignored_file = make_file(os.path.join(subdir, "foo.py"), "print('ignored')")
    included_file = make_file(os.path.join(temp_repo, "main.py"), "print('main')")
    result = await get_source_code_files(temp_repo, excluded_paths=[subdir])

@pytest.mark.asyncio
async def test_get_source_code_files_edge_custom_language_config(temp_repo, make_file):
    # Custom language config should work
    config = {"python": [".py"], "customlang": [".foo"]}
    foo_file = make_file(os.path.join(temp_repo, "bar.foo"), "custom content")
    py_file = make_file(os.path.join(temp_repo, "baz.py"), "print('baz')")
    result = await get_source_code_files(temp_repo, language_config=config)
    expected = sorted([
        (os.path.abspath(foo_file), "customlang"),
        (os.path.abspath(py_file), "python"),
    ])

@pytest.mark.asyncio
async def test_get_source_code_files_edge_no_matching_files(temp_repo, make_file):
    # No files matching language config
    txt_file = make_file(os.path.join(temp_repo, "notes.txt"), "hello world")
    result = await get_source_code_files(temp_repo)

@pytest.mark.asyncio
async def test_get_source_code_files_edge_concurrent_calls(temp_repo, make_file):
    # Create files for concurrency test
    py_file = make_file(os.path.join(temp_repo, "main.py"), "print('main')")
    js_file = make_file(os.path.join(temp_repo, "app.js"), "console.log('app')")
    # Run two concurrent calls with different configs
    config1 = {"python": [".py"]}
    config2 = {"javascript": [".js"]}
    results = await asyncio.gather(
        get_source_code_files(temp_repo, language_config=config1),
        get_source_code_files(temp_repo, language_config=config2)
    )

@pytest.mark.asyncio
async def test_get_source_code_files_edge_nested_excluded_dirs(temp_repo, make_file):
    # Should exclude files in nested excluded dirs
    git_dir = os.path.join(temp_repo, ".git", "subdir")
    os.makedirs(git_dir)
    git_file = make_file(os.path.join(git_dir, "foo.py"), "print('git')")
    src_file = make_file(os.path.join(temp_repo, "src.py"), "print('src')")
    result = await get_source_code_files(temp_repo)

# -------------- LARGE SCALE TEST CASES --------------

@pytest.mark.asyncio
async def test_get_source_code_files_large_scale_many_files(temp_repo, make_file):
    # Create 100 python files and 50 js files
    py_files = [make_file(os.path.join(temp_repo, f"file_{i}.py"), f"print({i})") for i in range(100)]
    js_files = [make_file(os.path.join(temp_repo, f"file_{i}.js"), f"console.log({i})") for i in range(50)]
    result = await get_source_code_files(temp_repo)
    # Should find all files (100 + 50)
    file_paths = set(r[0] for r in result)
    expected_paths = set(map(os.path.abspath, py_files + js_files))
    # All languages should be correct
    langs = set(r[1] for r in result)

@pytest.mark.asyncio
async def test_get_source_code_files_large_scale_concurrent_many_calls(temp_repo, make_file):
    # Create 20 python files
    py_files = [make_file(os.path.join(temp_repo, f"file_{i}.py"), f"print({i})") for i in range(20)]
    # Run 10 concurrent calls
    coros = [get_source_code_files(temp_repo) for _ in range(10)]
    results = await asyncio.gather(*coros)
    for result in results:
        pass

# -------------- THROUGHPUT TEST CASES --------------

@pytest.mark.asyncio
async def test_get_source_code_files_throughput_small_load(temp_repo, make_file):
    # Small load: 5 files
    files = [make_file(os.path.join(temp_repo, f"small_{i}.py"), f"print({i})") for i in range(5)]
    result = await get_source_code_files(temp_repo)

@pytest.mark.asyncio
async def test_get_source_code_files_throughput_medium_load(temp_repo, make_file):
    # Medium load: 50 files
    files = [make_file(os.path.join(temp_repo, f"medium_{i}.py"), f"print({i})") for i in range(50)]
    result = await get_source_code_files(temp_repo)

@pytest.mark.asyncio
async def test_get_source_code_files_throughput_high_volume(temp_repo, make_file):
    # High volume: 200 files
    files = [make_file(os.path.join(temp_repo, f"high_{i}.py"), f"print({i})") for i in range(200)]
    result = await get_source_code_files(temp_repo)

@pytest.mark.asyncio
async def test_get_source_code_files_throughput_sustained_concurrent(temp_repo, make_file):
    # Create 30 files
    files = [make_file(os.path.join(temp_repo, f"sustained_{i}.py"), f"print({i})") for i in range(30)]
    # Run 5 concurrent calls, sustained pattern
    coros = [get_source_code_files(temp_repo) for _ in range(5)]
    results = await asyncio.gather(*coros)
    for result in results:
        pass
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-get_source_code_files-mh11film and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 21, 2025 20:49
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash label Oct 21, 2025