Reduce find_indirect_clusters() runtime through neighborhood detection and sampling #2144
Conversation
- Added depth limit for relationship clustering.
- Utilized NetworkX to create a graph and identify clusters.
- Implemented filtering of valid paths based on relationships.
- Removed previous depth-first search approach in favor of clique-based clustering.
…den algorithm for community detection.
- Implemented a new method `get_node_by_id` in `KnowledgeGraph` for retrieving nodes by their ID.
- Enhanced the `find_indirect_clusters` method to utilize the Leiden algorithm for community detection.
- Updated unit tests to reflect changes in cluster structures and relationships.
… faster (non-exhaustive) generation
- Refactored `find_indirect_clusters` into separate helper functions.
- Enhanced the `find_indirect_clusters` method to sample from random walks within the clusters.
- Minor refinements to unit tests.
Greptile Summary
This PR addresses a critical performance issue in the `find_indirect_clusters()` method by replacing an exponential-time depth-first search algorithm with a Leiden clustering approach. The original implementation had O(exp(n)) complexity and would run indefinitely on moderately large knowledge graphs (reported cases of running for days on graphs with ~1800 nodes and ~900k relationships).
The new implementation uses community detection via the Leiden algorithm to identify neighborhoods in the graph, then either exhaustively enumerates paths for small clusters or samples paths for larger ones. This maintains the original algorithm's semantics of allowing nodes to appear in multiple clusters (unlike BFS approaches that create disjoint clusters) while dramatically improving performance.
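As a rough illustration of the neighborhood step: the PR uses the Leiden algorithm (via scikit-network), but the overall shape — partition nodes into neighborhoods, then work within each neighborhood — can be sketched dependency-free with plain connected components as a stand-in community detector. `node_clusters` is a hypothetical name, not the PR's `get_node_clusters()`.

```python
from collections import deque

def node_clusters(edges):
    """Group nodes into neighborhoods. Illustrative stand-in for the
    PR's Leiden-based helper: here each connected component is
    treated as one neighborhood."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first walk collects the whole component.
        component, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in component:
                continue
            component.add(node)
            queue.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

clusters = node_clusters([("a", "b"), ("b", "c"), ("x", "y")])
# Two neighborhoods: {a, b, c} and {x, y}
```

Leiden would additionally split dense regions of a single component into separate communities; connected components only capture the "partition then process per group" structure.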
Key changes include:
- Complete rewrite of `find_indirect_clusters()` method in `graph.py`
- Addition of helper functions: `get_node_clusters()`, `max_simple_paths()`, and `sample_paths_from_graph()`
- New dependencies: `networkx` and `scikit-network` added to `pyproject.toml`
- Comprehensive test suite in new `test_graph.py` file
- Removal of `test_knowledge_graph_save.py` (appears unintentional)
The algorithm now uses a threshold of 1000 estimated paths to decide between exhaustive enumeration and sampling, providing a balance between performance and completeness. This change is essential for the multi-hop query synthesizer functionality, which depends on `find_indirect_clusters` to identify node clusters for generating test scenarios.
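The enumerate-or-sample split can be sketched in a dependency-free way. Only the 1000-path threshold and the two-branch decision come from the description above; the function names (`find_paths`, `estimate_path_count`, `sample_paths`) and the estimator body are hypothetical stand-ins for the PR's `max_simple_paths()` and `sample_paths_from_graph()` helpers.

```python
import random

PATH_THRESHOLD = 1000  # threshold stated in the PR description

def estimate_path_count(adj, max_depth):
    """Crude upper bound on path count: n * b**depth, where b is the
    largest out-degree. (Illustrative; the real helper may differ.)"""
    branching = max((len(targets) for targets in adj.values()), default=0)
    return len(adj) * branching ** max_depth

def all_simple_paths(adj, max_depth):
    """Exhaustively enumerate simple paths of up to max_depth edges."""
    paths = []

    def dfs(path):
        if len(path) > 1:
            paths.append(list(path))
        if len(path) - 1 == max_depth:
            return
        for nxt in adj.get(path[-1], ()):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                path.append(nxt)
                dfs(path)
                path.pop()

    for start in adj:
        dfs([start])
    return paths

def sample_paths(adj, max_depth, sample_size, rng):
    """Non-exhaustive alternative: sample simple paths via random walks."""
    nodes = list(adj)
    seen = set()
    for _ in range(sample_size):
        path = [rng.choice(nodes)]
        for _ in range(max_depth):
            choices = [n for n in adj.get(path[-1], ()) if n not in path]
            if not choices:
                break
            path.append(rng.choice(choices))
        if len(path) > 1:
            seen.add(tuple(path))
    return [list(p) for p in seen]

def find_paths(adj, max_depth=3, rng=None):
    """Enumerate when the estimated path count is small, else sample."""
    rng = rng or random.Random(0)
    if estimate_path_count(adj, max_depth) <= PATH_THRESHOLD:
        return all_simple_paths(adj, max_depth)
    return sample_paths(adj, max_depth, PATH_THRESHOLD, rng)
```

A sparse three-node chain stays under the threshold and is enumerated exhaustively; a dense ten-node clique blows past it and falls back to random-walk sampling, which is where the "probabilistic guarantees" caveat in the review below comes from.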
Confidence score: 2/5
- This PR has significant algorithmic changes that could introduce subtle behavioral differences in production
- The removal of Unicode character testing seems unrelated to the clustering optimization and potentially risky
- Magic numbers (sample_size=1000, threshold=1000) are hardcoded without clear justification
- The switch from exhaustive to probabilistic sampling fundamentally changes the algorithm's guarantees
`ragas/src/ragas/testset/graph.py` and `ragas/tests/unit/test_graph.py` need more attention due to algorithmic complexity and test coverage gaps.
4 files reviewed, 5 comments
`ragas/tests/unit/test_graph.py` (Outdated)
```python
# print(f"\n=== Depth Limit {depth_limit} ===")
# print(f"Found {len(clusters)} clusters, expected {len(expected_clusters)}")

# # Helper function to get node names from a cluster
# def get_cluster_names(cluster):
#     return sorted(
#         [node.properties.get("id", str(node.id)[:6]) for node in cluster]
#     )

# print("\nFound clusters:")
# for i, cluster in enumerate(
#     sorted(clusters, key=lambda c: (len(c), get_cluster_names(c)))
# ):
#     names = get_cluster_names(cluster)
#     print(f"  {i + 1}. {{{', '.join(names)}}}")

# print("\nExpected clusters:")
# for i, cluster in enumerate(
#     sorted(expected_clusters, key=lambda c: (len(c), get_cluster_names(c)))
# ):
#     names = get_cluster_names(cluster)
#     print(f"  {i + 1}. {{{', '.join(names)}}}")

# # Show differences if any
# found_sets = {frozenset(get_cluster_names(c)) for c in clusters}
# expected_sets = {frozenset(get_cluster_names(c)) for c in expected_clusters}

# if found_sets != expected_sets:
#     missing = expected_sets - found_sets
#     extra = found_sets - expected_sets
#     if missing:
#         print(f"\nMissing clusters: {[set(s) for s in missing]}")
#     if extra:
#         print(f"Extra clusters: {[set(s) for s in extra]}")
# else:
#     print("\n✓ All clusters match!")
# print("=" * 40)
```
style: Remove commented-out debug code before merging
`ragas/src/ragas/testset/graph.py` (Outdated)
```python
while True:
    nodes_with_no_outpaths = [
        n for n in graph.nodes() if graph.out_degree(n) == 0
    ]
    if not nodes_with_no_outpaths:
        break
    graph.remove_nodes_from(nodes_with_no_outpaths)
```
logic: Modifies the input graph in-place, which could cause unexpected side effects. Consider using graph.copy() first.
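The suggested fix is the standard copy-before-mutate pattern (in networkx terms, operating on `graph.copy()` before the pruning loop). Below is a dependency-free sketch of the same iterative sink-pruning on a plain adjacency dict; `prune_sinks` is a hypothetical name, not code from the PR.

```python
def prune_sinks(adjacency):
    """Iteratively drop nodes with no outgoing edges, without
    mutating the caller's adjacency mapping."""
    # Defensive copy so the caller's graph is untouched.
    graph = {node: set(targets) for node, targets in adjacency.items()}
    while True:
        sinks = [node for node, targets in graph.items() if not targets]
        if not sinks:
            return graph
        for sink in sinks:
            del graph[sink]
        # Removing a sink can turn its predecessors into new sinks,
        # which the next loop iteration picks up.
        for targets in graph.values():
            targets.difference_update(sinks)
```

After pruning, only nodes on cycles (or feeding into cycles) remain, and the original mapping is unchanged.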
…ssues
- Changed type hint from `uuid.UUID | str` to `t.Union[uuid.UUID, str]` for clarity.
- Added `type: ignore` comments to bypass type hint issues in sknetwork Dataset.
Like @mludvig found in #2071, I find find_indirect_clusters' use of exhaustive depth-first search to be a significant barrier on even a moderately substantial knowledge graph. That PR uses BFS to identify a set of disjoint clusters involving the source node (each node appears in at most one cluster), whereas the original find_indirect_clusters identifies all clusters up to length depth_limit from each node. @mludvig, if I'm out of line here, please correct me!
The approach in this PR instead identifies neighborhoods in the graph using the Leiden clustering algorithm and samples from those neighborhoods. I believe this to be a better approach: in my testing it is even faster than BFS, and the function returns something more in line with the original find_indirect_clusters implementation. I would have preferred (and, in fact, originally tried) a Clique Percolation Method approach because it allows nodes to belong to multiple neighborhoods; however, CPM also runs into NP-hard / NP-complete runtime issues.
This PR does add two dependencies to Ragas - networkx and scikit-network.