
Conversation

ahgraber
Contributor

Like @mludvig found in #2071, I find that find_indirect_clusters' exhaustive depth-first search is a significant barrier to using it on even moderately large knowledge graphs. That PR uses BFS to identify a set of disjoint clusters involving the source node (each node appears in at most one cluster), whereas the original find_indirect_clusters identifies all clusters up to length depth_limit from each node. @mludvig, if I'm out of line here, please correct me!

The approach in this PR instead identifies neighborhoods in the graph using the Leiden clustering algorithm and samples paths from those neighborhoods. I believe this to be a better approach: in my testing it is even faster than BFS, and the function returns something more in line with the original find_indirect_clusters implementation.

I would have preferred (and, in fact, originally tried) a Clique Percolation Method approach because it allows nodes to belong to multiple neighborhoods; however, CPM also runs into NP-hard runtime issues.

This PR adds two dependencies to Ragas: networkx and scikit-network.
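To make the sampling step concrete, here is a rough stdlib-only sketch of path sampling via random walks within one neighborhood. It operates on a plain adjacency dict rather than the PR's networkx graphs, and sample_paths is a hypothetical helper name, not the PR's API:

```python
import random

def sample_paths(adj, nodes, n_samples=10, depth_limit=3, seed=0):
    """Sample simple paths inside one cluster via random walks.

    adj: dict mapping node -> set of neighbour nodes.
    Returns a set of tuples, each a simple path of 2..depth_limit nodes.
    """
    rng = random.Random(seed)
    paths = set()
    for _ in range(n_samples):
        path = [rng.choice(nodes)]
        while len(path) < depth_limit:
            # Only step to unvisited neighbours, keeping the path simple.
            candidates = [n for n in adj[path[-1]] if n not in path]
            if not candidates:
                break
            path.append(rng.choice(candidates))
        if len(path) >= 2:
            paths.add(tuple(path))
    return paths

# Toy usage on a triangle graph:
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(sample_paths(adj, ["a", "b", "c"], n_samples=20, seed=42))
```

Because a node may sit in several sampled paths, this preserves the original semantics of nodes belonging to multiple clusters, unlike the disjoint BFS partition.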

ahgraber added 8 commits July 24, 2025 13:36
- Added depth limit for relationship clustering.
- Utilized NetworkX to create a graph and identify clusters.
- Implemented filtering of valid paths based on relationships.
- Removed previous depth-first search approach in favor of clique-based clustering.
…den algorithm for community detection.

- Implemented a new method `get_node_by_id` in `KnowledgeGraph` for retrieving nodes by their ID.
- Enhanced the `find_indirect_clusters` method to utilize the Leiden algorithm for community detection.
- Updated unit tests to reflect changes in cluster structures and relationships.
… faster (non-exhaustive) generation

- refactored find_indirect_clusters to separate helper functions
- Enhanced the `find_indirect_clusters` method to sample from random walks within the clusters.
- Minor refinements to unit tests.
@dosubot bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files) on Jul 25, 2025
Contributor

@greptile-apps bot left a comment


Greptile Summary

This PR addresses a critical performance issue in the find_indirect_clusters() method by replacing an exponential-time depth-first search algorithm with a Leiden clustering approach. The original implementation had O(exp(n)) complexity and would run indefinitely on moderately large knowledge graphs (reported cases of running for days on graphs with ~1800 nodes and ~900k relationships).

The new implementation uses community detection via the Leiden algorithm to identify neighborhoods in the graph, then either exhaustively enumerates paths for small clusters or samples paths for larger ones. This maintains the original algorithm's semantics of allowing nodes to appear in multiple clusters (unlike BFS approaches that create disjoint clusters) while dramatically improving performance.

Key changes include:

  • Complete rewrite of find_indirect_clusters() method in graph.py
  • Addition of helper functions: get_node_clusters(), max_simple_paths(), and sample_paths_from_graph()
  • New dependencies: networkx and scikit-network added to pyproject.toml
  • Comprehensive test suite in new test_graph.py file
  • Removal of test_knowledge_graph_save.py (appears unintentional)

The algorithm now uses a threshold of 1000 estimated paths to decide between exhaustive enumeration and sampling, providing a balance between performance and completeness. This change is essential for the multi-hop query synthesizer functionality, which depends on find_indirect_clusters to identify node clusters for generating test scenarios.
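The exhaustive-vs-sampling decision can be approximated as follows. This is a sketch, not the PR's code: estimate_max_simple_paths and should_sample are hypothetical names, and the fully-connected upper bound (permutations of cluster nodes) is an assumption about how the estimate might work, not necessarily the PR's max_simple_paths formula:

```python
from math import perm

def estimate_max_simple_paths(n_nodes, depth_limit):
    """Upper bound on the number of simple paths of 2..depth_limit nodes
    in a cluster of n_nodes, assuming the cluster is fully connected:
    sum over path lengths k of P(n_nodes, k) ordered node choices."""
    return sum(perm(n_nodes, k) for k in range(2, depth_limit + 1))

PATH_THRESHOLD = 1000  # threshold value reported in the summary above

def should_sample(n_nodes, depth_limit, threshold=PATH_THRESHOLD):
    """Sample paths when exhaustive enumeration would be too expensive."""
    return estimate_max_simple_paths(n_nodes, depth_limit) > threshold

# A 4-node cluster at depth 3 has at most 12 + 24 = 36 simple paths,
# so it is enumerated exhaustively; a 10-node cluster at depth 4 is sampled.
print(estimate_max_simple_paths(4, 3))
```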

Confidence score: 2/5

  • This PR has significant algorithmic changes that could introduce subtle behavioral differences in production
  • The removal of Unicode character testing seems unrelated to the clustering optimization and potentially risky
  • Magic numbers (sample_size=1000, threshold=1000) are hardcoded without clear justification
  • The switch from exhaustive to probabilistic sampling fundamentally changes the algorithm's guarantees
  • ragas/src/ragas/testset/graph.py, ragas/tests/unit/test_graph.py need more attention due to algorithmic complexity and test coverage gaps

4 files reviewed, 5 comments


Comment on lines 193 to 230
# print(f"\n=== Depth Limit {depth_limit} ===")
# print(f"Found {len(clusters)} clusters, expected {len(expected_clusters)}")

# # Helper function to get node names from a cluster
# def get_cluster_names(cluster):
#     return sorted(
#         [node.properties.get("id", str(node.id)[:6]) for node in cluster]
#     )

# print("\nFound clusters:")
# for i, cluster in enumerate(
#     sorted(clusters, key=lambda c: (len(c), get_cluster_names(c)))
# ):
#     names = get_cluster_names(cluster)
#     print(f"  {i + 1}. {{{', '.join(names)}}}")

# print("\nExpected clusters:")
# for i, cluster in enumerate(
#     sorted(expected_clusters, key=lambda c: (len(c), get_cluster_names(c)))
# ):
#     names = get_cluster_names(cluster)
#     print(f"  {i + 1}. {{{', '.join(names)}}}")

# # Show differences if any
# found_sets = {frozenset(get_cluster_names(c)) for c in clusters}
# expected_sets = {frozenset(get_cluster_names(c)) for c in expected_clusters}

# if found_sets != expected_sets:
#     missing = expected_sets - found_sets
#     extra = found_sets - expected_sets
#     if missing:
#         print(f"\nMissing clusters: {[set(s) for s in missing]}")
#     if extra:
#         print(f"Extra clusters: {[set(s) for s in extra]}")
# else:
#     print("\n✓ All clusters match!")
# print("=" * 40)

Contributor


style: Remove commented-out debug code before merging

Comment on lines 382 to 388
while True:
    nodes_with_no_outpaths = [
        n for n in graph.nodes() if graph.out_degree(n) == 0
    ]
    if not nodes_with_no_outpaths:
        break
    graph.remove_nodes_from(nodes_with_no_outpaths)
Contributor


logic: Modifies input graph in-place which could cause unexpected side effects. Consider using graph.copy() first.
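The copy-first fix can be illustrated on a plain adjacency dict rather than a networkx graph (prune_sink_nodes is a hypothetical stand-in for the snippet above, not the PR's helper): build a copy, then iteratively remove sink nodes from the copy so the caller's graph is untouched.

```python
def prune_sink_nodes(adj):
    """Iteratively remove nodes with no outgoing edges, without
    mutating the caller's graph.  adj: dict node -> set of successors."""
    # Copy both the dict and each successor set so the input is untouched.
    graph = {node: set(succs) for node, succs in adj.items()}
    while True:
        sinks = [n for n, succs in graph.items() if not succs]
        if not sinks:
            return graph
        for n in sinks:
            del graph[n]
        for succs in graph.values():
            succs.difference_update(sinks)

# A chain a -> b -> c collapses entirely; the original dict is unchanged.
chain = {"a": {"b"}, "b": {"c"}, "c": set()}
print(prune_sink_nodes(chain), chain)
```

With networkx the equivalent would be to run the loop on `graph.copy()` instead of `graph`, as the comment suggests.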
