Reduce find_indirect_clusters() runtime through neighborhood detection and sampling #2144
Conversation
- Added depth limit for relationship clustering.
- Utilized NetworkX to create a graph and identify clusters.
- Implemented filtering of valid paths based on relationships.
- Removed previous depth-first search approach in favor of clique-based clustering.
…den algorithm for community detection.
- Implemented a new method `get_node_by_id` in `KnowledgeGraph` for retrieving nodes by their ID.
- Enhanced the `find_indirect_clusters` method to utilize the Leiden algorithm for community detection.
- Updated unit tests to reflect changes in cluster structures and relationships.
… faster (non-exhaustive) generation
- Refactored `find_indirect_clusters` into separate helper functions.
- Enhanced the `find_indirect_clusters` method to sample from random walks within the clusters.
- Minor refinements to unit tests.
Greptile Summary
This PR addresses a critical performance issue in the `find_indirect_clusters()` method by replacing an exponential-time depth-first search algorithm with a Leiden clustering approach. The original implementation had O(exp(n)) complexity and would run indefinitely on moderately large knowledge graphs (reported cases of running for days on graphs with ~1800 nodes and ~900k relationships).
The new implementation uses community detection via the Leiden algorithm to identify neighborhoods in the graph, then either exhaustively enumerates paths for small clusters or samples paths for larger ones. This maintains the original algorithm's semantics of allowing nodes to appear in multiple clusters (unlike BFS approaches that create disjoint clusters) while dramatically improving performance.
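As a rough illustration of the neighborhood step: the PR uses the Leiden algorithm (via scikit-network), but the overall shape — partition nodes into neighborhoods, then work within each neighborhood — can be sketched dependency-free with plain connected components as a stand-in community detector. `node_clusters` is a hypothetical name, not the PR's `get_node_clusters()`.

```python
from collections import deque

def node_clusters(edges):
    """Group nodes into neighborhoods. Illustrative stand-in for the
    PR's Leiden-based helper: here each connected component is
    treated as one neighborhood."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first walk collects the whole component.
        component, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in component:
                continue
            component.add(node)
            queue.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

clusters = node_clusters([("a", "b"), ("b", "c"), ("x", "y")])
# Two neighborhoods: {a, b, c} and {x, y}
```

Leiden would additionally split dense regions of a single component into separate communities; connected components only capture the "partition then process per group" structure.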
Key changes include:
- Complete rewrite of `find_indirect_clusters()` method in `graph.py`
- Addition of helper functions: `get_node_clusters()`, `max_simple_paths()`, and `sample_paths_from_graph()`
- New dependencies: `networkx` and `scikit-network` added to `pyproject.toml`
- Comprehensive test suite in new `test_graph.py` file
- Removal of `test_knowledge_graph_save.py` (appears unintentional)
The algorithm now uses a threshold of 1000 estimated paths to decide between exhaustive enumeration and sampling, providing a balance between performance and completeness. This change is essential for the multi-hop query synthesizer functionality, which depends on `find_indirect_clusters` to identify node clusters for generating test scenarios.
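The enumerate-or-sample split can be sketched in a dependency-free way. Only the 1000-path threshold and the two-branch decision come from the description above; the function names (`find_paths`, `estimate_path_count`, `sample_paths`) and the estimator body are hypothetical stand-ins for the PR's `max_simple_paths()` and `sample_paths_from_graph()` helpers.

```python
import random

PATH_THRESHOLD = 1000  # threshold stated in the PR description

def estimate_path_count(adj, max_depth):
    """Crude upper bound on path count: n * b**depth, where b is the
    largest out-degree. (Illustrative; the real helper may differ.)"""
    branching = max((len(targets) for targets in adj.values()), default=0)
    return len(adj) * branching ** max_depth

def all_simple_paths(adj, max_depth):
    """Exhaustively enumerate simple paths of up to max_depth edges."""
    paths = []

    def dfs(path):
        if len(path) > 1:
            paths.append(list(path))
        if len(path) - 1 == max_depth:
            return
        for nxt in adj.get(path[-1], ()):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                path.append(nxt)
                dfs(path)
                path.pop()

    for start in adj:
        dfs([start])
    return paths

def sample_paths(adj, max_depth, sample_size, rng):
    """Non-exhaustive alternative: sample simple paths via random walks."""
    nodes = list(adj)
    seen = set()
    for _ in range(sample_size):
        path = [rng.choice(nodes)]
        for _ in range(max_depth):
            choices = [n for n in adj.get(path[-1], ()) if n not in path]
            if not choices:
                break
            path.append(rng.choice(choices))
        if len(path) > 1:
            seen.add(tuple(path))
    return [list(p) for p in seen]

def find_paths(adj, max_depth=3, rng=None):
    """Enumerate when the estimated path count is small, else sample."""
    rng = rng or random.Random(0)
    if estimate_path_count(adj, max_depth) <= PATH_THRESHOLD:
        return all_simple_paths(adj, max_depth)
    return sample_paths(adj, max_depth, PATH_THRESHOLD, rng)
```

A sparse three-node chain stays under the threshold and is enumerated exhaustively; a dense ten-node clique blows past it and falls back to random-walk sampling, which is where the "probabilistic guarantees" caveat in the review below comes from.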
Confidence score: 2/5
- This PR has significant algorithmic changes that could introduce subtle behavioral differences in production
- The removal of Unicode character testing seems unrelated to the clustering optimization and potentially risky
- Magic numbers (sample_size=1000, threshold=1000) are hardcoded without clear justification
- The switch from exhaustive to probabilistic sampling fundamentally changes the algorithm's guarantees
`ragas/src/ragas/testset/graph.py` and `ragas/tests/unit/test_graph.py` need more attention due to algorithmic complexity and test coverage gaps.
4 files reviewed, 5 comments
`ragas/tests/unit/test_graph.py` (Outdated)
```python
# print(f"\n=== Depth Limit {depth_limit} ===")
# print(f"Found {len(clusters)} clusters, expected {len(expected_clusters)}")

# # Helper function to get node names from a cluster
# def get_cluster_names(cluster):
#     return sorted(
#         [node.properties.get("id", str(node.id)[:6]) for node in cluster]
#     )

# print("\nFound clusters:")
# for i, cluster in enumerate(
#     sorted(clusters, key=lambda c: (len(c), get_cluster_names(c)))
# ):
#     names = get_cluster_names(cluster)
#     print(f"  {i + 1}. {{{', '.join(names)}}}")

# print("\nExpected clusters:")
# for i, cluster in enumerate(
#     sorted(expected_clusters, key=lambda c: (len(c), get_cluster_names(c)))
# ):
#     names = get_cluster_names(cluster)
#     print(f"  {i + 1}. {{{', '.join(names)}}}")

# # Show differences if any
# found_sets = {frozenset(get_cluster_names(c)) for c in clusters}
# expected_sets = {frozenset(get_cluster_names(c)) for c in expected_clusters}

# if found_sets != expected_sets:
#     missing = expected_sets - found_sets
#     extra = found_sets - expected_sets
#     if missing:
#         print(f"\nMissing clusters: {[set(s) for s in missing]}")
#     if extra:
#         print(f"Extra clusters: {[set(s) for s in extra]}")
# else:
#     print("\n✓ All clusters match!")
# print("=" * 40)
```
style: Remove commented-out debug code before merging
`ragas/src/ragas/testset/graph.py` (Outdated)
```python
while True:
    nodes_with_no_outpaths = [
        n for n in graph.nodes() if graph.out_degree(n) == 0
    ]
    if not nodes_with_no_outpaths:
        break
    graph.remove_nodes_from(nodes_with_no_outpaths)
```
logic: Modifies the input graph in-place, which could cause unexpected side effects. Consider using graph.copy() first.
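The suggested fix is the standard copy-before-mutate pattern (in networkx terms, operating on `graph.copy()` before the pruning loop). Below is a dependency-free sketch of the same iterative sink-pruning on a plain adjacency dict; `prune_sinks` is a hypothetical name, not code from the PR.

```python
def prune_sinks(adjacency):
    """Iteratively drop nodes with no outgoing edges, without
    mutating the caller's adjacency mapping."""
    # Defensive copy so the caller's graph is untouched.
    graph = {node: set(targets) for node, targets in adjacency.items()}
    while True:
        sinks = [node for node, targets in graph.items() if not targets]
        if not sinks:
            return graph
        for sink in sinks:
            del graph[sink]
        # Removing a sink can turn its predecessors into new sinks,
        # which the next loop iteration picks up.
        for targets in graph.values():
            targets.difference_update(sinks)
```

After pruning, only nodes on cycles (or feeding into cycles) remain, and the original mapping is unchanged.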
…ssues
- Changed type hint from `uuid.UUID | str` to `t.Union[uuid.UUID, str]` for clarity.
- Added `type: ignore` comments to bypass type hint issues in sknetwork Dataset.
Like @mludvig found in #2071, I find find_indirect_clusters' use of exhaustive depth-first search to be a significant barrier on even a moderately substantial knowledge graph. That PR uses BFS to identify a set of disjoint clusters involving the source node (each node appears in at most one cluster), whereas the original find_indirect_clusters identifies all clusters up to length depth_limit from each node. @mludvig, if I'm out of line here, please correct me!
The approach in this PR instead identifies neighborhoods in the graph using the Leiden clustering algorithm and samples from those neighborhoods. I believe this to be a better approach: in my testing it is even faster than BFS, and the function returns something more in line with the original find_indirect_clusters implementation. I would have preferred (and, in fact, originally tried) a Clique Percolation Method approach because it allows nodes to belong to multiple neighborhoods; however, CPM also runs into NP-hard / NP-complete runtime issues.
This PR does add two dependencies to Ragas - networkx and scikit-network.