-
Notifications
You must be signed in to change notification settings - Fork 25
fix(backend): improve uniqueness constraint query #6271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: stable
Are you sure you want to change the base?
Conversation
da8c525
to
9afab1f
Compare
CodSpeed Performance ReportMerging #6271 will not alter performanceComparing Summary
|
3e8083f
to
9d8175b
Compare
c99e4d2
to
fb6a039
Compare
@ajtmccarty I'd like some help to test this change, I'm not sure if we have enough unit tests to cover all cases. I'm especially afraid of cases when we have multiple uniqueness constraints tuples using the same attributes. |
699288c
to
688bd82
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a commit with some more unit tests that I think cover the cases you are worried about, but let me know if you think others are required
This should avoid processing useless rows when we have relationship only constraints. This especially improves the IPAM use-case, where the uniqueness constraint is the (address/prefix, namespace) tuple. Without this change, the query would check for nodes having that same address and fetch ALL nodes present in the same namespace. With this change, the query is now FIRST checking nodes having that same address and THEN checking if these previous nodes are in the same namespace. Signed-off-by: Fatih Acar <[email protected]>
4158ed5
to
7dbcf43
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a commit to correctly support matching paths where there are no attribute-level matches and 1 or more relationship-level matches
@@ -103,48 +103,77 @@ async def query_init(self, db: InfrahubDatabase, **kwargs: Any) -> None: # noqa | |||
) | |||
|
|||
attr_paths_subquery = """ | |||
MATCH attr_path = (start_node:%(node_kind)s)-[:HAS_ATTRIBUTE]->(attr:Attribute)-[r:HAS_VALUE]->(attr_value:AttributeValue) | |||
OPTIONAL MATCH attr_path = (attr_start_node:%(node_kind)s)-[:HAS_ATTRIBUTE]->(attr:Attribute)-[r:HAS_VALUE]->(attr_value:AttributeValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the OPTIONAL
here to account for cases where there are no attribute-level matches but there are relationship-level matches
""" % {"node_kind": self.query_request.kind} | ||
|
||
relationship_attr_paths_with_value_subquery = """ | ||
MATCH rel_path = (start_node:%(node_kind)s)-[:IS_RELATED]-(relationship_node:Relationship)-[:IS_RELATED]-(related_n:Node)-[:HAS_ATTRIBUTE]->(rel_attr:Attribute)-[:HAS_VALUE]->(rel_attr_value:AttributeValue) | ||
OPTIONAL MATCH rel_path = (rel_start_node:%(node_kind)s)-[:IS_RELATED]-(relationship_node:Relationship)-[:IS_RELATED]-(related_n:Node)-[:HAS_ATTRIBUTE]->(rel_attr:Attribute)-[:HAS_VALUE]->(rel_attr_value:AttributeValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the variable name on this line and 120 b/c if the attr_start_node
is null
in the first MATCH
statement, then these later MATCH
s will never return any results while they use the same variable name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is embarassing... that was the whole point of the change: reusing the same start_node
we used to match attribute values, so that it does not have to traverse a lot of nodes when looking for rel_path without value...
We probably need to build different queries for each case then... I'm not sure we can do a single query that can handle all cases.
} | ||
WITH DISTINCT start_node, potential_path, rel_identifier, potential_attr, potential_attr_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added DISTINCT
here b/c I believe that the initial subquery that identifies attr_start_node
, rel_start_node
, and rel_only_start_node
can return duplicates
select_subqueries_str = "UNION".join(select_subqueries) | ||
select_subqueries_str = "".join(select_subqueries) | ||
return_subqueries_str = ", ".join(returned_attributes) | ||
filter_subqueries_str = "UNION".join(filter_subqueries) | ||
|
||
# ruff: noqa: E501 | ||
query = """ | ||
// get attributes for node and its relationships | ||
CALL { | ||
%(select_subqueries_str)s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having all of the initial OPTIONAL MATCH
statements within this single CALL
subquery means that we can get duplicated rows in the results
for example, if the attr_paths_subquery
matches 5 paths and the relationship_only_attr_paths_subquery
matches 5 paths, then we will get 25 rows returned, one for each combination of attr and rel path
I assume that this single CALL
subquery instead of 1-3 separate CALL
subqueries improves performance b/c we are usually operating on a pretty small number of rows, but just wanted to make sure we're all on the same page for how this works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, having a single CALL
subquery allows reusing the start_node
we used when filtering attribute values so that we filter early.
The change you've done above makes this useless now though 🤔
This PR is on hold after internal discussions, another approach is required to solve this. |
This should avoid processing useless rows when we have relationship only constraints.
This especially improves the IPAM use-case, where the uniqueness constraint is the (address/prefix, namespace) tuple. Without this change, the query would check for nodes having that same address and fetch ALL nodes present in the same namespace. With this change, the query is now FIRST checking nodes having that same address and THEN checking if these previous nodes are in the same namespace.
I'm not sure if our test coverage is enough to correctly test this change, a thorough review along with additional unit tests may be required...
EDIT: the "OPTIONAL MATCH" fix was caught only by this test:
backend/tests/unit/core/constraint_validators/test_node_grouped_uniqueness.py::TestNodeGroupedUniquenessConstraint::test_subset_hfid_violated
so I'm afraid of missing more unit tests to cover all cases...