Skip to content

Add subquery support for EmbeddedModelArrayField #314

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 18, 2025

Conversation

WaVEV
Copy link
Collaborator

@WaVEV WaVEV commented Jun 6, 2025

Subquery and annotation handling.

@WaVEV WaVEV marked this pull request as ready for review June 9, 2025 21:34
@WaVEV WaVEV requested review from timgraham and Jibola June 9, 2025 21:34
Jibola
Jibola previously requested changes Jun 10, 2025
Comment on lines 150 to 170
{"$project": {"tmp_name": expr.as_mql(compiler, connection)}},
{
"$unwind": "$tmp_name",
},
{
"$group": {
"_id": None,
"tmp_name": {"$addToSet": "$tmp_name"},
}
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this code correctly:

  1. We rename a field (The array) to a tmp_name
    a. If there's a subquery, it will produce another array.
  2. We unwind (the array)
  3. We take the value and add it back to a set to group all items into one
  4. We only pick the first item in the set? Or is this returning an array of arrays?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the field we're querying on the RHS consists of multiple arrays.
Maybe an example will make this clearer. This example uses ArrayField, but it's analogous to EMAF:
Given the following documents.

[{'_id': ObjectId('684a4ddbc6db6230d7591003'),
  'field': [1],
  'field_nested': None,
  'order': 1},
 {'_id': ObjectId('684a4ddbc6db6230d7591004'),
  'field': [2],
  'field_nested': None,
  'order': 2},
 {'_id': ObjectId('684a4ddbc6db6230d7591005'),
  'field': [2, 3],
  'field_nested': None,
  'order': 3},
 {'_id': ObjectId('684a4ddbc6db6230d7591006'),
  'field': [20, 30, 40],
  'field_nested': None,
  'order': 4},
 {'_id': ObjectId('684a4ddbc6db6230d7591007'),
  'field': None,
  'field_nested': None,
  'order': 5}]

Suppose we want to apply an $overlap query using a subquery. For example:

qs = NullableIntegerArrayModel.objects.filter(order__lt=3)
        self.assertCountEqual(
            NullableIntegerArrayModel.objects.filter(
                field__overlap=qs.values_list("field"),
            ),
            self.objs[:3],
        )

To do this, we need to concatenate all the values from the RHS subquery. That means using an $unwind followed by a $group.

The $group stage collects values into an array using $addToSet. Since we use _id: null, this results in a single grouped array. However, because we're aggregating arrays from multiple documents, the result is a list of lists.

This pipeline was originally copied or adapted from ArrayField, because the structure of EmbeddedModelArrayField on the RHS behaves similarly to ArrayField.

return [
{
"$facet": {
"group": [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"group": [
"group": [

Let's name it something else. Since group is an existing predicate, it could get confusing. How about array_agg_query_mixin. Just be as explicit as that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 IMO naming with some with meaning in the context of the pipeline is better. I am thinking in something like gathered_values, or concatenated_values.

Comment on lines 320 to 352
# This test would be useful to have, but it cannot be implemented
# due to the current annotation handling.
# Supporting slicing or indexing over an annotated
# Embedded Array Field would require a refactor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a marked limitation?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query is really quite obscure, so I don't think it needs to be documented, if that's what you meant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 , yes. it is obscure. Maybe the behave isn't expected nor documented. It tries to show that an annotated embeddedmodelArrayField should be handled very similar to an Arrayfield.
BTW, I think the annotation should be reworked in the searchVector. So, we could return to this if we have time and we want to have this kind of functionality.

@WaVEV WaVEV force-pushed the arrayfield+emf-subquerying branch from b4eb3c6 to b77a335 Compare June 13, 2025 01:34
@timgraham timgraham changed the title EmbeddedModelArrayField Subquerying Add subquery support to EmbeddedModelArrayField Jun 18, 2025
@timgraham timgraham changed the title Add subquery support to EmbeddedModelArrayField Add subquery support for EmbeddedModelArrayField Jun 18, 2025
@timgraham timgraham force-pushed the arrayfield+emf-subquerying branch from b77a335 to 783aa8a Compare June 18, 2025 01:37
@timgraham timgraham dismissed Jibola’s stale review June 18, 2025 19:11

changes addressed

@timgraham timgraham merged commit b8efc93 into mongodb:main Jun 18, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants