Skip to content

[Persistent Tasks] Assign based on ProjectId #130391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

prwhelan
Copy link
Member

@prwhelan prwhelan commented Jul 1, 2025

Pass the ProjectId from PersistentTaskClusterService through to all PersistentTasksExecutors when creating node assignments.

These PersistentTasksExecutors require the ProjectId during node assignment:

  • OpenJobPersistentTasksExecutor
  • SnapshotUpgradeTaskExecutor
  • StartDatafeedPersistentTasksExecutor
  • TransformPersistentTasksExecutor

Pass the ProjectId from PersistentTaskClusterService through to all
PersistentTasksExecutors when creating node assignments.

These PersistentTasksExecutors require the ProjectId during node
assignment:
- OpenJobPersistentTasksExecutor
- SnapshotUpgradeTaskExecutor
- StartDatafeedPersistentTasksExecutor
- TransformPersistentTasksExecutor

Implemented TransformPersistentTasksExecutor's getAssignment using the
ProjectId by reading the TransformMetadata and RoutingTable from the
ProjectMetadata.
@prwhelan prwhelan added >refactoring :ml Machine learning Team:ML Meta label for the ML team v9.2.0 labels Jul 1, 2025
@prwhelan prwhelan changed the title [Transform] Assign based on ProjectId [Persistent Tasks] Assign based on ProjectId Jul 3, 2025
@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Jul 3, 2025
@prwhelan prwhelan marked this pull request as ready for review July 3, 2025 17:56
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Comment on lines -139 to +142
ClusterState clusterState
ClusterState clusterState,
@Nullable ProjectId projectId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably want to split this method up into a cluster-scoped version and a project-scoped version. The health node persistent task is cluster-scoped, so it doesn't really make sense to have a project ID here or in other cluster-scoped persistent tasks (even though it's nullable). Let me know if "cluster-scoped vs. project-scoped persistent tasks" sound unfamiliar to you, then I (or Yang) can explain what they are. But I'll let @ywangd decide whether he agrees or whether he's fine with the nullable project ID like this.

If we decide to split it up, we can one method without a project ID (for cluster-scoped tasks) and one with a ProjectState (instead of a ClusterState and ProjectId). We created ProjectState to avoid passing cluster states together with project IDs.

Copy link
Member Author

@prwhelan prwhelan Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to rework this, I'm not thrilled about passing nulls around. I'd be happy to instead add a method to the parent, something like:

public Assignment getAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ProjectState projectState) {

and then PersistentTasksClusterService can call the ProjectState or ClusterState API depending on the scope of the persistent task? I'm not sure if that is cleaner for the persistent task framework at the base level but it feels cleaner for the implementations.

99% of this PR was written by IntelliJ's refactor button so we'd only be throwing away minutes of work =)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, something like that is what I had in mind as well (perhaps with a different name for clarity, i.e. getProjectScopedAssignment and getClusterScopedAssignment, but that wouldn't be a blocker for me). Curious to hear what Yang thinks of all this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can go with two separate methods. The PersistentTasksExecutor#scope method can be used to tell the scope of the executor and subsequently call the relevant getXxxAssignment method.

Theoretically we can have a single overriden generic method for project and cluster scoped task executors, if ProjectState and ClusterState shares some interface, e.g. Supplier<ClusterState>. It should help reducing verbosity of the types. We will still need to check the task executor types and pass either ClusterState or ProjectState to the method accordingly. This might be something worth doing in future since it feels like a better type system. But it is definitely outside of this PR.

@prwhelan prwhelan changed the title [Persistent Tasks] Assign based on ProjectId [Persistent Tasks] Assign based on ProjectState Jul 10, 2025
@ywangd
Copy link
Member

ywangd commented Jul 11, 2025

I am very sorry to say this. But I changed my mind after seeing the changes for two separate methods. I now prefer the original approach with the nullable ProjectId parameter.

The main issue with two separate methods is that we have no good way to ensure subclasses implement the right method. That is, if a project scoped task executor implements getClusterScopedAssignment, it is going to be silently skipped and no good way to detect this error. A subclass can still do the wrong thing if we have a single method with nullable ProjectId. But at least the wrong method will always be excercised and error is likely easier to spot. It is not a big concern right now since the methods are called in only a few places. But the usages can expand and so can the concern.

The other good thing is that we can add meaningingful assertions in the superclass like the follows since the executor itself knows its scope:

    // final so not overridable and enforces the assertion
    public final Assignment getAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ClusterState clusterState, @Nullable ProjectId projectId) {
        assert (scope() == Scope.PROJECT && projectId != null) 
            || (scope() == Scope.CLUSTER && projectId == null);
        doGetAssignment(params, candidateNodes, clusterState, projectId);
    }

    // Overridable by subclasses
    public Assignment doGetAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ClusterState clusterState, @Nullable ProjectId projectId) {
        ...
    }

In other places, e.g. PersistentTasksClusterService, we did prefer having separate project and cluster scoped methods. But those methods are not meant to be overidden so that the service class itself has tight control and incorrect usage will be detected due to mismatching executor scope and project context. I think the situation here is different: We have a single superclass, but with two separate base methods, it pretends to be two separate super-classes when it comes to overriding. So I think it is better to stick with one method. In future, if we introduce different base classes for different executor scope as commented earlier, that would be a better time to get rid of the nullable ProjectId.

Apologies for the back and forth. I am happy to hear your thoughts. If we do agree to change it back, I hope it is not too much other than reverting the this commit (6bceb18). Thank you! 🙏

@prwhelan
Copy link
Member Author

I am very sorry to say this. But I changed my mind after seeing the changes for two separate methods. I now prefer the original approach with the nullable ProjectId parameter.

The main issue with two separate methods is that we have no good way to ensure subclasses implement the right method. That is, if a project scoped task executor implements getClusterScopedAssignment, it is going to be silently skipped and no good way to detect this error. A subclass can still do the wrong thing if we have a single method with nullable ProjectId. But at least the wrong method will always be excercised and error is likely easier to spot. It is not a big concern right now since the methods are called in only a few places. But the usages can expand and so can the concern.

The other good thing is that we can add meaningingful assertions in the superclass like the follows since the executor itself knows its scope:

    // final so not overridable and enforces the assertion
    public final Assignment getAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ClusterState clusterState, @Nullable ProjectId projectId) {
        assert (scope() == Scope.PROJECT && projectId != null) 
            || (scope() == Scope.CLUSTER && projectId == null);
        doGetAssignment(params, candidateNodes, clusterState, projectId);
    }

    // Overridable by subclasses
    public Assignment doGetAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ClusterState clusterState, @Nullable ProjectId projectId) {
        ...
    }

In other places, e.g. PersistentTasksClusterService, we did prefer having separate project and cluster scoped methods. But those methods are not meant to be overidden so that the service class itself has tight control and incorrect usage will be detected due to mismatching executor scope and project context. I think the situation here is different: We have a single superclass, but with two separate base methods, it pretends to be two separate super-classes when it comes to overriding. So I think it is better to stick with one method. In future, if we introduce different base classes for different executor scope as commented earlier, that would be a better time to get rid of the nullable ProjectId.

Apologies for the back and forth. I am happy to hear your thoughts. If we do agree to change it back, I hope it is not too much other than reverting the this commit (6bceb18). Thank you! 🙏

Yeah that makes sense to me - I briefly tried having an intermediary subclass of PersistentTasksExecutor that replaces getAssignment with the ProjectScope variant, but I thought that may create too many layers of abstraction. Basically:

class PersistentTasksExecutor {
   // final so not overridable and enforces the assertion
   public final Assignment getAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ClusterState clusterState, @Nullable ProjectId projectId) {
       assert (scope() == Scope.PROJECT && projectId != null) 
           || (scope() == Scope.CLUSTER && projectId == null);
       doGetAssignment(params, candidateNodes, clusterState, projectId);
   }

   // Overridable by subclasses
   public Assignment doGetAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ClusterState clusterState, @Nullable ProjectId projectId) {
       ...
   }
}

class ProjectScopedPersistentTasksExecutor extends PersistentTasksExecutor {
   public final Assignment doGetAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ClusterState clusterState, @Nullable ProjectId projectId) {
       assert (scope() == Scope.PROJECT && projectId != null);
       return doGetAssignment(params, candidateNodes, clusterState.projectState(projectId)
   }

   public abstract Assignment doGetAssignment(Params params, Collection<DiscoveryNode> candidateNodes, ProjectState projectState);
}

@prwhelan prwhelan changed the title [Persistent Tasks] Assign based on ProjectState [Persistent Tasks] Assign based on ProjectId Jul 11, 2025
Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks a lot for the iterations! 👍

* If {@link #scope()} returns CLUSTER, then {@link ProjectId} will be null.
* If {@link #scope()} returns PROJECT, then {@link ProjectId} will not be null.
*/
public Assignment doGetAssignment(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make this method and all its overridden versions protected. It is meant to be overridden and called by only the subclasses. External callers should use getAssignment which is final and enforces the consistency check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and all its overridden versions

I think you still need to update the overridden versions; you only updated the base method.

Copy link
Contributor

@nielsbauman nielsbauman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one comment on an existing thread, other than that LGTM. Thanks a lot for the iterations, @prwhelan!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >refactoring serverless-linked Added by automation, don't add manually Team:ML Meta label for the ML team v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants