
Standardize executor limit enforcement for dynamic and static allocation #426

Merged 2 commits into oap-project:master on Apr 9, 2025

Conversation

@MehulBatra (Contributor) commented Apr 3, 2025

There's a bug in how the executor-restart logic enforces the maxExecutors boundary.
In the dynamic allocation branch, it only checks restartedExecutors:

if (restartedExecutors.size >= maxExecutor) {
  return
}

But this only counts the restarted executors, not the total number of executors. The static allocation branch handles this correctly by checking the combined total:


if ((appInfo.executors.size + restartedExecutors.size) >= executorInstances) {
  return
}

The fix would be to modify the dynamic allocation condition to match the static allocation approach:


if (dynamicAllocationEnabled) {
  val maxExecutor = conf.getInt("spark.dynamicAllocation.maxExecutors", 0)
  if ((appInfo.executors.size + restartedExecutors.size) >= maxExecutor) {
    return
  }
}

This bug explains exactly why, with maxExecutors set to 3 but 4 pods/machines available, the code keeps creating executors until it hits the physical limit rather than respecting the configured maximum.
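
To make the overshoot concrete, here is a minimal, self-contained sketch (not repository code; the counts below are hypothetical stand-ins for appInfo.executors.size and restartedExecutors.size) showing how the buggy check still allows a fourth executor while the fixed check stops at the configured maximum:

// Minimal sketch only: the counts are hypothetical, standing in for
// appInfo.executors.size and restartedExecutors.size in the real code.
object MaxExecutorCheckDemo {
  def main(args: Array[String]): Unit = {
    val maxExecutors = 3        // spark.dynamicAllocation.maxExecutors
    val runningExecutors = 3    // executors already tracked for the app
    val restartedExecutors = 0  // executors restarted by the recovery logic

    // Buggy guard: only restarted executors are counted, so this still
    // allows another executor even though 3 are already running.
    val buggyAllowsAnother = restartedExecutors < maxExecutors

    // Fixed guard: running and restarted executors are counted together,
    // so the configured maximum of 3 is respected.
    val fixedAllowsAnother = (runningExecutors + restartedExecutors) < maxExecutors

    println(s"buggy check would start another executor: $buggyAllowsAnother") // true
    println(s"fixed check would start another executor: $fixedAllowsAnother") // false
  }
}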

Static Allocation:
This handles executor scaling when Spark's dynamic allocation is disabled:

else {
  val executorInstances = conf.getInt("spark.executor.instances", 0)
  if (executorInstances != 0) {
    if ((appInfo.executors.size + restartedExecutors.size) >= executorInstances) {
      return
    }
  }
}
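
For clarity, the two branches can be read together as the single guard sketched below. This is only an illustration under assumed names (shouldStopRestartingExecutors is a hypothetical helper, the dynamic-allocation flag is assumed to come from spark.dynamicAllocation.enabled, and the counts are passed as plain Ints rather than taken from appInfo/restartedExecutors); it is not the exact method in the repository:

import org.apache.spark.SparkConf

// Hypothetical helper illustrating the combined dynamic/static check.
// Returns true when no more executors should be created.
object ExecutorLimitGuard {
  def shouldStopRestartingExecutors(
      conf: SparkConf,
      runningExecutorCount: Int,
      restartedExecutorCount: Int): Boolean = {
    val total = runningExecutorCount + restartedExecutorCount
    if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
      // Dynamic allocation: cap at spark.dynamicAllocation.maxExecutors.
      // The != 0 check is a defensive addition for the unset/default case;
      // the snippet quoted above does not include it.
      val maxExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", 0)
      maxExecutors != 0 && total >= maxExecutors
    } else {
      // Static allocation: cap at spark.executor.instances, as in the
      // branch quoted above.
      val executorInstances = conf.getInt("spark.executor.instances", 0)
      executorInstances != 0 && total >= executorInstances
    }
  }
}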

@pang-wu (Contributor) commented Apr 7, 2025

@MehulBatra the build failed due to lint, do you mind fixing that?

@MehulBatra (Contributor, Author)

> @MehulBatra the build failed due to lint, do you mind fixing that?

Linting is fixed. The remaining failure is a port conflict: despite setting include_dashboard=False in the ray.init() call, Ray is still trying to allocate ports for dashboard-related components and assigns the same port (52365) to both dashboard_agent_grpc and dashboard_agent_http. Could you help me with this, @pang-wu?

@pang-wu (Contributor) commented Apr 7, 2025

@carsonwang can you help here -- I don't have access to the GitHub Actions jobs.

@carsonwang (Collaborator)

Restarted the test.

@pang-wu (Contributor) left a comment

lgtm

@carsonwang (Collaborator) left a comment

Thank you all!

@carsonwang merged commit 5c443c0 into oap-project:master on Apr 9, 2025 (16 of 32 checks passed).