
Standardize executor limit enforcement for dynamic and static allocation #426

Merged 2 commits into oap-project:master on Apr 9, 2025

Conversation

@MehulBatra (Contributor) commented Apr 3, 2025

There's a bug in how the executor-restart logic enforces the maxExecutors boundary.
In the dynamic allocation branch, it only checks restartedExecutors:

if (restartedExecutors.size >= maxExecutor) {
  return
}

But this only counts the restarted executors, not the total number of executors. The static allocation branch handles this correctly by checking the combined total:


if ((appInfo.executors.size + restartedExecutors.size) >= executorInstances) {
  return
}

The fix would be to modify the dynamic allocation condition to match the static allocation approach:


if (dynamicAllocationEnabled) {
  val maxExecutor = conf.getInt("spark.dynamicAllocation.maxExecutors", 0)
  if ((appInfo.executors.size + restartedExecutors.size) >= maxExecutor) {
    return
  }
}

This bug explains exactly why, with maxExecutors set to 3 but 4 pods/machines available, the code keeps creating executors until it hits the physical limit rather than respecting the configured maximum.
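
To make the overshoot concrete, here is a minimal, self-contained sketch (not repository code; the counts below are hypothetical stand-ins for appInfo.executors.size and restartedExecutors.size) showing how the buggy check still allows a fourth executor while the fixed check stops at the configured maximum:

// Minimal sketch only: the counts are hypothetical, standing in for
// appInfo.executors.size and restartedExecutors.size in the real code.
object MaxExecutorCheckDemo {
  def main(args: Array[String]): Unit = {
    val maxExecutors = 3        // spark.dynamicAllocation.maxExecutors
    val runningExecutors = 3    // executors already tracked for the app
    val restartedExecutors = 0  // executors restarted by the recovery logic

    // Buggy guard: only restarted executors are counted, so this still
    // allows another executor even though 3 are already running.
    val buggyAllowsAnother = restartedExecutors < maxExecutors

    // Fixed guard: running and restarted executors are counted together,
    // so the configured maximum of 3 is respected.
    val fixedAllowsAnother = (runningExecutors + restartedExecutors) < maxExecutors

    println(s"buggy check would start another executor: $buggyAllowsAnother") // true
    println(s"fixed check would start another executor: $fixedAllowsAnother") // false
  }
}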

Static Allocation:
This handles executor scaling when Spark's dynamic allocation is disabled:

else {
  val executorInstances = conf.getInt("spark.executor.instances", 0)
  if (executorInstances != 0) {
    if ((appInfo.executors.size + restartedExecutors.size) >= executorInstances) {
      return
    }
  }
}
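
For clarity, the two branches can be read together as the single guard sketched below. This is only an illustration under assumed names (shouldStopRestartingExecutors is a hypothetical helper, the dynamic-allocation flag is assumed to come from spark.dynamicAllocation.enabled, and the counts are passed as plain Ints rather than taken from appInfo/restartedExecutors); it is not the exact method in the repository:

import org.apache.spark.SparkConf

// Hypothetical helper illustrating the combined dynamic/static check.
// Returns true when no more executors should be created.
object ExecutorLimitGuard {
  def shouldStopRestartingExecutors(
      conf: SparkConf,
      runningExecutorCount: Int,
      restartedExecutorCount: Int): Boolean = {
    val total = runningExecutorCount + restartedExecutorCount
    if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
      // Dynamic allocation: cap at spark.dynamicAllocation.maxExecutors.
      // The != 0 check is a defensive addition for the unset/default case;
      // the snippet quoted above does not include it.
      val maxExecutors = conf.getInt("spark.dynamicAllocation.maxExecutors", 0)
      maxExecutors != 0 && total >= maxExecutors
    } else {
      // Static allocation: cap at spark.executor.instances, as in the
      // branch quoted above.
      val executorInstances = conf.getInt("spark.executor.instances", 0)
      executorInstances != 0 && total >= executorInstances
    }
  }
}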

@pang-wu (Contributor) commented Apr 7, 2025

@MehulBatra the build failed due to lint, do you mind fixing that?

@MehulBatra (Contributor, Author)

> @MehulBatra the build failed due to lint, do you mind fixing that?

Linting is fixed. The remaining failure is a port conflict: despite setting include_dashboard=False in the ray.init() call, Ray is still trying to allocate ports for dashboard-related components and assigns the same port (52365) to both dashboard_agent_grpc and dashboard_agent_http. Could you help me with this, @pang-wu?

@pang-wu (Contributor) commented Apr 7, 2025

@carsonwang can you help here -- I don't have access to the GitHub Actions jobs.

@carsonwang (Collaborator)

Restarted the test.

@pang-wu (Contributor) left a comment

lgtm

@carsonwang (Collaborator) left a comment

Thank you all!

@carsonwang merged commit 5c443c0 into oap-project:master on Apr 9, 2025 (16 of 32 checks passed).