[SPARK-49783][YARN] Fix resource leak of yarn allocator

zuston · dongjoon-hyun · commit 0467aca97120 · 2025-01-06T20:00:08.000-08:00
### What changes were proposed in this pull request?

Fix a resource leak in the YARN allocator.

### Why are the changes needed?

When the target executor count is lower than the number of running containers, the containers assigned by the ResourceManager are skipped, but they are not released by invoking `amClient.releaseAssignedContainer`. The skipped containers therefore stay reserved in the YARN ResourceManager for at least 10 minutes, which wastes a large share of cluster resources. As a result, the vcore * seconds statistic reported by YARN is greater than the value derived from the Spark event logs. By my measurements, the cluster resource waste ratio is ~25% when Spark jobs are the only workload on the cluster.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

In our internal Hadoop cluster.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48238 from zuston/patch-1.

Authored-by: Junfan Zhang <zuston@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
@@ -820,6 +820,7 @@ private[yarn] class YarnAllocator(
         logInfo(log"Skip launching executorRunnable as running executors count: " +
           log"${MDC(LogKeys.COUNT, rpRunningExecs)} reached target executors count: " +
           log"${MDC(LogKeys.NUM_EXECUTOR_TARGET, getOrUpdateTargetNumExecutorsForRPId(rpId))}.")
+        internalReleaseContainer(container)
       }
     }
   }
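
To illustrate the effect of the added `internalReleaseContainer(container)` call, here is a minimal, self-contained sketch of the skip-and-release pattern. It is a toy model and not the actual `YarnAllocator` code: `Container`, `FakeAmClient`, `ToyAllocator`, and `LeakDemo` are hypothetical names introduced only for this example, and the real allocator tracks targets per resource profile rather than with a single number.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a container granted by the ResourceManager.
final case class Container(id: Int)

// Hypothetical stand-in for the AM client: it records which containers
// the ResourceManager still considers held by this application.
final class FakeAmClient {
  private val held = mutable.Set.empty[Int]
  def assign(c: Container): Unit = held += c.id
  def releaseAssignedContainer(c: Container): Unit = held -= c.id
  def heldCount: Int = held.size
}

// Toy allocator (assumption: one global target instead of per-profile targets).
final class ToyAllocator(client: FakeAmClient, target: Int, releaseSkipped: Boolean) {
  private val running = mutable.Set.empty[Int]

  def handleAllocated(containers: Seq[Container]): Unit =
    containers.foreach { c =>
      client.assign(c)
      if (running.size < target) {
        // Below target: launch an executor in this container.
        running += c.id
      } else if (releaseSkipped) {
        // At target: skip launching and hand the container back.
        client.releaseAssignedContainer(c)
      }
      // With releaseSkipped = false the skipped container stays held but idle.
    }

  // Containers that are held by the application but run no executor.
  def idleHeld: Int = client.heldCount - running.size
}

object LeakDemo {
  def idleAfter(releaseSkipped: Boolean): Int = {
    val client = new FakeAmClient
    val allocator = new ToyAllocator(client, target = 2, releaseSkipped = releaseSkipped)
    // The ResourceManager grants five containers while only two are wanted.
    allocator.handleAllocated((1 to 5).map(Container(_)))
    allocator.idleHeld
  }

  def main(args: Array[String]): Unit = {
    println(s"idle containers without release: ${idleAfter(releaseSkipped = false)}")
    println(s"idle containers with release:    ${idleAfter(releaseSkipped = true)}")
  }
}
```

In this model, the three surplus containers remain held and idle when the release step is missing, and none remain once skipped containers are released, which is the behavior the one-line change aims for.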
