
[SPARK-51136][CORE] Set CallerContext for History Server #49858

Closed · wants to merge 1 commit

Conversation

Contributor

@cnauroth cnauroth commented Feb 9, 2025

### What changes were proposed in this pull request?

Initialize the Hadoop RPC `CallerContext` during History Server startup, before `FileSystem` access. Calls to HDFS will get tagged in the audit log as originating from the History Server.
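As a rough self-contained sketch of the mechanism (illustrative only: `CallerContextSketch` is a hypothetical stand-in for Hadoop's thread-local `org.apache.hadoop.ipc.CallerContext`, which is what Spark actually sets):

```scala
// Hypothetical stand-in for Hadoop's CallerContext, which keeps the
// current context in a thread-local on the RPC client side.
object CallerContextSketch {
  private val current = new ThreadLocal[String]

  // Spark prefixes the supplied context with "SPARK_" before setting it.
  def setCurrentContext(context: String): Unit =
    current.set(s"SPARK_$context")

  def getCurrent: String = current.get
}

// What this PR does conceptually: register the context once at History
// Server startup, before any FileSystem access, so every subsequent HDFS
// call from this process carries the tag in the NameNode audit log.
CallerContextSketch.setCurrentContext("HISTORY")
println(CallerContextSketch.getCurrent) // SPARK_HISTORY
```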

### Why are the changes needed?

Other YARN-based Spark processes set the `CallerContext`, so that additional auditing context propagates in Hadoop RPC calls. This PR provides auditing context for calls from the History Server. Other callers provide additional information like app ID, attempt ID, etc.; we don't provide that here, because the History Server serves multiple apps/attempts.

### Does this PR introduce _any_ user-facing change?

Yes. In environments that configure `hadoop.caller.context.enabled=true`, users will now see additional information in the HDFS audit logs explicitly stating that calls originated from the History Server.
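For context, this Hadoop property is enabled in the cluster configuration, typically `core-site.xml` (the property name is Hadoop's; the snippet below is just an example):

```xml
<property>
  <name>hadoop.caller.context.enabled</name>
  <value>true</value>
</property>
```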

### How was this patch tested?

A new unit test has been added. All tests pass in the history package.

```
build/mvn -pl core test -Dtest=none -DmembersOnlySuites=org.apache.spark.deploy.history
```

When the changes are deployed to a running cluster, the new caller context is visible in the HDFS audit logs.

```
2025-02-07 23:00:54,657 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0012	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:00:54,683 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0011	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:00:54,699 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0011	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:00:54,715 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0010	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:00:54,729 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0010	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:00:54,743 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0009	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:00:54,755 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0009	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:00:54,767 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0008	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:00:54,779 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=open	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history/application_1738779819434_0008	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
2025-02-07 23:01:04,160 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true	ugi=spark (auth:SIMPLE)	ip=/10.240.5.205	cmd=listStatus	src=/133bcb94-52b8-4356-ad9b-7358c78ce7fd/spark-job-history	dst=null	perm=null	proto=rpc	callerContext=SPARK_HISTORY
```

### Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Feb 9, 2025
```
@@ -3151,9 +3152,31 @@ private[spark] object Utils
  }
}

private[util] object CallerContext extends Logging {
```
Contributor Author

I needed to relax visibility on things here to facilitate unit testing with caller context enabled. LMK if a different approach is preferred (reflection?).

Member

This change looks okay to me.

@cnauroth
Contributor Author

cnauroth commented Feb 9, 2025

If approved, can this also go into branch-3.5 please? The cherry-pick would need a minor merge conflict resolution in FsHistoryProvider import statements, or I can send a separate pull request.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-51136][HISTORYSERVER] Set CallerContext for History Server [SPARK-51136][CORE] Set CallerContext for History Server Feb 10, 2025
```scala
    SparkHadoopUtil.get.conf.getBoolean("hadoop.caller.context.enabled", false)
private[spark] object CallerContext extends Logging {
  var callerContextEnabled: Boolean = SparkHadoopUtil.get.conf.getBoolean(
    HADOOP_CALLER_CONTEXT_ENABLED_KEY, HADOOP_CALLER_CONTEXT_ENABLED_DEFAULT)
```
Member

If you don't mind, please revert this change. :)

Contributor Author

No problem, I can definitely do that. :)

Can you please help me understand the reasoning though? Does the Spark codebase prefer not to reference Hadoop's configuration constants?

Member

Yes, right. We prefer to avoid those compilation dependencies, not only for Hadoop but also for Hive.

Contributor Author

Thank you, and I will keep it in mind in future patches.

```scala
 *
 * VisibleForTesting
 */
def withCallerContextEnabled[T](enabled: Boolean)(func: => T): T = {
```
Member

This looks like a pure test-helper utility rather than something `VisibleForTesting`. Is there any benefit to main code?

Contributor Author

There is no benefit to main code, other than keeping all access to `callerContextEnabled` in the same object. Otherwise, people reading the code might be confused about why it's a `var` instead of a `val`.

If you prefer, I can just leave a comment explaining why it's a `var` and move this to a new `CallerContextTestUtils` object under test (or some existing test utils file if you have another suggestion).

I would potentially like to reuse this in tests covering other usage of caller context, which is why I didn't put it directly in `FsHistoryProviderSuite`.
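For readers following the thread, a save/override/restore helper of this shape might look like the following self-contained sketch (with a hypothetical `CallerContextFlag` object standing in for Spark's `CallerContext`, whose `callerContextEnabled` the PR temporarily made a `var`):

```scala
object CallerContextFlag {
  // In Spark this is initialized from the Hadoop configuration; it is a
  // var here so tests can temporarily override it.
  var callerContextEnabled: Boolean = false
}

// Temporarily override the flag, run the test body, and restore the
// original value even if the body throws.
def withCallerContextEnabled[T](enabled: Boolean)(func: => T): T = {
  val original = CallerContextFlag.callerContextEnabled
  CallerContextFlag.callerContextEnabled = enabled
  try {
    func
  } finally {
    CallerContextFlag.callerContextEnabled = original
  }
}
```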

Member

@dongjoon-hyun dongjoon-hyun left a comment

Could you confirm whether this PR description's claim holds for non-YARN Spark daemons like Spark Master, Spark Worker, Spark ThriftServer, and so on? Or is it only referring to YARN ApplicationMaster and Client-related code?

> Other Spark processes set the CallerContext, so that additional auditing context propagates in Hadoop RPC calls.

@cnauroth
Contributor Author

cnauroth commented Feb 10, 2025

> Could you confirm whether this PR description's claim holds for non-YARN Spark daemons like Spark Master, Spark Worker, Spark ThriftServer, and so on? Or is it only referring to YARN ApplicationMaster and Client-related code?
>
> Other Spark processes set the CallerContext, so that additional auditing context propagates in Hadoop RPC calls.

@dongjoon-hyun , thank you for the review. That's a good point. I have only confirmed existing support for YARN-based workloads, so I will update the PR description.

We can continue to check support in the other processes.

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 10, 2025

Thank you for the replies.

BTW, regarding the following question: we cannot backport SPARK-51136 as-is, because it's filed as an Improvement.

> If approved, can this also go into branch-3.5 please? The cherry-pick would need a minor merge conflict resolution in FsHistoryProvider import statements, or I can send a separate pull request.

[Screenshot: the SPARK-51136 JIRA issue, filed with type Improvement]

If you want this in Apache Spark 3.5.5, please convert the JIRA from Improvement to Bug and describe the rationale for why this is a bug fix. Then I can help you.

@cnauroth
Contributor Author

@dongjoon-hyun , regarding backport to 3.5, I don't think I can justify calling it a bug. It's providing an improvement, not fixing existing functionality that has been broken. I'll retract my request for the backport.

I pushed up a change to stop referencing the Hadoop constants. LMK your preference on placement of the test helper. Happy to push up another change.

#49858 (comment)

@dongjoon-hyun
Member

dongjoon-hyun commented Feb 11, 2025

BTW, I'm investigating the relevant code at this chance while reviewing this PR, because the existing test coverage also looks suspicious due to `val callerContextEnabled`. The code of this PR itself looks okay, but let me figure out the best way to test Spark's Hadoop caller context support. Sorry for the delay.

```scala
test("Set Spark CallerContext") {
  val context = "test"
  new CallerContext(context).setCurrentContext()
  if (CallerContext.callerContextEnabled) {
    assert(s"SPARK_$context" === HadoopCallerContext.getCurrent.toString)
  }
}
```

@cnauroth
Contributor Author

> BTW, I'm investigating the relevant code at this chance while reviewing this PR, because the existing test coverage also looks suspicious due to `val callerContextEnabled`. The code of this PR itself looks okay, but let me figure out the best way to test Spark's Hadoop caller context support. Sorry for the delay.

```scala
test("Set Spark CallerContext") {
  val context = "test"
  new CallerContext(context).setCurrentContext()
  if (CallerContext.callerContextEnabled) {
    assert(s"SPARK_$context" === HadoopCallerContext.getCurrent.toString)
  }
}
```

If coverage looks suspicious, it might be that the flag is only initialized once per JVM startup, and test runs probably don't have `hadoop.caller.context.enabled=true` configured. Steve hinted at this in #49779. My test helper tries to work around this (with the downside of compromising the visibility and mutability of the flag).
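The once-per-JVM pitfall described above can be illustrated with a tiny sketch (hypothetical `Config` and `Flag` objects, not Spark code): a `val` in a Scala `object` is evaluated once, when the object is first referenced, so flipping the setting later in the same JVM has no effect on the flag.

```scala
import scala.collection.mutable

object Config {
  // Stand-in for the Hadoop Configuration the flag is read from.
  val settings: mutable.Map[String, String] =
    mutable.Map("hadoop.caller.context.enabled" -> "false")
}

object Flag {
  // Evaluated exactly once, when Flag is first referenced.
  val callerContextEnabled: Boolean =
    Config.settings("hadoop.caller.context.enabled").toBoolean
}

val before = Flag.callerContextEnabled                     // forces initialization
Config.settings("hadoop.caller.context.enabled") = "true"  // too late to matter
val after = Flag.callerContextEnabled                      // still the cached value
```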

No worries on the delay. Committers are busy! 😄

Member

@dongjoon-hyun dongjoon-hyun left a comment

I made a PR to always enable `hadoop.caller.context.enabled` during testing, @cnauroth.

@dongjoon-hyun
Member

Could you rebase this PR to the master branch, @cnauroth ?

@cnauroth
Contributor Author

@dongjoon-hyun , I rebased to current master and removed my test helper. Thank you.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. (Pending CIs).

@dongjoon-hyun
Member

All core module tests passed. Merged to master for Apache Spark 4.1.0.

@cnauroth
Contributor Author

Great, really appreciate your efforts on testing strategy for this change! Thank you, @dongjoon-hyun .
