[SPARK-51072][CORE] CallerContext to set Hadoop cloud audit context #49779

Conversation

steveloughran (Contributor):

What changes were proposed in this pull request?

When enabled, the cloud store client audit context is set to the same context string as the Hadoop IPC caller context.

Why are the changes needed?

CallerContext adds information about the Spark task to the Hadoop IPC context, which is then recorded in HDFS, YARN and HBase server logs.

It is also possible to update the cloud storage "audit context". Storage clients can attach this audit information to their requests so that it is stored in the service's own logs, where it can be retrieved, parsed and used for analysis.

This is currently supported by the S3A connector, which adds the information to a synthetic HTTP referrer header that is then stored in the S3 server logs (not CloudTrail, sadly).

See [S3A Auditing](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/auditing.html).
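
For illustration, a minimal sketch of the idea, assuming a Hadoop release recent enough to ship `org.apache.hadoop.fs.audit.CommonAuditContext` (3.3.x and later); the `setBothContexts` helper and the comments are illustrative only, not the actual Spark code:

```scala
import org.apache.hadoop.fs.audit.CommonAuditContext.currentAuditContext
import org.apache.hadoop.ipc.{CallerContext => HadoopCallerContext}
import org.apache.hadoop.ipc.CallerContext.{Builder => HadoopCallerContextBuilder}

// Illustrative helper: push one context string into both audit channels.
// In Spark, the string is assembled by org.apache.spark.util.CallerContext
// from the app/attempt/job/stage/task identifiers.
def setBothContexts(context: String): Unit = {
  // Hadoop IPC caller context: propagated to HDFS, YARN and HBase audit logs.
  HadoopCallerContext.setCurrent(new HadoopCallerContextBuilder(context).build())
  // Shared cloud-store audit context: connectors such as S3A attach these
  // entries to each request, e.g. inside the synthetic referrer header.
  currentAuditContext.put("spark", context)
}
```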

Does this PR introduce any user-facing change?

If enabled, it adds extra entries to cloud storage server logs through cloud storage clients which support this.

How was this patch tested?

Expanded the existing test "Set Spark CallerContext" to verify that the passed-down parameters are fully propagated to both the caller and audit contexts. This required extracting the functional code of CallerContext.setCurrentContext into a @VisibleForTesting private[util] method setCurrentContext(Boolean).

Without this, the test suite only ran if the process had been launched with the configuration option "hadoop.caller.context.enabled" set to true. That is not the default, so the existing test suite code was probably never executed.
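
A rough sketch of the kind of assertion this change enables; the suite name and constructor arguments below are illustrative (the PR's actual test body may differ), and the `get()` accessor on CommonAuditContext is assumed:

```scala
package org.apache.spark.util // CallerContext is private[spark]

import org.apache.hadoop.fs.audit.CommonAuditContext.currentAuditContext
import org.apache.hadoop.ipc.{CallerContext => HadoopCallerContext}
import org.apache.spark.SparkFunSuite

class CallerContextAuditSketchSuite extends SparkFunSuite {
  test("Set Spark CallerContext") {
    val ctx = new CallerContext("TASK", appId = Some("application_123"))
    // The extracted private[util] method bypasses the
    // hadoop.caller.context.enabled check, so the test always runs.
    ctx.setCurrentContext(true)

    // The same information must reach the Hadoop IPC caller context...
    assert(HadoopCallerContext.getCurrent.getContext.contains("application_123"))
    // ...and the cloud-store audit context, under the "spark" key.
    val sparkEntry = currentAuditContext.get("spark")
    assert(sparkEntry != null && sparkEntry.contains("application_123"))
  }
}
```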

Was this patch authored or co-authored using generative AI tooling?

No

steveloughran marked this pull request as draft February 3, 2025 19:19
github-actions bot added the CORE label Feb 3, 2025
The reviewed diff hunk:

```scala
val hdfsContext = new HadoopCallerContextBuilder(context).build()
HadoopCallerContext.setCurrent(hdfsContext)
// audit context as passed down to object stores, use prefix "spark"
currentAuditContext.put("spark", context)
```
Member:
Thank you for making a PR. This one line seems to be the actual change. Did I understand correctly?

steveloughran (author):
Yes! The rest of it is test related.

dongjoon-hyun (Member) left a comment:
According to the PR title, do you mean the S3 Audit Context feature has been broken until now? Otherwise, could you revise the PR title by narrowing down the scope more specifically?

[SPARK-51072][CORE] CallerContext to set Hadoop cloud audit context

cnauroth (Contributor) left a comment:
+1 (non-binding). This will be useful. Thanks, @steveloughran !

cnauroth (Contributor) commented Feb 3, 2025:

> According to the PR title, do you mean the S3 Audit Context feature has been broken until now? Otherwise, could you revise the PR title by narrowing down the scope more specifically?
>
> [SPARK-51072][CORE] CallerContext to set Hadoop cloud audit context

From the perspective of Spark or other clients of the file system, they are interacting with a general auditing feature defined in the Hadoop Common module. In theory, multiple file systems could implement support for actually recording this audit information. AFAIK, only S3A implements it right now. (We don't have it in the GCS file system.) Other file systems could eventually choose to implement it though.
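
Since the audit context lives in Hadoop Common, a connector that wanted to pick it up would only need to read the shared map at request time. A hypothetical sketch of such a consumer (the header name and the `get()` accessor are assumptions here, and the real per-request plumbing is connector-specific):

```scala
import org.apache.hadoop.fs.audit.CommonAuditContext.currentAuditContext

// Hypothetical connector-side hook: copy the "spark" audit entry into
// whatever per-request metadata the store supports, e.g. a custom header.
// S3A does the equivalent via its synthetic HTTP referrer header.
def sparkAuditHeader(): Map[String, String] =
  Option(currentAuditContext.get("spark"))
    .map(ctx => Map("x-example-audit-spark" -> ctx))
    .getOrElse(Map.empty)
```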

dongjoon-hyun (Member) commented Feb 3, 2025:

Ya, IIUC, without this PR, S3A audit has been working already as designed, hasn't it?

steveloughran (author):
@dongjoon-hyun the audit context has been working properly, but the Spark info was not wired up. Other things get in (process ID, UGI, filesystem ID, the underlying operation for a sequence of requests), but the actual Spark job did not. Once an S3A or manifest committer was started, it would set the app and job ID values, but committers only get involved during the write phase, and they did not carry the full Spark context info.

steveloughran (author):
@cnauroth well, you should, if you can get it anywhere into your logs, possibly as a new HTTP header. The S3A filesystem attaches it as an HTTP referrer because that is the sole entry, other than the User-Agent, which goes into the standard S3 logs, and other things like to set that User-Agent field.

dongjoon-hyun (Member):
Got it. Please let me know when the PR is ready, @steveloughran .

steveloughran (author):
I can't think of any changes, unless we want to set that audit stuff even if caller context is not being set.
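
For reference, that alternative would look roughly like the sketch below; a hypothetical variant only, not what the merged PR does (there, both contexts are gated together on the caller-context setting):

```scala
import org.apache.hadoop.fs.audit.CommonAuditContext.currentAuditContext
import org.apache.hadoop.ipc.{CallerContext => HadoopCallerContext}
import org.apache.hadoop.ipc.CallerContext.{Builder => HadoopCallerContextBuilder}

// Hypothetical: always publish the cloud-store audit entry, and only gate
// the Hadoop IPC caller context on hadoop.caller.context.enabled.
def setCurrentContextAlways(context: String, ipcContextEnabled: Boolean): Unit = {
  currentAuditContext.put("spark", context)
  if (ipcContextEnabled) {
    HadoopCallerContext.setCurrent(new HadoopCallerContextBuilder(context).build())
  }
}
```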

steveloughran force-pushed the SPARK-51072-caller-context-auditing branch from 5ebeb70 to 7671790 on February 17, 2025 15:49
steveloughran marked this pull request as ready for review February 17, 2025 17:03
cnauroth (Contributor) left a comment:
Once again putting a +1 (non-binding) on this, now that it has been updated to be consistent with the testing strategy from #49893 and #49898. Thanks again, @steveloughran .

dongjoon-hyun (Member) left a comment:
+1, LGTM. Thank you, @steveloughran and @cnauroth .

Merged to master for Apache Spark 4.1.0.

Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
anoopj pushed a commit to anoopj/spark that referenced this pull request Mar 15, 2025