
[SPARK-51243][CORE][ML] Configurable allow native BLAS #49986

Open · wants to merge 5 commits into master
Conversation

@pan3793 (Member) commented Feb 17, 2025

What changes were proposed in this pull request?

This PR proposes introducing a new configuration, spark.ml.allowNativeBlas. When set to false, Spark always uses Java BLAS, even when a native library (OpenBLAS or MKL) is available on the machine.

Why are the changes needed?

Currently, many places in the Spark codebase are hardcoded to call BLAS.nativeBLAS, which always uses JNI BLAS when a native BLAS is available. This is generally a good idea, but I found some negative cases in our internal ML workloads where Java BLAS is faster than JNI BLAS. This might be caused by bugs in the native library (e.g. MKL or OpenBLAS), or by the library not being optimized for certain hardware. Given that, I think we should allow users to disable native BLAS explicitly.
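(As a diagnostic, one way to check which implementation netlib resolved at runtime — a sketch assuming the dev.ludovic.netlib.blas API; not part of this patch:)

```scala
// Not part of this PR: a quick check of which BLAS implementation netlib picked.
// getInstance prefers a native (JNI) implementation and falls back to pure Java.
import dev.ludovic.netlib.blas.BLAS

object WhichBlas {
  def main(args: Array[String]): Unit =
    println(BLAS.getInstance().getClass.getName)
}
```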

The proposed spark.ml.allowNativeBlas configuration does not strictly follow the Spark configuration system, because SparkConf is not always available at the call sites of BLAS.nativeBLAS.

Does this PR introduce any user-facing change?

This adds a new feature, but the default value keeps the current behavior; the docs are also updated to mention the added conf.

How was this patch tested?

Manual tests; I will supply more context after we reach a consensus on how to expose the configuration to end users.

Was this patch authored or co-authored using generative AI tooling?

No.

@pan3793 (Member, Author) commented Feb 17, 2025

The current approach works with spark-submit:

spark-submit --conf spark.ml.allowNativeBlas=false ...

For cases that create an embedded SparkContext in a Java app, users should modify their Java command options:

java -Dnetlib.allowNativeBlas=false ...

@pan3793 (Member, Author) commented Feb 17, 2025

cc @zhengruifeng @panbingkun, could you please take a look? And do you have a better idea of how to implement the configuration?

@zhengruifeng (Contributor):

I think this PR needs reviews from @srowen @WeichenXu123 and @luhenry

@@ -3436,6 +3437,20 @@ object SparkContext extends Logging {
     supplement(DRIVER_JAVA_OPTIONS)
     supplement(EXECUTOR_JAVA_OPTIONS)
   }
 
+  private def supplementBlasOptions(conf: SparkConf): Unit = {
+    conf.getOption("spark.ml.allowNativeBlas").foreach { allowNativeBlas =>
Contributor:

I am not sure whether we can use an env variable like MKL_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1.

Member (Author):

I skimmed the codebase https://github.com/luhenry/netlib and found neither a sys prop nor an env var to disable native BLAS loading, so I need to introduce a new one. Do you mean that an env var is preferred over a sys prop?

Member:

I guess I mean: does this have to propagate via a sys property at all, versus just being used as a config in the code directly? But it is probably hard to plumb through access to the config object.

Member (Author):

@srowen I replied in #49986 (comment)


private def supplementBlasOptions(conf: SparkConf): Unit = {
  conf.getOption("spark.ml.allowNativeBlas").foreach { allowNativeBlas =>
    def supplement(key: OptionalConfigEntry[String]): Unit = {
Member:

This is repeated so many times now; I wonder if a simple refactor is in order, to have one 'supplement' function?

Member (Author):

Thanks for the suggestion, will do.

Member (Author):

Refactored by extracting a common method, supplementJavaOpts.
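For illustration, such a shared helper might look roughly like this (a sketch: only the name supplementJavaOpts comes from the comment above; the signature and body are assumptions, not the actual patch):

```scala
// Sketch only; assumed signature, not the actual patch. Appends a
// -Dkey=value Java option to an optional *.extraJavaOptions config entry.
private def supplementJavaOpts(
    conf: SparkConf,
    entry: OptionalConfigEntry[String],
    javaOpt: String): Unit = {
  val newOpts = conf.get(entry) match {
    case Some(opts) if opts.contains(javaOpt) => opts // already present, keep as-is
    case Some(opts) => s"$opts $javaOpt"
    case None => javaOpt
  }
  conf.set(entry.key, newOpts)
}
```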

@@ -1049,6 +1049,10 @@ private[spark] class Client(
 
     javaOpts += s"-Djava.net.preferIPv6Addresses=${Utils.preferIPv6}"
 
+    sparkConf.getOption("spark.ml.allowNativeBlas").foreach { allowNativeBlas =>
+      javaOpts += s"-Dspark.ml.allowNativeBlas=$allowNativeBlas"
Member:

Do other resource managers like k8s need this? not sure

Member (Author):

The implementation basically follows how we handle java.net.preferIPv6Addresses; I will test on K8s and report back here.

Member (Author):

> Do other resource managers like k8s need this? not sure

K8s does not need that change.

Spark on YARN

The code appends -Dspark.ml.allowNativeBlas=... to the YARN AM process command; on YARN we have to assemble the full java command ourselves so that the YARN RM knows how to bootstrap the process.

Spark on K8s

  • client mode: there is no driver Pod
  • cluster mode: spark-submit (which carries all Java options from the local spark-submit) runs in the driver Pod

so K8s does not need to append those Java options again.
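Concretely, the YARN AM launch command assembled by the Client would then carry the flag, along these lines (an illustrative command, not captured output):

```
java ... -Dspark.ml.allowNativeBlas=false ... org.apache.spark.deploy.yarn.ApplicationMaster ...
```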

@@ -39,8 +39,11 @@ private[spark] object BLAS extends Serializable {
   // For level-3 routines, we use the native BLAS.
   private[spark] def nativeBLAS: NetlibBLAS = {
     if (_nativeBLAS == null) {
-      _nativeBLAS =
-        try { NetlibNativeBLAS.getInstance } catch { case _: Throwable => javaBLAS }
+      _nativeBLAS = System.getProperty("spark.ml.allowNativeBlas", "true") match {
Member:

This has to be a sys property because of how early it has to be initialized?

@pan3793 (Member, Author) commented Feb 18, 2025:

Ideally, I think we should propagate the conf via SparkConf and change the method signature to

- def nativeBLAS: NetlibBLAS
+ def nativeBLAS(allowNative: Boolean): NetlibBLAS

but I found many places that call BLAS.nativeBLAS where SparkConf is unavailable, so I propose using a sys property.
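Putting the hunk above together, the gated lookup would read roughly like this (a sketch reconstructed from the truncated diff; the "false" branch body is an assumption):

```scala
// Sketch reconstructed from the truncated diff; not the verbatim patch.
private[spark] def nativeBLAS: NetlibBLAS = {
  if (_nativeBLAS == null) {
    _nativeBLAS = System.getProperty("spark.ml.allowNativeBlas", "true") match {
      case "false" => javaBLAS // user explicitly disabled native BLAS
      case _ => // default: try native, fall back to Java BLAS on any failure
        try NetlibNativeBLAS.getInstance catch { case _: Throwable => javaBLAS }
    }
  }
  _nativeBLAS
}
```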

@@ -46,8 +46,7 @@ The installation should be done on all nodes of the cluster. Generic version of
 
 For Debian / Ubuntu:
 ```
-sudo apt-get install libopenblas-base
-sudo update-alternatives --config libblas.so.3
+sudo apt-get install libopenblas-dev
 ```
Member (Author):

libopenblas-base was removed in Debian 12 and Ubuntu 24.04; libopenblas-dev should be used instead.

https://github.com/luhenry/netlib/blob/6835050840ead1a724e2f305875c92d7cc21f834/.github/workflows/build-and-test.yml#L31

Also, update-alternatives --config libblas.so.3 does not work, and its behavior varies across CPU architectures and OS versions:

root@0bef5c80cdaa:/# update-alternatives --config libblas.so.3
update-alternatives: error: no alternatives for libblas.so.3

Given that netlib already allows using -Ddev.ludovic.netlib.lapack.nativeLib=... to choose the native libraries, I would suggest removing the instructions on using alternatives to manage the OS default library from our docs.
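For example, pointing netlib at a specific library file directly (the lapack property name is the one quoted above; the library filename here is an assumption):

```
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Ddev.ludovic.netlib.lapack.nativeLib=liblapack.so.3" \
  ...
```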


/**
 * BLAS routines for MLlib's vectors and matrices.
 */
private[spark] object BLAS extends Serializable with Logging {

  @transient private var _javaBLAS: NetlibBLAS = _
  @transient private var _nativeBLAS: NetlibBLAS = _
Member (Author):

Remove the duplicated instance creation and call org.apache.spark.ml.linalg.BLAS instead.
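That is, the mllib object would delegate to the mllib-local one instead of re-resolving the instances itself, roughly (a sketch of the stated intent; the delegation code is an assumption, not the patch):

```scala
import dev.ludovic.netlib.blas.{BLAS => NetlibBLAS}
import org.apache.spark.ml.linalg.{BLAS => NewBLAS}

// Sketch: mllib's BLAS delegating to mllib-local's resolution logic.
private[spark] def javaBLAS: NetlibBLAS = NewBLAS.javaBLAS
private[spark] def nativeBLAS: NetlibBLAS = NewBLAS.nativeBLAS
```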

 /**
  * ARPACK routines for MLlib's vectors and matrices.
  */
-private[spark] object ARPACK extends Serializable {
+private[spark] object ARPACK extends Serializable with Logging {
Member (Author):

Should I move ARPACK and LAPACK to mllib-local, to align with BLAS?
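If the flag stays a single sys property, ARPACK could be gated the same way as BLAS (a sketch assuming netlib's dev.ludovic.netlib.arpack API; not taken from the patch):

```scala
import dev.ludovic.netlib.arpack.{ARPACK => NetlibARPACK, JavaARPACK => NetlibJavaARPACK, NativeARPACK => NetlibNativeARPACK}

@transient private var _nativeARPACK: NetlibARPACK = _

// Sketch: the same sys-prop gate as BLAS.nativeBLAS, applied to ARPACK.
private[spark] def nativeARPACK: NetlibARPACK = {
  if (_nativeARPACK == null) {
    _nativeARPACK = System.getProperty("spark.ml.allowNativeBlas", "true") match {
      case "false" => NetlibJavaARPACK.getInstance
      case _ =>
        try NetlibNativeARPACK.getInstance
        catch { case _: Throwable => NetlibJavaARPACK.getInstance }
    }
  }
  _nativeARPACK
}
```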

pan3793 marked this pull request as ready for review February 18, 2025 16:10

@pan3793 (Member, Author) commented Feb 18, 2025

@srowen I addressed your previous comments (replied inline in each thread) and found some other issues while working on this PR:

  1. spark.executor.extraJavaOptions forbids values containing -Dspark, which also affects SPARK-47383 ([SPARK-47383][CORE] Support spark.shutdown.timeout config #45504 (comment)). To fix that, we can either make an exception for spark.ml.allowNativeBlas when validating spark.executor.extraJavaOptions, or choose a different prefix, e.g. netlib.allowNativeBlas.

  2. In addition to mllib-local/src/main/scala/org/apache/spark/ml/linalg/BLAS.scala, I found three other places that need modification:

    • mllib/src/main/scala/org/apache/spark/mllib/linalg/BLAS.scala
    • mllib/src/main/scala/org/apache/spark/mllib/linalg/ARPACK.scala
    • mllib/src/main/scala/org/apache/spark/mllib/linalg/LAPACK.scala

    I'm not sure whether we want three configurations (e.g. spark.ml.allowNativeBlas, spark.ml.allowNativeArpack, spark.ml.allowNativeLapack) or just a single spark.ml.allowNativeBlas (or a better name?).

  3. The official Spark Docker image does not install the native BLAS libraries; should we install them?

It would be great if you could take another look. Thank you in advance.
