[SPARK-51243][CORE][ML] Configurable allow native BLAS #49986
Conversation
The current approach works with spark-submit; users should modify their Java command options for cases that create an embedded SparkContext.
cc @zhengruifeng @panbingkun, could you please take a look? And do you have a better idea of how to implement the configuration?

I think this PR needs reviews from @srowen, @WeichenXu123, and @luhenry.
```
@@ -3436,6 +3437,20 @@ object SparkContext extends Logging {
     supplement(DRIVER_JAVA_OPTIONS)
     supplement(EXECUTOR_JAVA_OPTIONS)
   }
+
+  private def supplementBlasOptions(conf: SparkConf): Unit = {
+    conf.getOption("spark.ml.allowNativeBlas").foreach { allowNativeBlas =>
```
I am not sure whether we can use an env variable like `MKL_NUM_THREADS=1` or `OPENBLAS_NUM_THREADS=1`.
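For context, a hypothetical shell snippet of what that suggestion would look like. Note that, per the replies in this thread, these variables only cap native BLAS thread counts; they do not prevent the native library from being loaded at all:

```shell
# Hypothetical illustration: these env vars limit native BLAS thread counts,
# but they do not disable native BLAS loading entirely.
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
echo "MKL_NUM_THREADS=$MKL_NUM_THREADS OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS"
```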
I skimmed the codebase https://github.com/luhenry/netlib and found neither a sys prop nor an env var to disable native BLAS loading, so I need to introduce a new one. Do you mean an env var is preferred over a sys prop?
I guess I mean: does this have to propagate via a sys property at all, versus just being used as a config in the code directly? But it is probably hard to plumb through access to the config object.
@srowen I replied in #49986 (comment)
```
private def supplementBlasOptions(conf: SparkConf): Unit = {
  conf.getOption("spark.ml.allowNativeBlas").foreach { allowNativeBlas =>
    def supplement(key: OptionalConfigEntry[String]): Unit = {
```
This is repeated so many times now - I wonder if a simple refactor is in order to have one 'supplement' function?
Thanks for the suggestion, will do.
Refactored by extracting a common method `supplementJavaOpts`.
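A minimal sketch of that refactor idea in plain Java (class name, helper name, and signature are hypothetical, not the actual Spark code): one shared helper appends a `-Dkey=value` flag to an options string only when a value is configured.

```java
import java.util.Optional;

public class SupplementDemo {
    // Hypothetical helper mirroring the idea of the extracted
    // 'supplementJavaOpts': append the flag only when a value is present.
    static String supplementJavaOpts(String existing, String key, Optional<String> value) {
        return value.map(v -> existing + " -D" + key + "=" + v).orElse(existing);
    }

    public static void main(String[] args) {
        // With a configured value, the flag is appended to the existing opts.
        System.out.println(supplementJavaOpts(
            "-Xmx1g", "spark.ml.allowNativeBlas", Optional.of("false")));
    }
}
```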
```
@@ -1049,6 +1049,10 @@ private[spark] class Client(

     javaOpts += s"-Djava.net.preferIPv6Addresses=${Utils.preferIPv6}"

+    sparkConf.getOption("spark.ml.allowNativeBlas").foreach { allowNativeBlas =>
+      javaOpts += s"-Dspark.ml.allowNativeBlas=$allowNativeBlas"
```
Do other resource managers like K8s need this? Not sure.
The impl mostly follows how we process `java.net.preferIPv6Addresses`; I will test on K8s and report back here.
> Do other resource managers like K8s need this? Not sure.

K8s does not need that change.

Spark on YARN: the code appends `-Dspark.ml.allowNativeBlas=...` to the YARN AM process command; we need to assemble a full Java command so the YARN RM knows how to bootstrap the process.

Spark on K8s:
- client mode: there is no driver Pod
- cluster mode: runs `spark-submit` (which carries all Java options from the local `spark-submit`) in the driver Pod

so it does not need to append those Java options again.
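The YARN-side option assembly can be sketched in shell (this is an illustration, not Spark code; the values are hypothetical):

```shell
# Sketch of how the flag joins the JVM options used to launch the YARN AM.
# ALLOW_NATIVE_BLAS stands in for the optional conf value.
ALLOW_NATIVE_BLAS="false"
JAVA_OPTS="-Djava.net.preferIPv6Addresses=false"
if [ -n "$ALLOW_NATIVE_BLAS" ]; then
  JAVA_OPTS="$JAVA_OPTS -Dspark.ml.allowNativeBlas=$ALLOW_NATIVE_BLAS"
fi
echo "$JAVA_OPTS"
```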
```
@@ -39,8 +39,11 @@ private[spark] object BLAS extends Serializable {
   // For level-3 routines, we use the native BLAS.
   private[spark] def nativeBLAS: NetlibBLAS = {
     if (_nativeBLAS == null) {
-      _nativeBLAS =
-        try { NetlibNativeBLAS.getInstance } catch { case _: Throwable => javaBLAS }
+      _nativeBLAS = System.getProperty("spark.ml.allowNativeBlas", "true") match {
```
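The gating behavior of the patched method can be sketched in plain Java (class and method names are hypothetical, and the native-library loading is simulated with a flag; the real code calls `NetlibNativeBLAS.getInstance`):

```java
public class BlasGate {
    // Sketch of the selection logic: the sys prop defaults to "true";
    // only an explicit "false" skips the native path entirely.
    static String pick(boolean nativeLoadable) {
        boolean allowNative = Boolean.parseBoolean(
            System.getProperty("spark.ml.allowNativeBlas", "true"));
        if (!allowNative) {
            return "javaBLAS";
        }
        try {
            if (!nativeLoadable) {
                // Stand-in for NetlibNativeBLAS.getInstance failing to load.
                throw new UnsatisfiedLinkError("no native BLAS available");
            }
            return "nativeBLAS";
        } catch (Throwable t) {
            return "javaBLAS"; // same fallback as before the patch
        }
    }

    public static void main(String[] args) {
        System.setProperty("spark.ml.allowNativeBlas", "false");
        // Native is skipped even though it would be loadable.
        System.out.println(pick(true)); // prints "javaBLAS"
    }
}
```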
This has to be a sys property because of how early it has to be initialized?
Ideally, I think we should propagate the conf via `SparkConf` and change the method signature to

```
- def nativeBLAS: NetlibBLAS
+ def nativeBLAS(allowNative: Boolean): NetlibBLAS
```

but I found many places that call `BLAS.nativeBLAS` where `SparkConf` is unavailable, so I propose to use a sys property.
```
@@ -46,8 +46,7 @@ The installation should be done on all nodes of the cluster. Generic version of

 For Debian / Ubuntu:
-sudo apt-get install libopenblas-base
-sudo update-alternatives --config libblas.so.3
+sudo apt-get install libopenblas-dev
```
`libopenblas-base` was removed in Debian 12 and Ubuntu 24.04; `libopenblas-dev` should be used instead.

Also, `update-alternatives --config libblas.so.3` does not work, and it varies across CPU architectures and OS versions:

```
root@0bef5c80cdaa:/# update-alternatives --config libblas.so.3
update-alternatives: error: no alternatives for libblas.so.3
```

Given that it already allows using `-Ddev.ludovic.netlib.lapack.nativeLib=...` to choose the native libraries, I would suggest removing the `alternatives`-based instructions for managing the OS default library from our docs.
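A hypothetical example of pointing netlib at a specific library via that system property instead of `update-alternatives` (the library file name here is a placeholder, not a verified value):

```shell
# Hypothetical: select the native LAPACK library explicitly through netlib's
# own system property; "liblapack.so.3" is an example file name.
NETLIB_OPT="-Ddev.ludovic.netlib.lapack.nativeLib=liblapack.so.3"
echo "spark-shell --driver-java-options \"$NETLIB_OPT\""
```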
```
/**
 * BLAS routines for MLlib's vectors and matrices.
 */
private[spark] object BLAS extends Serializable with Logging {

  @transient private var _javaBLAS: NetlibBLAS = _
  @transient private var _nativeBLAS: NetlibBLAS = _
```
Remove the duplicated instance creation and call `org.apache.spark.ml.linalg.BLAS` instead.
```
 /**
  * ARPACK routines for MLlib's vectors and matrices.
  */
-private[spark] object ARPACK extends Serializable {
+private[spark] object ARPACK extends Serializable with Logging {
```
Should I move `ARPACK` and `LAPACK` to `mllib-local`, to align with `BLAS`?
@srowen I addressed your previous comments (replied inline in each thread) and found some other issues while working on this PR; it would be great if you could have another look. Thank you in advance.
What changes were proposed in this pull request?
This PR proposes introducing a new configuration, `spark.ml.allowNativeBlas`. When it is set to `false`, Spark always uses Java BLAS, even when a native library (`openblas` or `mkl`) is available on the machine.

Why are the changes needed?
Currently, many places in the Spark codebase are hardcoded to call `BLAS.nativeBLAS`, which always uses `JNIBLAS` when `NativeBLAS` is available. This is generally a good idea, but I found some negative cases in our internal ML workloads where `JavaBLAS` is faster than `JNIBLAS`; this might be caused by native library (e.g. `mkl` or `openblas`) bugs, or by the library not being optimized for some hardware. Given that, I think we should allow users to disable `NativeBLAS` explicitly.

The proposed `spark.ml.allowNativeBlas` configuration does not strictly follow the Spark configuration system, because the `sparkConf` is not always available on the caller side of `BLAS.nativeBLAS`.

Does this PR introduce any user-facing change?
It adds a new feature, but the default value keeps the current behavior. The docs are also updated to mention the added conf.

How was this patch tested?
Manual tests; I will supply more context after reaching a consensus on how to expose the configuration to end users.
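For reference, a hypothetical invocation exercising the proposed conf (the application class and jar are placeholders):

```shell
# Hypothetical spark-submit command disabling native BLAS for one job;
# com.example.App and app.jar are placeholders.
CMD="spark-submit --conf spark.ml.allowNativeBlas=false --class com.example.App app.jar"
echo "$CMD"
```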
Was this patch authored or co-authored using generative AI tooling?
No.