
The PostCommit Python Arm job is flaky #30760

Closed
github-actions bot opened this issue Mar 27, 2024 · 27 comments · Fixed by #33849

@github-actions

The PostCommit Python Arm job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python_Arm.yml?query=is%3Afailure+branch%3Amaster to see the logs.

@chamikaramj

@tvalentyn do we have a good owner for this?

@ahmedabu98

I actually can't find a single green run since this test suite was created (back in September)

@tvalentyn

You may be right, thanks for the correction, @ahmedabu98.

2024-04-24T12:03:53.0963029Z Please verify that you have permissions to write to the parent directory..
2024-04-24T12:03:53.0964903Z The configuration directory may not be writable. To learn more, see https://cloud.google.com/sdk/docs/configurations#creating_a_configuration
2024-04-24T12:03:53.0968080Z ERROR: (gcloud.auth.docker-helper) Could not create directory [/var/lib/kubelet/pods/573a1844-124b-4e12-bb0f-0325d0f3c3aa/volumes/kubernetes.io~empty-dir/gcloud]: Permission denied.
2024-04-24T12:03:53.0969612Z 
2024-04-24T12:03:53.0970063Z Please verify that you have permissions to write to the parent directory.
2024-04-24T12:03:53.3953756Z #29 pushing layers 1.4s done
2024-04-24T12:03:53.3956208Z #29 ERROR: failed to push us.gcr.io/apache-beam-testing/github-actions/beam_python3.8_sdk:2.57.0-SNAPSHOT: error getting credentials - err: exit status 1, out: ``
2024-04-24T12:03:53.8953735Z ------

cc: @damccorm - do you remember whether this suite never worked, or whether the above error is an artifact of the GHA migration?

We can reclassify this as part of the ARM backlog work.

@tvalentyn tvalentyn added P2 and removed P1 labels Apr 24, 2024
@damccorm

Looks like it went flaky and then permared around that time.

@ahmedabu98

Ahh, my apologies, I was looking at it through an is:failure filter.

@volatilemolotov

So by removing
https://github.com/apache/beam/blob/master/.github/workflows/beam_PostCommit_Python_Arm.yml#L113

I get the test to move along, but it's still failing on my fork due to a permission issue with the Healthcare API. The OAuth scope is wrong or something:
https://github.com/volatilemolotov/beam/actions/runs/8820257015/job/24213449686#step:13:13113

@damccorm

@volatilemolotov could you put up a PR to make that change? Definitely seems like it is getting further.

@svetakvsundhar do you know what scope is missing? Given the normal postcommit python isn't failing, it might just be an issue with your service account specifically?

@volatilemolotov

Sure, here it is: #31102

@damccorm

Thanks - merged, let's see what the result on master is.

@svetakvsundhar

> @svetakvsundhar do you know what scope is missing? Given the normal postcommit python isn't failing, it might just be an issue with your service account specifically?

+1, it could be a service-account-specific issue. I'd want to see a couple more runs of this to see if it's actually an issue. If so, a thought might be to add ["https://www.googleapis.com/auth/cloud-platform"] as a scope manually in the test.
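
If it does turn out to be a scope issue, a minimal sketch of what requesting that scope explicitly could look like (assuming the test builds its own credentials with google-auth; the actual test wiring may differ):

```python
# Hypothetical sketch, not the current test code: request the cloud-platform scope
# explicitly when building credentials for the Healthcare API calls.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# Example call with the scoped credentials (endpoint shown for illustration).
resp = session.get(
    f"https://healthcare.googleapis.com/v1/projects/{project}/locations")
print(resp.status_code)
```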


@damccorm

Great, thanks @volatilemolotov

Looks like we're still flaky - https://github.com/apache/beam/actions/runs/8843342204/job/24283441647 - but that's an improvement, and it looks like a test flake instead of an infra issue.

@kennknowles

Permared now


@github-actions

Reopening since the workflow is still flaky

@damccorm

Fixed by #32530

@github-actions github-actions bot reopened this Nov 5, 2024

github-actions bot commented Nov 5, 2024

Reopening since the workflow is still flaky

@damccorm

damccorm commented Nov 5, 2024

This is failing because of Dataflow issues, not because of Beam. Dataflow is requesting Arm machines in regions where there are none, which fails the job. I reopened an internal bug (id 352725422).
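
Not the internal fix, but as a hedged workaround sketch for anyone hitting this on their own jobs: pin the job to a region and machine type that actually have Arm (T2A) capacity via pipeline options. The region, machine type, and bucket below are illustrative assumptions, not the suite's current settings.

```python
# Hypothetical workaround sketch: pin the Dataflow job to a region/machine type
# with Arm (T2A) capacity instead of letting Dataflow pick.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=apache-beam-testing",        # project name taken from the logs above
    "--region=us-central1",                 # assumed region with T2A availability
    "--machine_type=t2a-standard-1",        # Arm (T2A) machine family on Dataflow
    "--temp_location=gs://my-bucket/temp",  # placeholder bucket
])
```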

@ahmedabu98

The following tests appear to be consistently failing:

  • apache_beam/ml/inference/sklearn_inference_it_test.py::SklearnInference::test_sklearn_mnist_classification
  • apache_beam/ml/inference/sklearn_inference_it_test.py::SklearnInference::test_sklearn_mnist_classification_large_model
  • apache_beam/ml/inference/sklearn_inference_it_test.py::SklearnInference::test_sklearn_regression

Errors are similar:
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

@tvalentyn @damccorm any ideas?
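
For context, that error usually means scikit-learn's compiled extensions were built against a different NumPy ABI than the NumPy installed at runtime. A quick sanity check in the failing environment would be something like:

```python
# Diagnostic sketch: print the NumPy and scikit-learn versions in the failing
# environment. A scikit-learn build whose NumPy ABI does not match the installed
# NumPy (e.g. a 1.x-built wheel running under NumPy 2.x) raises the
# "numpy.dtype size changed" ValueError above.
import numpy
import sklearn

print("numpy:", numpy.__version__)
print("scikit-learn:", sklearn.__version__)
```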

@ahmedabu98

Python Postcommits are broken for the same reason: https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml

@damccorm

I'm guessing this is from https://github.com/apache/beam/pull/33658/files where we updated the numpy version to 2.x. See https://stackoverflow.com/questions/40845304/runtimewarning-numpy-dtype-size-changed-may-indicate-binary-incompatibility

It is coming from unpickling the model here:

model_path = 'gs://apache-beam-ml/models/mnist_model_svm.pickle'

We probably just need to retrain/re-upload the model.

cc/ @liferoad @Amar3tto

@liferoad liferoad assigned Amar3tto and unassigned tvalentyn Jan 24, 2025
@liferoad

Where are the instructions to retrain the model? If the training is cheap, the model training should be part of the tests so we avoid maintaining this in the future.

@liferoad

I do not think we have the original script to train these models. But https://dmkothari.github.io/Machine-Learning-Projects/SVM_with_MNIST.html should be simple enough for us to update these tests. @Amar3tto
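
Something roughly along the lines of that tutorial could regenerate the pickle with the environment's current NumPy/scikit-learn. The dataset fetch, hyperparameters, and subset sizes below are assumptions for illustration, not the original training setup:

```python
# Hypothetical retraining sketch: fit a small SVM on MNIST and pickle it with the
# currently installed NumPy/scikit-learn so that unpickling in the test environment
# matches the runtime ABI. Hyperparameters and subset sizes are illustrative only.
import pickle

from sklearn import svm
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X / 255.0, y, train_size=10000, test_size=2000, random_state=42)

model = svm.SVC(kernel="rbf", gamma=0.05)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

with open("mnist_model_svm.pickle", "wb") as f:
    pickle.dump(model, f)
# Then upload the new pickle to the GCS path the test reads
# (gs://apache-beam-ml/models/mnist_model_svm.pickle).
```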

@damccorm

> Where are the instructions to retrain the model? If the training is cheap, the model training should be part of the tests so we avoid maintaining this in the future.

I agree - this seems like the right fix. I imagine retraining should be reasonably cheap

@akashorabek akashorabek self-assigned this Jan 27, 2025
@akashorabek

akashorabek commented Jan 28, 2025

I tried retraining the model, but the error ValueError: numpy.dtype size changed still appears. Most likely, there is an incompatibility in scikit-learn with the new NumPy version specifically for Python 3.9 and 3.10.
I found a similar issue here, and it seems that downgrading NumPy is the only solution that works. I tried it locally, and it fixed the problem.
Can we revert to an older NumPy version for Python 3.9 and 3.10? I see @Abacn already created a draft PR for this.

@damccorm @liferoad

@liferoad

I think we should either disable this test for Py39 and Py310 or create a Python venv with numpy 1.x to run it.
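
A sketch of the first option, assuming a plain unittest-style skip (the exact condition and message are assumptions about where the incompatibility lies):

```python
# Hypothetical skip sketch: skip the sklearn MNIST tests on Python 3.9/3.10 when
# NumPy 2.x is installed, which is where the binary incompatibility above shows up.
import sys
import unittest

import numpy as np

_NUMPY2_ON_OLD_PYTHON = (
    sys.version_info < (3, 11) and int(np.__version__.split(".")[0]) >= 2)


@unittest.skipIf(
    _NUMPY2_ON_OLD_PYTHON,
    "sklearn pickles are binary-incompatible with NumPy 2.x on Python 3.9/3.10")
class SklearnMnistSkipExample(unittest.TestCase):
    def test_placeholder(self):
        # The real assertions from sklearn_inference_it_test would go here.
        pass
```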
