streamline batch job spark-submit #197

Open
3 tasks
bossie opened this issue Aug 12, 2022 · 0 comments

bossie commented Aug 12, 2022

This morning users were unable to submit batch jobs (all environments). The logging in Kibana stopped at the point where a spark-submit was about to be called:

Submitting job: ['/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/deploy/submit_batch_job_spark3.sh', 'openEO batch_cog_j-e73e3716d4a3426ea6b12904c4184f53_user vdboschj' ...

This suggested that the submit_batch_job_spark3.sh script itself was hanging, but there doesn't seem to be a way to find out exactly why, because the application only logs the output of this child process once the child process has finished.

cat /proc/<pid>/fd/1 (or 2) didn't work, so I resorted to docker exec'ing into the web app container and running the submit_batch_job_spark3.sh script manually (all of its parameters are in the log above); its output was this:

22/08/12 07:48:29 INFO ZooKeeper: Initiating client connection, connectString=epod-master1.vgt.vito.be:2181,epod-master2.vgt.vito.be:2181,epod-master3.vgt.vito.be:2181 sessionTimeout=50000 watcher=org.apache.accumulo.fate.zookeeper.ZooSession$ZooWatcher@474e34e4
22/08/12 07:48:29 INFO X509Util: Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
22/08/12 07:48:29 INFO ClientCnxnSocket: jute.maxbuffer value is 1048575 Bytes
22/08/12 07:48:29 INFO ClientCnxn: zookeeper.request.timeout value is 0. feature enabled=false
22/08/12 07:48:29 INFO ClientCnxn: Opening socket connection to server epod-master2.vgt.vito.be/192.168.207.57:2181.
22/08/12 07:48:29 INFO ClientCnxn: SASL config status: Will not attempt to authenticate using SASL (unknown error)
22/08/12 07:48:29 INFO ClientCnxn: Socket connection established, initiating session, client: /192.168.207.55:38898, server: epod-master2.vgt.vito.be/192.168.207.57:2181
22/08/12 07:48:29 INFO ClientCnxn: Session establishment complete on server epod-master2.vgt.vito.be/192.168.207.57:2181, session id = 0x2817757a3e869ea, negotiated timeout = 40000
22/08/12 07:48:30 WARN ServerClient: There are no tablet servers: check that zookeeper and accumulo are running.

It seemed to hang indefinitely and only proceeded once the Accumulo tablet servers (which were indeed all down) were restarted.

Tasks:

  • Improve the debuggability of the submit script. Maybe we can capture its output while the script is still running, but a better solution might be to prevent it from hanging indefinitely: subprocess.check_output has a timeout parameter that we could, for example, set to match the timeout after which the proxy responds with a 502 to the client's start-batch-job request (see the first sketch below this list).
  • Find out why spark-submit attempts to contact Accumulo in the first place; is this necessary?
  • Apparently it takes quite a bit of time, from the client's point of view, to start a batch job, because it involves some non-trivial work (e.g. the spark-submit itself); can we do this asynchronously (see the second sketch below this list)? We'll still want to put a timeout on the spark-submit, although it can be quite a bit longer than the proxy timeout.
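
A minimal sketch of the timeout idea from the first task. This is not the actual openeogeotrellis code; the function name, argument list and timeout value are placeholders, and the real submit command is the one shown in the log above. The point is that a hang like this morning's would surface as an error with partial output instead of blocking silently:

```python
import subprocess

# Hypothetical helper, not the actual openeogeotrellis code: run the submit script
# with a hard timeout instead of waiting on it forever, and keep whatever output it
# produced so it can be logged even when the script hangs or fails.
def run_submit_script(args: list, timeout_seconds: int = 60) -> bytes:
    try:
        return subprocess.check_output(
            args,
            stderr=subprocess.STDOUT,   # merge stderr into the captured output
            timeout=timeout_seconds,    # e.g. aligned with the proxy's 502 timeout
        )
    except subprocess.TimeoutExpired as e:
        # e.output holds whatever the child wrote before the timeout; surfacing it
        # here is exactly the information that was missing in Kibana this morning.
        partial = e.output.decode(errors="replace") if e.output else "<no output>"
        raise RuntimeError(
            f"submit script timed out after {timeout_seconds}s; partial output:\n{partial}"
        )
```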
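
And a minimal sketch of the asynchronous variant from the last task, assuming a hypothetical job_registry object with a set_status method (illustrative only, not the real API): the start-batch-job request returns immediately while the spark-submit runs in a background thread with a longer timeout.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)
SUBMIT_TIMEOUT = 15 * 60  # can be much longer than the proxy timeout

def start_batch_job(job_id: str, submit_args: list, job_registry) -> str:
    """Kick off the spark-submit in the background and return immediately."""

    def _submit():
        try:
            subprocess.check_output(submit_args, stderr=subprocess.STDOUT,
                                    timeout=SUBMIT_TIMEOUT)
            job_registry.set_status(job_id, "queued")   # hypothetical API
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            job_registry.set_status(job_id, "error")    # hypothetical API

    _executor.submit(_submit)
    return job_id  # the HTTP request can now respond well within the proxy timeout
```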