streamline batch job spark-submit #197

Open
3 tasks
bossie opened this issue Aug 12, 2022 · 0 comments

bossie commented Aug 12, 2022

This morning users were unable to submit batch jobs (all environments). The logging in Kibana stopped at the point where a spark-submit was about to be called:

Submitting job: ['/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/deploy/submit_batch_job_spark3.sh', 'openEO batch_cog_j-e73e3716d4a3426ea6b12904c4184f53_user vdboschj' ...

This suggested that the submit_batch_job_spark3.sh script itself was hanging, but there doesn't seem to be a way to find out exactly why, because the application only logs the output of this child process once the child process has finished.

cat /proc/<pid>/fd/1 (or 2) didn't work, so I resorted to docker exec'ing into the web app container and running the submit_batch_job_spark3.sh script manually (all of its parameters are in the log above); its output was this:

22/08/12 07:48:29 INFO ZooKeeper: Initiating client connection, connectString=epod-master1.vgt.vito.be:2181,epod-master2.vgt.vito.be:2181,epod-master3.vgt.vito.be:2181 sessionTimeout=50000 watcher=org.apache.accumulo.fate.zookeeper.ZooSession$ZooWatcher@474e34e4
22/08/12 07:48:29 INFO X509Util: Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
22/08/12 07:48:29 INFO ClientCnxnSocket: jute.maxbuffer value is 1048575 Bytes
22/08/12 07:48:29 INFO ClientCnxn: zookeeper.request.timeout value is 0. feature enabled=false
22/08/12 07:48:29 INFO ClientCnxn: Opening socket connection to server epod-master2.vgt.vito.be/192.168.207.57:2181.
22/08/12 07:48:29 INFO ClientCnxn: SASL config status: Will not attempt to authenticate using SASL (unknown error)
22/08/12 07:48:29 INFO ClientCnxn: Socket connection established, initiating session, client: /192.168.207.55:38898, server: epod-master2.vgt.vito.be/192.168.207.57:2181
22/08/12 07:48:29 INFO ClientCnxn: Session establishment complete on server epod-master2.vgt.vito.be/192.168.207.57:2181, session id = 0x2817757a3e869ea, negotiated timeout = 40000
22/08/12 07:48:30 WARN ServerClient: There are no tablet servers: check that zookeeper and accumulo are running.

It seemed to hang indefinitely and only proceeded once the Accumulo tablet servers (which were indeed all down) were restarted.

Tasks:

  • Improve the debuggability of the submit script. Maybe we can capture its output while the script is still running, but a better solution might be to prevent it from hanging indefinitely: subprocess.check_output has a timeout parameter that we could, for example, set to match the timeout after which the proxy responds with a 502 to the client's start-batch-job request (see the first sketch below this list).
  • Find out why spark-submit attempts to contact Accumulo in the first place; is this necessary?
  • Apparently it takes quite a bit of time, from the client's point of view, to start a batch job, because it involves some non-trivial work (e.g. the spark-submit itself); can we do this asynchronously (see the second sketch below this list)? We'll still want to put a timeout on the spark-submit, although it can be quite a bit longer than the proxy timeout.
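
A minimal sketch of the timeout idea from the first task. This is not the actual openeogeotrellis code; the function name, argument list and timeout value are placeholders, and the real submit command is the one shown in the log above. The point is that a hang like this morning's would surface as an error with partial output instead of blocking silently:

```python
import subprocess

# Hypothetical helper, not the actual openeogeotrellis code: run the submit script
# with a hard timeout instead of waiting on it forever, and keep whatever output it
# produced so it can be logged even when the script hangs or fails.
def run_submit_script(args: list, timeout_seconds: int = 60) -> bytes:
    try:
        return subprocess.check_output(
            args,
            stderr=subprocess.STDOUT,   # merge stderr into the captured output
            timeout=timeout_seconds,    # e.g. aligned with the proxy's 502 timeout
        )
    except subprocess.TimeoutExpired as e:
        # e.output holds whatever the child wrote before the timeout; surfacing it
        # here is exactly the information that was missing in Kibana this morning.
        partial = e.output.decode(errors="replace") if e.output else "<no output>"
        raise RuntimeError(
            f"submit script timed out after {timeout_seconds}s; partial output:\n{partial}"
        )
```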
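
And a minimal sketch of the asynchronous variant from the last task, assuming a hypothetical job_registry object with a set_status method (illustrative only, not the real API): the start-batch-job request returns immediately while the spark-submit runs in a background thread with a longer timeout.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)
SUBMIT_TIMEOUT = 15 * 60  # can be much longer than the proxy timeout

def start_batch_job(job_id: str, submit_args: list, job_registry) -> str:
    """Kick off the spark-submit in the background and return immediately."""

    def _submit():
        try:
            subprocess.check_output(submit_args, stderr=subprocess.STDOUT,
                                    timeout=SUBMIT_TIMEOUT)
            job_registry.set_status(job_id, "queued")   # hypothetical API
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            job_registry.set_status(job_id, "error")    # hypothetical API

    _executor.submit(_submit)
    return job_id  # the HTTP request can now respond well within the proxy timeout
```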