This morning users were unable to submit batch jobs (all environments). The logging in Kibana stopped at the point where a spark-submit was about to be called:
This suggested that the submit_batch_job_spark3.sh script itself was hanging, but there doesn't seem to be a way to find out exactly why, because the application only logs the output of this child process once the child process finishes.
cat /proc/<pid>/fd/1 or 2 didn't work, so I resorted to docker exec'ing into the web app container and running the submit_batch_job_spark3.sh script manually (all of its parameters are also in the log above); its output was this:
22/08/12 07:48:29 INFO ZooKeeper: Initiating client connection, connectString=epod-master1.vgt.vito.be:2181,epod-master2.vgt.vito.be:2181,epod-master3.vgt.vito.be:2181 sessionTimeout=50000 watcher=org.apache.accumulo.fate.zookeeper.ZooSession$ZooWatcher@474e34e4
22/08/12 07:48:29 INFO X509Util: Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
22/08/12 07:48:29 INFO ClientCnxnSocket: jute.maxbuffer value is 1048575 Bytes
22/08/12 07:48:29 INFO ClientCnxn: zookeeper.request.timeout value is 0. feature enabled=false
22/08/12 07:48:29 INFO ClientCnxn: Opening socket connection to server epod-master2.vgt.vito.be/192.168.207.57:2181.
22/08/12 07:48:29 INFO ClientCnxn: SASL config status: Will not attempt to authenticate using SASL (unknown error)
22/08/12 07:48:29 INFO ClientCnxn: Socket connection established, initiating session, client: /192.168.207.55:38898, server: epod-master2.vgt.vito.be/192.168.207.57:2181
22/08/12 07:48:29 INFO ClientCnxn: Session establishment complete on server epod-master2.vgt.vito.be/192.168.207.57:2181, session id = 0x2817757a3e869ea, negotiated timeout = 40000
22/08/12 07:48:30 WARN ServerClient: There are no tablet servers: check that zookeeper and accumulo are running.
It seemed to hang indefinitely and only proceeded once the Accumulo tablet servers (which were indeed all down) were restarted.
Tasks:
- Improve the debuggability of the submit script: maybe we can capture its output while the script is still running, but a better solution might be to prevent it from hanging indefinitely. subprocess.check_output has a timeout parameter that we could, for example, set to match the timeout after which the proxy responds with a 502 to the client's start-batch-job request (see the first sketch after this list).
- Find out why spark-submit attempts to contact Accumulo in the first place; is this necessary?
- Apparently it takes quite a bit of time from the client's point of view to start a batch job because it involves some non-trivial work (e.g. spark-submit); can we do this asynchronously? We'll still want to put a timeout on the spark-submit, although it can be quite a bit longer than the proxy timeout (see the second sketch below).
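A minimal sketch of the timeout option, assuming the web app currently invokes the submit script via subprocess.check_output; the script arguments and the 60-second value below are illustrative, not the actual proxy timeout:

```python
import subprocess

SUBMIT_SCRIPT = "submit_batch_job_spark3.sh"
PROXY_TIMEOUT_SECONDS = 60  # assumption: roughly the proxy's 502 timeout

def run_submit_script(args):
    try:
        # Capture stdout and stderr together so a failure can be diagnosed from one stream.
        output = subprocess.check_output(
            [SUBMIT_SCRIPT, *args],
            stderr=subprocess.STDOUT,
            timeout=PROXY_TIMEOUT_SECONDS,
        )
        return output.decode("utf-8")
    except subprocess.TimeoutExpired as e:
        # e.output holds whatever the script wrote before it was killed, so we can
        # still log it (e.g. the "There are no tablet servers" warning above).
        partial = (e.output or b"").decode("utf-8", errors="replace")
        raise RuntimeError(
            f"{SUBMIT_SCRIPT} did not finish within {PROXY_TIMEOUT_SECONDS}s; "
            f"partial output:\n{partial}"
        ) from e
```

With this, a hang like this morning's would surface as a logged error (including the Accumulo warning) instead of a silent 502.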
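And a rough sketch of the asynchronous variant, assuming a plain background thread inside the web app; submit_batch_job_async, on_done and the 15-minute timeout are hypothetical names/values for illustration. It also streams the script's output line by line, which would cover the debuggability point from the first task:

```python
import logging
import subprocess
import threading

logger = logging.getLogger(__name__)

# Assumption: well above the proxy timeout, but still bounded.
SPARK_SUBMIT_TIMEOUT_SECONDS = 15 * 60

def submit_batch_job_async(args, on_done):
    """Run submit_batch_job_spark3.sh in a background thread so the
    start-batch-job request can return before spark-submit finishes."""

    def _run():
        proc = subprocess.Popen(
            ["submit_batch_job_spark3.sh", *args],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        # Watchdog: kill the process if it exceeds the timeout, so a hang
        # (e.g. on unreachable Accumulo tablet servers) cannot last forever.
        watchdog = threading.Timer(SPARK_SUBMIT_TIMEOUT_SECONDS, proc.kill)
        watchdog.start()
        try:
            # Stream output line by line so it shows up in the logs while the
            # script is still running, not only after it exits.
            for line in proc.stdout:
                logger.info("submit_batch_job_spark3.sh: %s", line.rstrip())
            returncode = proc.wait()
        finally:
            watchdog.cancel()
        on_done(returncode)

    threading.Thread(target=_run, daemon=True).start()
```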