
detect and clearly report OOM error and improve python UDF error reporting #890

Open
jdries opened this issue Oct 4, 2024 · 2 comments

jdries commented Oct 4, 2024

I had a batch job with these final errors in the editor log viewer:

OpenEO batch job failed: Exception during Spark execution: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 8 (load_collection: read by input product) has failed the maximum allowable number of times: 4. Most recent failure reason...

Batch job error stack trace with locals (followed by nothing at all)


Previous errors were very clear:

 ExceptionFailure(java.lang.OutOfMemoryError,Java heap space,[Ljava.lang.StackTraceElement;@5f4b7e8c,java.lang.OutOfMemoryError: Java heap space at

Lost executor 9 on 10.42.194.120: The executor with id 9 exited with exit code 52(JVM OOM). The API gave the following container statuses: container name: spark-kubernetes-executor container image: registry.stag.warsaw.openeo.dataspace.copernicus.eu/staging/openeo-geotrellis-kube:20241003-1972 container state: terminated container started at: 2024-10-03T18:10:58Z

It should really be possible to detect these errors and, as a final message, advise the user to increase executor-memory.

Example job: j-2410038ebc994cb9a485ee653e1136cb
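As a rough illustration (not the actual implementation), detection could be as simple as matching known OOM signatures in the flattened Spark failure message; the pattern list, helper name, and advice wording below are all hypothetical:

import re

# Hypothetical sketch: flag known OOM signatures in a Spark failure message
# and append actionable advice. Patterns and wording are illustrative only.
OOM_PATTERNS = [
    re.compile(r"java\.lang\.OutOfMemoryError"),
    re.compile(r"Java heap space"),
    re.compile(r"exit code 52"),  # exit code reported by Kubernetes for a JVM OOM
]

def annotate_oom(message: str) -> str:
    if any(p.search(message) for p in OOM_PATTERNS):
        return (message +
                "\nThe job ran out of memory; try increasing the "
                "'executor-memory' job option.")
    return message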

@JeroenVerstraelen changed the title from "detect and clearly report OOM error" to "detect and clearly report OOM error and improve python UDF error reporting" on Feb 17, 2025

jdries commented Feb 19, 2025

A case coming from support: r-2502141707054cd4b2d662309d86b12e
The user got this unusable message:
OpenEoApiError: [500] Internal: Server error: Exception during Spark execution: java.io.EOFException
For synchronous requests, an alternative could be:
The synchronous request failed, potentially due to memory issues. Try running the request as a batch job, which offers more options for tuning job settings in case the defaults also result in this error.

Also very relevant in the logging (see the sketch after the stack trace below):
exception chain classes: org.apache.spark.SparkException caused by org.apache.spark.SparkException caused by java.io.EOFException

This suggests we need to add something right here:

Full stack trace:

Traceback (most recent call last):
  File "/opt/openeo/lib/python3.8/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/openeo/lib/python3.8/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/opt/openeo/lib/python3.8/site-packages/openeo_driver/users/auth.py", line 95, in decorated
    return f(*args, **kwargs)
  File "/opt/openeo/lib/python3.8/site-packages/openeo_driver/views.py", line 705, in result
    response = result.create_flask_response()
  File "/opt/openeo/lib/python3.8/site-packages/openeo_driver/save_result.py", line 179, in create_flask_response
    filename = self.save_result(filename)
  File "/opt/openeo/lib/python3.8/site-packages/openeo_driver/save_result.py", line 159, in save_result
    return self.cube.save_result(filename=filename, format=self.format, format_options=self.options)
  File "/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/geopysparkdatacube.py", line 91, in run
    return func(*args, **kwargs)
  File "/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/geopysparkdatacube.py", line 1743, in save_result
    result = self.write_assets(filename, format, format_options)
  File "/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/geopysparkdatacube.py", line 2171, in write_assets
    asset_paths = get_jvm().org.openeo.geotrellis.netcdf.NetCDFRDDWriter.writeRasters(
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/usr/local/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.openeo.geotrellis.netcdf.NetCDFRDDWriter.writeRasters.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 892.0 failed 4 times, most recent failure: Lost task 5.3 in stage 892.0 (TID 65562) (10.42.46.115 executor 108): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:601)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:583)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:772)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:749)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:514)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:397)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:757)
	... 16 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1206)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1206)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2984)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2923)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2912)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2263)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2328)
	at org.apache.spark.rdd.RDD.count(RDD.scala:1266)
	at org.openeo.geotrellis.netcdf.NetCDFRDDWriter$.cacheAndRepartition(NetCDFRDDWriter.scala:267)
	at org.openeo.geotrellis.netcdf.NetCDFRDDWriter$.saveSingleNetCDFGeneric(NetCDFRDDWriter.scala:126)
	at org.openeo.geotrellis.netcdf.NetCDFRDDWriter$.saveSingleNetCDFGeneric(NetCDFRDDWriter.scala:108)
	at org.openeo.geotrellis.netcdf.NetCDFRDDWriter$.writeRasters(NetCDFRDDWriter.scala:80)
	at org.openeo.geotrellis.netcdf.NetCDFRDDWriter.writeRasters(NetCDFRDDWriter.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:601)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:583)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:772)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:749)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:514)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
	at org.apache.spark.scheduler.Task.run(Task.scala:139)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more
Caused by: java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:397)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:757)
	... 16 more
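A minimal sketch of the kind of detection this points to, assuming the Java cause chain is reachable through Py4JJavaError.java_exception; the function names and the exact advice text are hypothetical:

from py4j.protocol import Py4JJavaError

# Hypothetical sketch: walk the Java cause chain to build the
# "exception chain classes" summary, and map the EOFException that marks a
# crashed Python worker to a readable message. Names are illustrative.
def exception_chain_classes(error: Py4JJavaError) -> list:
    chain, cause = [], error.java_exception
    while cause is not None:
        chain.append(cause.getClass().getName())
        cause = cause.getCause()
    return chain

def summarize_py4j_error(error: Py4JJavaError) -> str:
    chain = exception_chain_classes(error)
    if "java.io.EOFException" in chain:
        return ("The synchronous request failed, potentially due to memory "
                "issues. Try running the request as a batch job, which offers "
                "more options for tuning job settings.")
    return " caused by ".join(chain)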

EmileSonneveld commented

It might be good to suggest the user double-check memory settings after a MemoryError: std::bad_alloc error too.

def summarize_exception_static(error: Exception, width=2000) -> Union[ErrorSummary, Exception]:
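For instance (a hypothetical check, not the existing implementation), summarize_exception_static could additionally match that native allocation failure:

# Hypothetical addition: also treat the native allocation failure surfaced
# by a UDF as an out-of-memory condition. The helper name is illustrative.
def looks_like_native_oom(error: Exception) -> bool:
    # "MemoryError: std::bad_alloc" appears when an allocation inside a
    # native extension used by the UDF fails.
    return isinstance(error, MemoryError) or "std::bad_alloc" in repr(error)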

@soxofaan soxofaan added the UDF label Feb 21, 2025