Scenario Failure: ppi #393

@github-actions

Description

Benchmark scenario ID: ppi
Benchmark scenario definition: https://github.com/ESA-APEx/apex_algorithms/blob/91426ff33c094b53df24495e4582d2f8937475ce/algorithm_catalog/vito/ppi/benchmark_scenarios/ppi.json
openEO backend: openeo.dataspace.copernicus.eu

GitHub Actions workflow run: https://github.com/ESA-APEx/apex_algorithms/actions/runs/22934889391
Workflow artifacts: https://github.com/ESA-APEx/apex_algorithms/actions/runs/22934889391#artifacts

Test start: 2026-03-11 03:11:58.656447+00:00
Test duration: 0:11:00.992127
Test outcome: ❌ failed

Last successful test phase: create-job
Failure in test phase: run-job

Contact Information

Name: Victor Verhaert
Organization: VITO
Contact: via VITO (VITO Website, GitHub)

Process Graph

{
  "ppi1": {
    "process_id": "ppi",
    "namespace": "https://raw.githubusercontent.com/ESA-APEx/apex_algorithms/refs/heads/main/algorithm_catalog/vito/ppi/openeo_udp/ppi.json",
    "arguments": {
      "temporal_extent": [
        "2022-06-11",
        "2022-06-12"
      ],
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              4.4387,
              50.42624
            ],
            [
              5.9539,
              50.42624
            ],
            [
              5.9539,
              51.4424
            ],
            [
              4.4387,
              51.4424
            ],
            [
              4.4387,
              50.42624
            ]
          ]
        ]
      }
    },
    "result": true
  }
}
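For scale: the AOI in the process graph above spans roughly 1.5° of longitude by 1.0° of latitude, which may be relevant to the executor out-of-memory errors in the logs below. A quick back-of-the-envelope size check (equirectangular approximation; the constant and the formula are illustrative, not part of the report):

```python
import math

# Bounding box of the scenario's AOI, taken from the process graph above.
west, south, east, north = 4.4387, 50.42624, 5.9539, 51.4424

KM_PER_DEG = 111.32  # approximate km per degree of latitude
mid_lat = math.radians((south + north) / 2)

# Equirectangular approximation: shrink longitude spans by cos(latitude).
width_km = (east - west) * KM_PER_DEG * math.cos(mid_lat)
height_km = (north - south) * KM_PER_DEG
area_km2 = width_km * height_km

print(f"AOI ≈ {width_km:.0f} km x {height_km:.0f} km ≈ {area_km2:.0f} km²")
```

That is on the order of 12,000 km² of Sentinel-2-resolution input for a single run, so memory pressure on the executors is a plausible failure mode.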

Error Logs

scenario = BenchmarkScenario(id='ppi', description='ppi example', backend='openeo.dataspace.copernicus.eu', process_graph={'ppi1'...PosixPath('/home/runner/work/apex_algorithms/apex_algorithms/algorithm_catalog/vito/ppi/benchmark_scenarios/ppi.json'))
connection_factory = <function connection_factory.<locals>.get_connection at 0x7fac50f39d00>
tmp_path = PosixPath('/home/runner/work/apex_algorithms/apex_algorithms/qa/benchmarks/tmp_path_root/test_run_benchmark_ppi_0')
track_metric = <function track_metric.<locals>.track at 0x7fac50f39e40>
track_phase = <function track_phase.<locals>.track at 0x7fac50f39da0>
upload_assets_on_fail = <function upload_assets_on_fail.<locals>.collect at 0x7fac50f39f80>
request = <FixtureRequest for <Function test_run_benchmark[ppi]>>

    @pytest.mark.parametrize(
        "scenario",
        [
            # Use scenario id as parameterization id to give nicer test names.
            pytest.param(uc, id=uc.id)
            for uc in get_benchmark_scenarios()
        ],
    )
    def test_run_benchmark(
        scenario: BenchmarkScenario,
        connection_factory,
        tmp_path: Path,
        track_metric,
        track_phase,
        upload_assets_on_fail,
        request,
    ):
        track_metric("scenario_id", scenario.id)

        with track_phase(phase="connect"):
            # Check if a backend override has been provided via cli options.
            override_backend = request.config.getoption("--override-backend")
            backend_filter = request.config.getoption("--backend-filter")
            if backend_filter and not re.match(backend_filter, scenario.backend):
                # TODO apply filter during scenario retrieval, but seems to be hard to retrieve cli param
                pytest.skip(
                    f"skipping scenario {scenario.id} because backend {scenario.backend} does not match filter {backend_filter!r}"
                )
            backend = scenario.backend
            if override_backend:
                _log.info(f"Overriding backend URL with {override_backend!r}")
                backend = override_backend

            connection: openeo.Connection = connection_factory(url=backend)

        report_path = None

        with track_phase(phase="create-job"):
            # TODO #14 scenario option to use synchronous instead of batch job mode?
            job = connection.create_job(
                process_graph=scenario.process_graph,
                title=f"APEx benchmark {scenario.id}",
                additional=scenario.job_options,
            )
            track_metric("job_id", job.job_id)

            if request.config.getoption("--upload-benchmark-report"):
                report_path = tmp_path / "benchmark_report.json"
                report_path.write_text(json.dumps({
                    "job_id": job.job_id,
                    "scenario_id": scenario.id,
                    "scenario_description": scenario.description,
                    "scenario_backend": scenario.backend,
                    "scenario_source": str(scenario.source) if scenario.source else None,
                    "reference_data": scenario.reference_data,
                    "reference_options": scenario.reference_options,
                }, indent=2))
                upload_assets_on_fail(report_path)

        with track_phase(phase="run-job"):
            # TODO: monitor timing and progress
            # TODO: separate "job started" and run phases?
            max_minutes = request.config.getoption("--maximum-job-time-in-minutes")
            if max_minutes:
                def _timeout_handler(signum, frame):
                    raise TimeoutError(
                        f"Batch job {job.job_id} exceeded maximum allowed time of {max_minutes} minutes"
                    )

                old_handler = signal.signal(signal.SIGALRM, _timeout_handler)
                signal.alarm(max_minutes * 60)
            try:
>               job.start_and_wait()

tests/test_benchmarks.py:96:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <BatchJob job_id='j-2603110312024b9b898c00da6adfae34'>

    def start_and_wait(
        self,
        *,
        print=print,
        max_poll_interval: float = DEFAULT_JOB_STATUS_POLL_INTERVAL_MAX,
        connection_retry_interval: float = DEFAULT_JOB_STATUS_POLL_CONNECTION_RETRY_INTERVAL,
        soft_error_max: int = DEFAULT_JOB_STATUS_POLL_SOFT_ERROR_MAX,
        show_error_logs: bool = True,
        require_success: bool = True,
    ) -> BatchJob:
        """
        Start the batch job, poll its status and wait till it finishes (or fails)

        :param print: print/logging function to show progress/status
        :param max_poll_interval: maximum number of seconds to sleep between job status polls
        :param connection_retry_interval: how long to wait when status poll failed due to connection issue
        :param soft_error_max: maximum number of soft errors (e.g. temporary connection glitches) to allow
        :param show_error_logs: whether to automatically print error logs when the batch job failed.
        :param require_success: whether to raise an exception if the job did not finish successfully.

        :return: Handle to the job created at the backend.

        .. versionchanged:: 0.37.0
            Added argument ``show_error_logs``.

        .. versionchanged:: 0.42.0
            All arguments must be specified as keyword arguments,
            to eliminate the risk of positional mix-ups between heterogeneous arguments and flags.

        .. versionchanged:: 0.42.0
            Added argument ``require_success``.
        """
        # TODO rename `connection_retry_interval` to something more generic?
        start_time = time.time()

        def elapsed() -> str:
            return str(datetime.timedelta(seconds=time.time() - start_time)).rsplit(".")[0]

        def print_status(msg: str):
            print("{t} Job {i!r}: {m}".format(t=elapsed(), i=self.job_id, m=msg))

        # TODO: make `max_poll_interval`, `connection_retry_interval` class constants or instance properties?
        print_status("send 'start'")
        self.start()

        # TODO: also add  `wait` method so you can track a job that already has started explicitly
        #   or just rename this method to `wait` and automatically do start if not started yet?

        # Start with fast polling.
        poll_interval = min(5, max_poll_interval)
        status = None
        _soft_error_count = 0

        def soft_error(message: str):
            """Non breaking error (unless we had too much of them)"""
            nonlocal _soft_error_count
            _soft_error_count += 1
            if _soft_error_count > soft_error_max:
                raise OpenEoClientException("Excessive soft errors")
            print_status(message)
            time.sleep(connection_retry_interval)

        while True:
            # TODO: also allow a hard time limit on this infinite poll loop?
            try:
                job_info = self.describe()
            except requests.ConnectionError as e:
                soft_error("Connection error while polling job status: {e}".format(e=e))
                continue
            except OpenEoApiPlainError as e:
                if e.http_status_code in [HTTP_502_BAD_GATEWAY, HTTP_503_SERVICE_UNAVAILABLE]:
                    soft_error("Service availability error while polling job status: {e}".format(e=e))
                    continue
                else:
                    raise

            status = job_info.get("status", "N/A")

            progress = job_info.get("progress")
            if isinstance(progress, int):
                progress = f"{progress:d}%"
            elif isinstance(progress, float):
                progress = f"{progress:.1f}%"
            else:
                progress = "N/A"
            print_status(f"{status} (progress {progress})")
            if status not in ('submitted', 'created', 'queued', 'running'):
                break

            # Sleep for next poll (and adaptively make polling less frequent)
            time.sleep(poll_interval)
            poll_interval = min(1.25 * poll_interval, max_poll_interval)

        if require_success and status != "finished":
            # TODO: render logs jupyter-aware in a notebook context?
            if show_error_logs:
                print(f"Your batch job {self.job_id!r} failed. Error logs:")
                print(self.logs(level=logging.ERROR))
                print(
                    f"Full logs can be inspected in an openEO (web) editor or with `connection.job({self.job_id!r}).logs()`."
                )
>           raise JobFailedException(
                f"Batch job {self.job_id!r} didn't finish successfully. Status: {status} (after {elapsed()}).",
                job=self,
            )
E           openeo.rest.JobFailedException: Batch job 'j-2603110312024b9b898c00da6adfae34' didn't finish successfully. Status: error (after 0:10:57).

/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/openeo/rest/job.py:382: JobFailedException
----------------------------- Captured stdout call -----------------------------
0:00:00 Job 'j-2603110312024b9b898c00da6adfae34': send 'start'
0:00:16 Job 'j-2603110312024b9b898c00da6adfae34': created (progress 0%)
0:00:22 Job 'j-2603110312024b9b898c00da6adfae34': created (progress 0%)
0:00:28 Job 'j-2603110312024b9b898c00da6adfae34': created (progress 0%)
0:00:36 Job 'j-2603110312024b9b898c00da6adfae34': created (progress 0%)
0:00:46 Job 'j-2603110312024b9b898c00da6adfae34': created (progress 0%)
0:01:01 Job 'j-2603110312024b9b898c00da6adfae34': created (progress 0%)
0:01:17 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:01:36 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:02:00 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:02:30 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:03:08 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:03:55 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:04:53 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:05:53 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:06:54 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:07:54 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:08:54 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:09:55 Job 'j-2603110312024b9b898c00da6adfae34': running (progress N/A)
0:10:55 Job 'j-2603110312024b9b898c00da6adfae34': error (progress N/A)
Your batch job 'j-2603110312024b9b898c00da6adfae34' failed. Error logs:
[{'id': '[1773198933444, 151270]', 'time': '2026-03-11T03:15:33.444Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#71,Executor task launch worker for task 4.0 in stage 47.0 (TID 8274),5,main]'}, {'id': '[1773198935440, 800781]', 'time': '2026-03-11T03:15:35.440Z', 'level': 'error', 'message': 'Lost executor 14 on 10.42.90.239: \nThe executor with id 14 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260310-547\n\t container state: terminated\n\t container started at: 2026-03-11T03:14:15Z\n\t container finished at: 2026-03-11T03:15:33Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1773198969997, 573858]', 'time': '2026-03-11T03:16:09.997Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#60,Executor task launch worker for task 4.1 in stage 47.0 (TID 8278),5,main]'}, {'id': '[1773198971478, 964870]', 'time': '2026-03-11T03:16:11.478Z', 'level': 'error', 'message': 'Lost executor 2 on 10.42.247.147: \nThe executor with id 2 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260310-547\n\t container state: terminated\n\t container started at: 2026-03-11T03:12:31Z\n\t container finished at: 2026-03-11T03:16:10Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1773198972595, 350167]', 'time': '2026-03-11T03:16:12.595Z', 'level': 'error', 'message': 'Exception while beginning fetch of 14 outstanding blocks'}, {'id': '[1773198972603, 329465]', 'time': '2026-03-11T03:16:12.603Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972605, 991441]', 'time': '2026-03-11T03:16:12.605Z', 'level': 'error', 'message': 
'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972606, 260427]', 'time': '2026-03-11T03:16:12.606Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972607, 57457]', 'time': '2026-03-11T03:16:12.607Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972608, 437825]', 'time': '2026-03-11T03:16:12.608Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972609, 840842]', 'time': '2026-03-11T03:16:12.609Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972610, 108419]', 'time': '2026-03-11T03:16:12.610Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972610, 383807]', 'time': '2026-03-11T03:16:12.610Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972611, 373797]', 'time': '2026-03-11T03:16:12.611Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972612, 631100]', 'time': '2026-03-11T03:16:12.612Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972613, 754646]', 'time': '2026-03-11T03:16:12.613Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972614, 83183]', 'time': '2026-03-11T03:16:12.614Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972615, 206281]', 'time': '2026-03-11T03:16:12.615Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972615, 868079]', 'time': '2026-03-11T03:16:12.615Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.147:46551'}, {'id': '[1773198972632, 758243]', 'time': '2026-03-11T03:16:12.632Z', 'level': 'error', 'message': 'Stage error: 
org.apache.spark.shuffle.FetchFailedException\n\tat org.apache.spark.errors.SparkCoreErrors$.fetchFailedError(SparkCoreErrors.scala:439)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1253)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:983)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:87)\n\tat org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)\n\tat scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:594)\n\tat scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:608)\n\tat scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)\n\tat org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)\n\tat org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)\n\tat org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:156)\n\tat org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)\n\tat org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:145)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: org.apache.spark.ExecutorDeadException: [INTERNAL_ERROR_NETWORK] The relative remote executor(Id: 2), which maintains the block data to fetch is dead. SQLSTATE: XX000\n\tat org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:146)\n\tat org.apache.spark.network.shuffle.RetryingBlockTransferor.transferAllOutstanding(RetryingBlockTransferor.java:181)\n\tat org.apache.spark.network.shuffle.RetryingBlockTransferor.start(RetryingBlockTransferor.java:160)\n\tat org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:157)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:376)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.send$1(ShuffleBlockFetcherIterator.scala:1223)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:1215)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:721)\n\tat org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:195)\n\tat org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:73)\n\t... 
18 more\n'}, {'id': '[1773199004878, 268485]', 'time': '2026-03-11T03:16:44.878Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#163,Executor task launch worker for task 4.2 in stage 47.0 (TID 8279),5,main]'}, {'id': '[1773199006521, 234880]', 'time': '2026-03-11T03:16:46.521Z', 'level': 'error', 'message': 'Lost executor 1 on 10.42.247.180: \nThe executor with id 1 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260310-547\n\t container state: terminated\n\t container started at: 2026-03-11T03:12:28Z\n\t container finished at: 2026-03-11T03:16:45Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1773199044630, 636718]', 'time': '2026-03-11T03:17:24.630Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#119,Executor task launch worker for task 1.0 in stage 47.1 (TID 8329),5,main]'}, {'id': '[1773199045563, 501069]', 'time': '2026-03-11T03:17:25.563Z', 'level': 'error', 'message': 'Lost executor 13 on 10.42.5.190: \nThe executor with id 13 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260310-547\n\t container state: terminated\n\t container started at: 2026-03-11T03:14:16Z\n\t container finished at: 2026-03-11T03:17:25Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1773199101513, 702434]', 'time': '2026-03-11T03:18:21.513Z', 'level': 'error', 'message': 'Missing an output location for shuffle 23 partition 4'}, {'id': '[1773199101523, 544939]', 'time': '2026-03-11T03:18:21.523Z', 'level': 'error', 'message': 'Stage error: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 23 partition 4\n\tat 
org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1770)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1715)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1714)\n\tat scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619)\n\tat scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1306)\n\tat org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1714)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1348)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1310)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:135)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:67)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:61)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n'}, {'id': '[1773199173398, 887130]', 'time': '2026-03-11T03:19:33.398Z', 'level': 'error', 'message': 'Exception while beginning fetch of 12 outstanding blocks'}, {'id': '[1773199173399, 295258]', 'time': '2026-03-11T03:19:33.399Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173400, 855344]', 'time': '2026-03-11T03:19:33.400Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173400, 968413]', 'time': '2026-03-11T03:19:33.400Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173401, 384269]', 'time': '2026-03-11T03:19:33.401Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173401, 579787]', 'time': '2026-03-11T03:19:33.401Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173402, 842193]', 'time': '2026-03-11T03:19:33.402Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173403, 44712]', 'time': '2026-03-11T03:19:33.403Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173403, 215768]', 'time': '2026-03-11T03:19:33.403Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173403, 348431]', 'time': 
'2026-03-11T03:19:33.403Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173404, 132889]', 'time': '2026-03-11T03:19:33.404Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173404, 619369]', 'time': '2026-03-11T03:19:33.404Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199173405, 551699]', 'time': '2026-03-11T03:19:33.405Z', 'level': 'error', 'message': 'Failed to get block(s) from 10.42.247.180:42913'}, {'id': '[1773199185824, 711253]', 'time': '2026-03-11T03:19:45.824Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#170,Executor task launch worker for task 1.0 in stage 47.2 (TID 8370),5,main]'}, {'id': '[1773199188623, 104529]', 'time': '2026-03-11T03:19:48.623Z', 'level': 'error', 'message': 'Lost executor 6 on 10.42.19.197: \nThe executor with id 6 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260310-547\n\t container state: terminated\n\t container started at: 2026-03-11T03:13:18Z\n\t container finished at: 2026-03-11T03:19:46Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1773199189341, 740836]', 'time': '2026-03-11T03:19:49.341Z', 'level': 'error', 'message': 'Missing an output location for shuffle 23 partition 0'}, {'id': '[1773199189353, 545817]', 'time': '2026-03-11T03:19:49.353Z', 'level': 'error', 'message': 'Stage error: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 23 partition 0\n\tat org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1770)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1715)\n\tat 
org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1714)\n\tat scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619)\n\tat scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1306)\n\tat org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1714)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1348)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1310)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:135)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:67)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:61)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n'}, {'id': '[1773199247090, 161810]', 'time': '2026-03-11T03:20:47.090Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#168,Executor task launch worker for task 1.1 in stage 47.2 (TID 8372),5,main]'}, {'id': '[1773199248678, 286106]', 'time': '2026-03-11T03:20:48.678Z', 'level': 'error', 'message': 'Lost executor 3 on 10.42.199.95: \nThe executor with id 3 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260310-547\n\t container state: terminated\n\t container started at: 2026-03-11T03:12:47Z\n\t container finished at: 2026-03-11T03:20:47Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1773199272058, 667194]', 'time': '2026-03-11T03:21:12.058Z', 'level': 'error', 'message': 'Uncaught exception in thread Thread[#157,Executor task launch worker for task 1.0 in stage 47.3 (TID 8406),5,main]'}, {'id': '[1773199273709, 518902]', 'time': '2026-03-11T03:21:13.709Z', 'level': 'error', 'message': 'Lost executor 11 on 10.42.149.207: \nThe executor with id 11 exited with exit code 52(JVM OOM).\n\n\n\nThe API gave the following container statuses:\n\n\n\t container name: spark-kubernetes-executor\n\t container image: registry.internal/prod/openeo-geotrellis-kube-python311:20260310-547\n\t container state: terminated\n\t container started at: 2026-03-11T03:14:16Z\n\t container finished at: 2026-03-11T03:21:12Z\n\t exit code: 52\n\t termination reason: Error\n      '}, {'id': '[1773199276524, 5539]', 'time': '2026-03-11T03:21:16.524Z', 'level': 'error', 
'message': 'Missing an output location for shuffle 23 partition 4'}, {'id': '[1773199276537, 336099]', 'time': '2026-03-11T03:21:16.537Z', 'level': 'error', 'message': 'Stage error: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 23 partition 4\n\tat org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1770)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1715)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1714)\n\tat scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619)\n\tat scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1306)\n\tat org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1714)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1348)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1310)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:135)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:67)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:61)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n'}, {'id': '[1773199276543, 290007]', 'time': '2026-03-11T03:21:16.543Z', 'level': 'error', 'message': 'Stage error: Job aborted due to stage failure: ShuffleMapStage 47 (load_collection: read by input product) has failed the maximum allowable number of times: 4. 
Most recent failure reason:\norg.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 23 partition 4\n\tat org.apache.spark.MapOutputTracker$.validateStatus(MapOutputTracker.scala:1770)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11(MapOutputTracker.scala:1715)\n\tat org.apache.spark.MapOutputTracker$.$anonfun$convertMapStatuses$11$adapted(MapOutputTracker.scala:1714)\n\tat scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619)\n\tat scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)\n\tat scala.collection.AbstractIterator.foreach(Iterator.scala:1306)\n\tat org.apache.spark.MapOutputTracker$.convertMapStatuses(MapOutputTracker.scala:1714)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorIdImpl(MapOutputTracker.scala:1348)\n\tat org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:1310)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:135)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader(ShuffleManager.scala:67)\n\tat org.apache.spark.shuffle.ShuffleManager.getReader$(ShuffleManager.scala:61)\n\tat org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:73)\n\tat org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:338)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:107)\n\tat org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:147)\n\tat 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:647)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:650)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n'}, {'id': '[1773199277144, 17292]', 'time': '2026-03-11T03:21:17.144Z', 'level': 'error', 'message': '[Container in shutdown] Uncaught exception in thread Thread[#161,SIGPWR handler,9,system]'}, {'id': '[1773199277723, 437545]', 'time': '2026-03-11T03:21:17.723Z', 'level': 'error', 'message': 'OpenEO batch job failed: A part of your process graph failed multiple times. Simply try submitting again, or use batch job logs to find more detailed information in case of persistent failures. Increasing executor memory may help if the root cause is not clear from the logs.'}]
Full logs can be inspected in an openEO (web) editor or with `connection.job('j-2603110312024b9b898c00da6adfae34').logs()`.
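Since the full logs can be pulled with `connection.job(...).logs()`, a small helper makes it easier to isolate the error-level entries that matter here. This is a hedged sketch: it assumes log entries are dicts shaped like the ones quoted in this report (keys `id`, `time`, `level`, `message`); the helper name `error_logs` is our own, not part of the openeo client API.

```python
# Sketch: narrow a batch-job log listing down to error-level entries.
# Entry shape (id/time/level/message) matches the log dicts shown in
# this report; with the openeo Python client, the input would typically
# come from connection.job("j-2603110312024b9b898c00da6adfae34").logs().

def error_logs(log_entries):
    """Keep only entries whose level is 'error'."""
    return [e for e in log_entries if e.get("level") == "error"]

# Example with entries shaped like the ones in this report:
sample = [
    {"id": "1", "level": "info", "message": "Connecting"},
    {"id": "2", "level": "error", "message": "JVM OOM"},
]
print([e["message"] for e in error_logs(sample)])  # -> ['JVM OOM']
```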
------------------------------ Captured log call -------------------------------
INFO     conftest:conftest.py:145 Connecting to 'openeo.dataspace.copernicus.eu'
INFO     openeo.config:config.py:193 Loaded openEO client config from sources: []
INFO     conftest:conftest.py:158 Checking for auth_env_var='OPENEO_AUTH_CLIENT_CREDENTIALS_CDSEFED' to drive auth against url='openeo.dataspace.copernicus.eu'.
INFO     conftest:conftest.py:162 Extracted provider_id='CDSE' client_id='openeo-apex-benchmarks-service-account' from auth_env_var='OPENEO_AUTH_CLIENT_CREDENTIALS_CDSEFED'
INFO     openeo.rest.connection:connection.py:302 Found OIDC providers: ['CDSE']
INFO     openeo.rest.auth.oidc:oidc.py:404 Doing 'client_credentials' token request 'https://identity.dataspace.copernicus.eu/auth/realms/CDSE/protocol/openid-connect/token' with post data fields ['grant_type', 'client_id', 'client_secret', 'scope'] (client_id 'openeo-apex-benchmarks-service-account')
INFO     openeo.rest.connection:connection.py:401 Obtained tokens: ['access_token', 'id_token']
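The final error message suggests increasing executor memory, which matches the repeated "exit code 52 (JVM OOM)" executor losses above. A hedged sketch of the relevant job options follows: the option names are the convention used by VITO-based openEO backends such as CDSE, not part of the openEO spec, so verify them against the backend documentation before relying on them.

```python
# Hedged sketch: backend-specific job options that may address the
# JVM OOM executor failures seen in the logs above. Option names are
# an assumption based on VITO/CDSE backend conventions.
job_options = {
    "executor-memory": "4G",          # JVM heap per Spark executor
    "executor-memoryOverhead": "2G",  # off-heap/Python memory per executor
}

# With the openeo Python client, these would typically be passed when
# (re)submitting the job, e.g. cube.create_job(job_options=job_options)
# -- shown here only as an assumption about the client API.
print(job_options["executor-memory"])  # -> 4G
```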
