Description
When running very large jobs via Spark, we sometimes see the ephemeral volumes of the executors fill up endlessly. This exhausts the node's ephemeral storage, which then has consequences for other pods on that node because they run out of ephemeral storage as well. We had some Trino downtime because of this.
The volume causing it in the executor looks like the following; it seems to be some kind of spill volume.
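For reference (the original snippet is not reproduced here), and assuming the default Spark-on-Kubernetes behaviour when `spark.local.dir` is unset, the executor's scratch space is typically a size-unbounded `emptyDir` volume, which draws from the node's ephemeral storage. A hypothetical excerpt from an executor pod spec:

```yaml
# Hypothetical excerpt; volume name and mount path follow the defaults
# Spark on Kubernetes generates for executor scratch space.
volumes:
  - name: spark-local-dir-1
    emptyDir: {}                # no sizeLimit, so spill can grow until the node disk is full
containers:
  - name: spark-kubernetes-executor
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /var/data/spark-<uuid>   # shuffle/spill scratch directory
```

Because the `emptyDir` has no `sizeLimit`, heavy spilling is bounded only by the node's disk.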
A second issue arising out of this is a second driver being spawned in a way I can't explain yet: in rare cases, the Spark application suddenly has two running drivers. I'll do my best to provide further details on that.
Question
Is there a way to configure ephemeral storage limits (and possibly other properties) globally for all Spark applications the operator applies? It's acceptable for us to fail a single application, but it's a big problem if any application can kill nodes by starting to spill to the node's filesystem.
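One operator-agnostic way to enforce this globally (a sketch, not a confirmed feature of this operator) is a Kubernetes `LimitRange` in the namespace where the applications run: it applies a default ephemeral-storage limit to every container that does not set one, so a runaway executor gets evicted instead of filling the node. The namespace name and sizes below are placeholders:

```yaml
# Hypothetical LimitRange; adapt namespace and sizes to your cluster.
apiVersion: v1
kind: LimitRange
metadata:
  name: spark-ephemeral-storage
  namespace: spark-jobs          # namespace where the operator schedules pods
spec:
  limits:
    - type: Container
      defaultRequest:
        ephemeral-storage: 1Gi   # request applied when a container sets none
      default:
        ephemeral-storage: 10Gi  # limit applied when a container sets none
```

With this in place, the kubelet evicts a pod whose containers exceed the ephemeral-storage limit, which fails that one application but protects the rest of the node.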