feat(docs): KEP- Spark Client for Kubeflow SDK #163
Conversation
Added reference link to issue kubeflow#107 for context. Signed-off-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>
```python
# Custom backend implementation
from kubeflow.spark.backends.base import SparkBackend


class CustomBackend(SparkBackend):
```
Users can extend the backend if they need specific changes or a different way to connect to the cluster or submit Spark jobs.
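A rough sketch of what such an extension could look like, assuming the SparkBackend base class ends up with abstract methods along the lines of submit_application() and get_job(); those names are placeholders taken from this discussion, not a finalized interface:

```python
# Hypothetical sketch: a custom backend that forwards submissions to a
# company-internal REST service. Method names are illustrative only.
from kubeflow.spark.backends.base import SparkBackend


class InternalGatewayBackend(SparkBackend):
    """Submits Spark applications through an internal gateway."""

    def __init__(self, gateway_url: str):
        self.gateway_url = gateway_url

    def submit_application(self, app_spec: dict) -> str:
        # Translate the SDK application spec into the gateway payload and
        # return the application ID the gateway assigns.
        raise NotImplementedError("call the internal gateway here")

    def get_job(self, name: str) -> dict:
        # Ask the gateway for the current status of the application.
        raise NotImplementedError("query the internal gateway here")
```

The user would then pass this backend (or its config) to the client the same way the built-in backends are selected.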
andreyvelich left a comment:
Thanks for this great effort @Shekharrajak!
I left my initial thoughts.
```markdown
### Goals

- Design a unified, Pythonic SDK for managing Spark applications on Kubernetes
- Support multiple backends (Kubernetes Operator, REST Gateway, Spark Connect) following the Trainer pattern
```
For Trainer, backends represent various job submission modes (local subprocess, container, and Kubernetes). I am not sure we can replicate that for Spark.
Here we have job submission using the K8s Operator backend, the Spark Connect backend, and the Gateway backend (not implemented yet; we only have the abstract class).
I am wondering what is the main motivation to separate SessionClient() and BatchClient()?
Alternatively, we can just have a unified SparkClient() which has both session and batch APIs:
submit_job() <-- to create a SparkApplication and submit a batch job
connect() <-- to create a session and connect to an existing Spark cluster
- BatchSparkClient users never see the connect() or create_session() methods.
- SparkSessionClient users never see the submit_application() or wait_for_job_status() methods.
- This prevents runtime errors: you can't call wait_for_job_status() on a session object.
This helps users see clearly which methods are available:

```python
client = SparkClient()

# User sees BOTH batch AND session methods
client.submit_job(...)   # For batch
client.connect(...)      # For session
client.get_job(...)      # Works with connect() or batch()?

# Even if we take an argument in the config:
client = SparkClient(mode="batch")
client.create_session(...)  # The IDE will not show an error, but it fails at runtime
```
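For contrast, a minimal sketch of the split-client alternative argued for in this thread; the class names and signatures below are assumptions drawn from the comments above, not the final API:

```python
# Hypothetical sketch of the split-client design: each client exposes only
# the methods that make sense for its mode, so misuse is surfaced by the
# IDE / type checker instead of at runtime.
from kubeflow.spark import BatchSparkClient, SparkSessionClient  # assumed import path

# Batch client: only batch methods exist on the class.
batch_client = BatchSparkClient(namespace="default")
batch_client.submit_application(
    app_name="spark-pi",
    main_application_file="local:///opt/spark/examples/src/main/python/pi.py",
)
batch_client.wait_for_job_status(app_name="spark-pi")
# batch_client.create_session("dev")  # AttributeError, and flagged by the IDE

# Session client: only session methods exist on the class.
session_client = SparkSessionClient(connect_url="sc://dev-cluster:15002")
with session_client.create_session("dev") as session:
    session.sql("SELECT 1").show()
# session_client.wait_for_job_status("spark-pi")  # AttributeError, and flagged by the IDE
```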
```
┌───────────┴─────────────┬──────────────────┬────────────────┐
         ▼                    ▼                    ▼                    ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ OperatorBackend  │ │ GatewayBackend   │ │ ConnectBackend   │ │ LocalBackend     │
│ (Spark Operator  │ │ (REST Gateway)   │ │ (Spark Connect/  │ │ (Future)         │
│  CRDs on K8s)    │ │                  │ │  Interactive)    │ │                  │
└──────────────────┘ └──────────────────┘ └──────────────────┘ └──────────────────┘
    Batch Jobs           Batch Jobs            Sessions             Local Dev
```
Can you explain the reason for creating various backends? Can we just have an API, SparkClient().connect(), which creates a SparkConnect CR and connects to the existing cluster as we discussed?
The batch backend will have APIs like submit_application, wait_for_completion, get_logs, ..., where the user just submits the job and can check the logs/results.
Example: https://github.com/kubeflow/sdk/pull/158/files#diff-e692a5819ee6b1dc00cba3b58e91f058c0022d3ca9aa6f3ee468126f245eef89R89
With the interactive-session Spark client, the user will be able to run interactive SQL queries and DataFrame operations.
Example: https://github.com/kubeflow/sdk/pull/158/files#diff-a5011f48c9d6d16ff6ddd65588f65a7c78abf5fbeb121cccb693c0892ce3a5aeR275
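For the interactive path, the assumption in this sketch is that the SDK session behaves like (or wraps) a Spark Connect SparkSession, so standard PySpark SQL and DataFrame calls apply; the cluster URL and column name are illustrative:

```python
# Plain PySpark Spark Connect usage (PySpark 3.4+); the SDK session is
# assumed to expose this same sql()/DataFrame surface.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://dev-cluster:15002").getOrCreate()

df = spark.sql("SELECT * FROM data LIMIT 1000")  # interactive SQL
df.groupBy("category").count().show()            # DataFrame operations on the same data
spark.stop()
```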
```python
# Submit a Spark application
response = client.submit_application(
    app_name="spark-pi",
    main_application_file="local:///opt/spark/examples/src/main/python/pi.py",
)
```
I am wondering if there is a way to allow users to pass a function to SparkApplication, similar to the Trainer API: https://github.com/kubeflow/sdk?tab=readme-ov-file#run-your-first-pytorch-distributed-job
It might be interesting to explore how we can allow submitting a SparkApplication without building an image.
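Purely as a thought experiment modeled on the Trainer's CustomTrainer pattern, submission of a plain Python function could look like the sketch below; submit_job(func=...), num_executors, and the import path are hypothetical and not part of this KEP yet:

```python
# Hypothetical sketch only: a function-based submission path analogous to
# TrainerClient().train(trainer=CustomTrainer(func=...)). Nothing here is a
# committed API; it illustrates what "no image build" could mean.
from kubeflow.spark import SparkClient  # assumed import path


def pi_estimate(spark):
    # Runs on the driver with an already-created SparkSession injected.
    from random import random

    n = 1_000_000
    inside = (
        spark.sparkContext.parallelize(range(n))
        .filter(lambda _: random() ** 2 + random() ** 2 <= 1)
        .count()
    )
    print(f"Pi is roughly {4.0 * inside / n}")


client = SparkClient()
client.submit_job(
    name="pi-estimate",
    func=pi_estimate,   # serialized and injected into a base Spark image
    num_executors=2,
)
```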
```python
# Step 1: Interactive development with ConnectBackend
connect_config = ConnectBackendConfig(connect_url="sc://dev-cluster:15002")
dev_client = SparkClient(backend_config=connect_config)

with dev_client.create_session("dev") as session:
    # Test and validate query
    test_df = session.sql("SELECT * FROM data LIMIT 1000")
    test_df.show()
    # Iterate and refine...

# Step 2: Production batch job with OperatorBackend
prod_config = OperatorBackendConfig(namespace="production")
prod_client = SparkClient(backend_config=prod_config)
```
This is a pretty interesting experience for the dev -> prod Spark lifecycle.
cc @shravan-achar @akshaychitneni @bigsur0 to explore.
```python
trainer_client.train(
    name="train-model",
    func=train_func,
    num_nodes=4,
)
```
Suggested change:

```python
trainer_client.train(
    trainer=CustomTrainer(
        func=train_func,
        num_nodes=4
    )
)
```
#### Integration with Pipelines
cc @kubeflow/wg-pipeline-leads to explore
Ref https://docs.google.com/document/d/1l57bBlpxrW4gLgAGnoq9Bg7Shre7Cglv4OLCox7ER_s/edit?tab=t.0
PR: #158