
Conversation

@Shekharrajak (Member) commented Nov 17, 2025

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Shekharrajak Shekharrajak force-pushed the feat/kep-spark-client branch 2 times, most recently from 0eafa4c to b458571 Compare November 17, 2025 17:37
@coveralls commented Nov 17, 2025

Pull Request Test Coverage Report for Build 19438736784

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 66.596%

Totals (Coverage Status):
  • Change from base Build 19231750341: 0.0%
  • Covered Lines: 2506
  • Relevant Lines: 3763

💛 - Coveralls

Added reference link to issue kubeflow#107 for context.

Signed-off-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>
# Custom backend implementation
from kubeflow.spark.backends.base import SparkBackend

class CustomBackend(SparkBackend):

@Shekharrajak (Member Author):

Users can extend the backend if they need custom behavior, such as a different way to connect to the cluster or to submit Spark jobs.
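
For illustration, a minimal sketch of what such an extension might look like, assuming SparkBackend exposes submission and status hooks (the method names below are hypothetical, not the finalized interface):

```python
# Hypothetical sketch: method names are illustrative, not the actual SparkBackend contract.
from kubeflow.spark.backends.base import SparkBackend


class InHouseGatewayBackend(SparkBackend):
    """Submits Spark jobs through an internal REST endpoint instead of the Spark Operator."""

    def __init__(self, api_url: str, token: str):
        self.api_url = api_url
        self.token = token

    def submit_application(self, app_spec: dict) -> str:
        # Translate the generic application spec into the company-specific submission call
        # and return the job identifier reported by the gateway.
        raise NotImplementedError

    def get_job_status(self, job_id: str) -> str:
        # Poll the internal endpoint and map its states onto the SDK's status values.
        raise NotImplementedError
```

The concrete hook names would come from the abstract base class once it is finalized in this KEP.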

@andreyvelich (Member) left a comment:

Thanks for this great effort @Shekharrajak!
I left my initial thoughts.

### Goals

- Design a unified, Pythonic SDK for managing Spark applications on Kubernetes
- Support multiple backends (Kubernetes Operator, REST Gateway, Spark Connect) following the Trainer pattern

@andreyvelich (Member):

For Trainer, backends represent various job-submission mechanisms (local subprocess, container, and Kubernetes). I am not sure we can replicate that for Spark.

@Shekharrajak (Member Author):

Here we have job submission via the K8s Operator backend, the Spark Connect backend, and the Gateway backend (not implemented yet; we just have the abstract class).

@andreyvelich (Member):

I am wondering what the main motivation is for separating SessionClient() and BatchClient().
Alternatively, we could just have a unified SparkClient() that exposes both session and batch APIs:

submit_job() <-- creates a Spark Application and submits a batch job
connect() <-- creates a session and connects to an existing Spark cluster

@Shekharrajak (Member Author):

  • BatchSparkClient users never see connect() or create_session() methods.
  • SparkSessionClient users never see submit_application() or wait_for_job_status() methods.
  • This prevents runtime errors: you can't call wait_for_job_status() on a session object (see the sketch after this list).
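
A sketch of the split described above (class and method names follow this thread; the signatures are illustrative only):

```python
# Illustrative interfaces only; not the finalized SDK surface.
class BatchSparkClient:
    def submit_application(self, app_name: str, main_application_file: str, **kwargs): ...
    def wait_for_job_status(self, app_name: str, timeout_seconds: int = 600): ...
    def get_logs(self, app_name: str) -> str: ...
    # No connect() / create_session() here, so a type checker flags such calls before runtime.


class SparkSessionClient:
    def connect(self, connect_url: str): ...
    def create_session(self, name: str): ...
    # No submit_application() / wait_for_job_status() here.
```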

@Shekharrajak (Member Author):

This helps users see clearly which methods are available:

client = SparkClient()
# User sees BOTH batch AND session methods
client.submit_job(...)        # For batch
client.connect(...)           # For session
client.get_job(...)           # Works with connect() or batch()?

# Even if we take an arg in config:
client = SparkClient(mode="batch")
client.create_session(...)  # IDE will not flag this, but it fails at runtime

Comment on lines 67 to 74
Backend architecture diagram (columns summarized):

  • OperatorBackend (Spark Operator CRDs on K8s): Batch Jobs
  • GatewayBackend (REST Gateway): Batch Jobs
  • ConnectBackend (Spark Connect / Interactive): Sessions
  • LocalBackend (Future): Local Dev

@andreyvelich (Member) commented Nov 23, 2025:

Can you explain the reasoning behind creating various backends? Can we just have a single API, SparkClient().connect(), which creates a SparkConnect CR and connects to the existing cluster, as we discussed?

@Shekharrajak (Member Author) commented Nov 23, 2025:

The batch backend will have APIs like submit_application, wait_for_completion, get_logs, etc., where the user just submits the job and can then check the logs/results.
Example: https://github.com/kubeflow/sdk/pull/158/files#diff-e692a5819ee6b1dc00cba3b58e91f058c0022d3ca9aa6f3ee468126f245eef89R89

With the interactive session Spark client, the user will be able to run interactive SQL queries and DataFrame operations.
Example: https://github.com/kubeflow/sdk/pull/158/files#diff-a5011f48c9d6d16ff6ddd65588f65a7c78abf5fbeb121cccb693c0892ce3a5aeR275
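
Roughly, the two styles described above would look like this (a sketch; method names are taken from this thread and the linked examples, and the arguments are illustrative):

```python
# Batch: submit, wait, then read logs/results (method names from the comment above).
batch_client.submit_application(
    app_name="nightly-etl",                             # illustrative
    main_application_file="local:///opt/jobs/etl.py",   # illustrative
)
batch_client.wait_for_completion("nightly-etl")
print(batch_client.get_logs("nightly-etl"))

# Interactive session: run SQL and DataFrame operations against a live session.
with session_client.create_session("adhoc") as session:
    df = session.sql("SELECT country, count(*) AS n FROM events GROUP BY country")
    df.show()
```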

# Submit a Spark application
response = client.submit_application(
app_name="spark-pi",
main_application_file="local:///opt/spark/examples/src/main/python/pi.py",

@andreyvelich (Member):

I am wondering if there is a way to let users pass a function to SparkApplication, similar to the Trainer API: https://github.com/kubeflow/sdk?tab=readme-ov-file#run-your-first-pytorch-distributed-job

It might be interesting to explore how we could let users submit a SparkApplication without building an image.
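
One way this could look, borrowing the shape of TrainerClient().train() (everything below is hypothetical; neither the func= parameter nor the packaging behavior exists in this proposal yet):

```python
# Hypothetical: ship a Python function as the Spark driver so users can skip building an image.
def pi_estimate(spark):
    # Would run inside the driver; `spark` injected by the SDK.
    import random
    n = 1_000_000
    inside = (
        spark.sparkContext.parallelize(range(n))
        .filter(lambda _: random.random() ** 2 + random.random() ** 2 <= 1)
        .count()
    )
    print(4.0 * inside / n)

client.submit_application(
    app_name="pi-from-func",
    func=pi_estimate,            # hypothetical parameter, mirroring Trainer's func=
    base_image="spark:4.0.0",    # the SDK would package the function on top of a stock image
)
```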

Comment on lines 476 to 488
# Step 1: Interactive development with ConnectBackend
connect_config = ConnectBackendConfig(connect_url="sc://dev-cluster:15002")
dev_client = SparkClient(backend_config=connect_config)

with dev_client.create_session("dev") as session:
    # Test and validate query
    test_df = session.sql("SELECT * FROM data LIMIT 1000")
    test_df.show()
    # Iterate and refine...

# Step 2: Production batch job with OperatorBackend
prod_config = OperatorBackendConfig(namespace="production")
prod_client = SparkClient(backend_config=prod_config)

@andreyvelich (Member):

This is a pretty interesting experience for the dev -> prod Spark lifecycle.
cc @shravan-achar @akshaychitneni @bigsur0 to explore.

Comment on lines +826 to +830
trainer_client.train(
    name="train-model",
    func=train_func,
    num_nodes=4,
)

@andreyvelich (Member):

Suggested change:

- trainer_client.train(
-     name="train-model",
-     func=train_func,
-     num_nodes=4,
- )
+ trainer_client.train(
+     trainer=CustomTrainer(
+         func=train_func,
+         num_nodes=4
+     )
+ )

)
```

#### Integration with Pipelines

@andreyvelich (Member):

cc @kubeflow/wg-pipeline-leads to explore

