
Conversation

@sfc-gh-ajiang
Collaborator

No description provided.

Collaborator

@sfc-gh-dhung sfc-gh-dhung left a comment

Remember this is a public-facing sample, so please be sure the code quality is high. It's especially important for the code to be simple and readable, with self-documenting variable/function names and sufficient comments for non-experts to understand.

Comment on lines -147 to -149
# NOTE: Remove `target_instances=2` to run training on a single node
# See https://docs.snowflake.com/en/developer-guide/snowflake-ml/ml-jobs/distributed-ml-jobs
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
Collaborator

One of the main points of this sample is to demonstrate how easy it is to convert a local pipeline to pushing certain steps down into ML Jobs. Needing to write a separate script file which we submit_file() just for this conversion severely weakens this story. Why can't we just keep using a @remote() decorated function? @remote(...) should convert the function into an MLJobDefinition which we can directly use in pipeline_dag without needing an explicit MLJobDefinition.register() call
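Roughly the usage being asked for (hypothetical sketch only; this is the desired end state rather than an existing API, and the decorator arguments are copied from the snippet above):

```python
from snowflake.ml.jobs import remote  # import path assumed to match the sample

# COMPUTE_POOL and JOB_STAGE are the sample's existing constants.
# Hypothetical target state: the decorated function itself acts as the job definition.
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
def train_model(dataset_info: str) -> str:
    ...

# pipeline_dag could then wire train_model into the task graph directly, with no
# separate script file, submit_file() call, or explicit MLJobDefinition.register() step.
```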

Collaborator Author

Currently @remote does not create a job definition; it creates a job directly. We have only merged the PR for phase one, and phase 2 is still in review.

Collaborator

Let's hold off on merging this until @remote is ready then

Collaborator

Since the @remote change is now available, can we now call this as an ML Job directly from pipeline_dag?

Collaborator Author

I am a little confused here. Do you mean we create a job inside the task directly?

Comment on lines -147 to -149
# NOTE: Remove `target_instances=2` to run training on a single node
# See https://docs.snowflake.com/en/developer-guide/snowflake-ml/ml-jobs/distributed-ml-jobs
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
Collaborator

Since the @remote change is now available, can we now call this as an ML Job directly from pipeline_dag?

Comment on lines 145 to 190
```python
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, database=DB_NAME, schema=SCHEMA_NAME)
def train_model(dataset_info: Optional[str] = None) -> Optional[str]:
    '''
    ML Job to train a model on the training dataset and register it in the model registry.

def train_model(session: Session) -> str:
    """
    DAG task to train a machine learning model.
    This function trains an XGBoost classifier on the provided training data and registers it in the model registry.
    This function is executed remotely on Snowpark Container Services.

    Args:
        dataset_info (Optional[str]): JSON string containing serialized dataset information for training. If this function is called in a DAG task,
            this argument is passed from the previous DAG task, otherwise it is passed manually.

    Returns:
        Optional[str]: JSON string containing serialized model information for registration. If this function is called in a DAG task,
            this return value is passed to the next DAG task, otherwise it is returned as the ML Job result.
    '''
    session = Session.builder.getOrCreate()
    ctx = None
    config = None

    if dataset_info:
        dataset_info_dicts = json.loads(dataset_info)
    try:
        ctx = TaskContext(session)
        config = run_config.RunConfig.from_task_context(ctx)
        dataset_info_dicts = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
    except SnowparkSQLException:
        print("there is no predecessor return value, fallback to local mode")

    This function is executed as part of the DAG workflow to train a model using the prepared datasets.
    It retrieves dataset information from the previous task, trains the model, evaluates it on both
    training and test sets, and saves the model to a stage for later use.
    datasets = {
        key: DatasetInfo(**info_dict) for key, info_dict in dataset_info_dicts.items()
    }
    train_ds = load_dataset(
        session,
        datasets["full"].fully_qualified_name,
        datasets["full"].version,
    )
    model_obj = modeling.train_model(session, datasets["train"])
    train_metrics = modeling.evaluate_model(
        session, model_obj, train_ds.read.data_sources[0], prefix="train"
    )
    version = f"v{uuid.uuid4().hex}"
    mv = modeling.register_model(session, model_obj, config.model_name if config and config.model_name else "mortgage_model", version, train_ds, metrics={}) if config else modeling.register_model(session, model_obj, "mortgage_model", version, train_ds, metrics=train_metrics)
    if ctx and config:
        ctx.set_return_value(json.dumps({"model_name": mv.fully_qualified_model_name, "version_name": mv.version_name}))
    return json.dumps({"model_name": mv.fully_qualified_model_name, "version_name": mv.version_name})
```
Collaborator

What is the gap that means we still need this?

https://github.com/Snowflake-Labs/sf-samples/pull/250/files#r2695685818

Collaborator Author

I am confused here. I think we should use @remote, right?
Create a job definition -> integrate the definition into the task SDK?

Collaborator

Can we just have pipeline_dag and pipeline_local use the same function with no extra wrapping?

Collaborator Author

Currently, pipeline_dag and pipeline_local use the same function, pipeline_dag.train_model. What do you think?

Collaborator

Why do we need this as a separate file? Looks like it's only used in pipeline_dag currently

Collaborator Author

I was thinking that pipeline_local.py and pipeline_dag.py should focus on orchestration logic, like creating jobs or tasks. Since this class is more about handling task configuration, it might make sense to move it into a separate file for better separation of concerns.

For now, I’ve reverted the changes.


```bash
python src/pipeline_local.py
python src/pipeline_local.py --no-register  # Skip model registration for faster experimentation
```
Collaborator

why remove?

Collaborator Author

@sfc-gh-ajiang sfc-gh-ajiang Feb 6, 2026

That is because we always register the model, but we do not push it to production.
The reason I did it like this is that I got this error #250 (comment) when I saved the model to a file.

Collaborator

Sounds like there is a different bug then; saving the model to a file should definitely work. Please be sure to fix that.

Collaborator Author

I think the root cause is the memory synchronization gap between the head node and the worker nodes. The script runs only on the head node, which acts as a coordinator that sends training instructions to the worker nodes, but those workers keep the resulting model weights in their own local memory once training is complete. It seems like the Snowflake XGBEstimator is designed to be "lazy": the head node does not automatically pull those heavy weights into its own local process. Consequently, when the script immediately tries to evaluate or serialize the model, the head node looks at its own empty memory and crashes, because the "brain" of the model is still physically trapped on the separate worker nodes.

Do you have any suggestions on how to fix it? I am not sure if it is a bug or not.

Collaborator

Are you able to reproduce this behavior in Notebooks? Training then inferencing immediately with the same model is a very common use case, so you should be able to quickly validate that with a multi-node Notebook

Collaborator Author

I cannot even reproduce it in an ML Job. If the ML Job only trains the model and returns the model, everything looks good.

```python
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
def train_model(session: Session, input_data: DataSource) -> XGBClassifier:
def train_model(input_data: DataSource) -> Optional[str]:
```
Collaborator

Can you explain why we return a string in the README?

Collaborator Author

Sure, I will update it.

Comment on lines +201 to +203
@remote(COMPUTE_POOL, stage_name=JOB_STAGE, target_instances=2)
def train_model(input_data: DataSource) -> Optional[str]:
...
Collaborator

why repeat this?

Collaborator Author

Oops, forgot to delete it. Will delete it.


```python
mv = register_model(session, model, model_name, version, train_ds, metrics)
# get model version from train model
```
Collaborator

?

Collaborator Author

That is because we always register the model, but we do not push it to production.
The reason I did it like this is that I got this error #250 (comment) when I saved the model to a file.

from snowflake.snowpark import Session

import modeling
import pipeline_dag
Collaborator

pipeline_local should not have a dependency on pipeline_dag. Ideally the two don't know about each other; if necessary, then dag can depend on local.

Collaborator Author

Sure — I’ll update it. One thought is to add @remote to modeling.py. However, inside the job payload we rely on the run config, which would cause pipeline_dag.py to import modeling.py and modeling.py to import pipeline_dag.py. That introduces a circular import.

Collaborator Author

If I add @remote to pipeline_local.py and pipeline_local.py imports pipeline_dag.py to use the run config, then pipeline_dag.py needs to import pipeline_local.py to use the train_model function. That introduces a circular import.

What do you think about moving the run config out of pipeline_dag.py into a separate file?
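Something like this, for example (rough sketch of what the split-out module could look like; only model_name is confirmed anywhere in this thread, and the TaskContext import path is an assumption):

```python
# run_config.py -- hypothetical standalone module for per-run configuration.
from dataclasses import dataclass
from typing import Optional

from snowflake.core.task.context import TaskContext  # import path assumed


@dataclass
class RunConfig:
    model_name: Optional[str] = None

    @classmethod
    def from_task_context(cls, ctx: TaskContext) -> "RunConfig":
        # Read per-run settings from the task graph configuration (empty dict if none was set).
        graph_config = ctx.get_task_graph_config() or {}
        return cls(model_name=graph_config.get("model_name"))
```

Both pipeline files could then import RunConfig from run_config.py without needing to know about each other.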

Collaborator

@sfc-gh-dhung sfc-gh-dhung Feb 6, 2026

Sounds fine. It's also okay for both pipeline definitions to define their own @remote run_train_model functions, which just handle arguments then pass them to modeling.train_model. In this case, each @remote function should just have a few lines of code (maybe 3-4 at most); e.g. the local pipeline just accepts args and directly passes them to modeling.train_model, while the DAG pipeline reads from RunConfig before calling modeling.train_model
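Roughly this shape (illustrative sketch only; the decorator arguments are taken from the snippets in this thread, and the argument plumbing is simplified):

```python
from snowflake.ml.jobs import remote                 # import paths assumed to match the sample
from snowflake.snowpark import Session
from snowflake.core.task.context import TaskContext

import modeling
from run_config import RunConfig                     # assumes run_config.py is split out as discussed


# pipeline_local.py -- thin wrapper: accept arguments and pass them straight to modeling.train_model.
@remote(COMPUTE_POOL, stage_name=JOB_STAGE)
def run_train_model(dataset_info):
    session = Session.builder.getOrCreate()
    return modeling.train_model(session, dataset_info)


# pipeline_dag.py -- thin wrapper: resolve RunConfig from the task context, then call modeling.train_model.
@remote(COMPUTE_POOL, stage_name=JOB_STAGE)
def run_train_model():
    session = Session.builder.getOrCreate()
    config = RunConfig.from_task_context(TaskContext(session))
    return modeling.train_model(session, config)
```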

session = session_builder.getOrCreate()
modeling.ensure_environment(session)
pipeline_dag._ensure_environment(session)
cp.register_pickle_by_value(pipeline_dag)
Collaborator

why?

Collaborator Author

@sfc-gh-ajiang sfc-gh-ajiang Feb 6, 2026

Because it imports pipeline_dag.py to use train_model.

Collaborator

why does that mean we need to pickle it?

}
return json.dumps(dataset_info)

@remote(COMPUTE_POOL, stage_name=JOB_STAGE, database=DB_NAME, schema=SCHEMA_NAME)
Collaborator

multi-node?

Collaborator Author

Sure, will update it.

config = RunConfig.from_task_context(ctx)
dataset_info_dicts = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
except SnowparkSQLException:
print("there is no predecessor return value, fallback to local mode")
Collaborator

Make sure errors/warnings are meaningful to users who aren't already familiar with tasks and ML Jobs. In this case, "predecessor return value" and "local mode" are meaningless/unknown terms.
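For example, the fallback message could read something like this (wording is just a suggestion; the surrounding code is the excerpt quoted below):

```python
try:
    ctx = TaskContext(session)
    config = RunConfig.from_task_context(ctx)
    dataset_info_dicts = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
except SnowparkSQLException:
    # Not running as part of a task graph, so there is no upstream task output to read.
    print(
        "No output found from a preceding task; "
        "using the dataset_info argument passed to this function instead."
    )
```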

Collaborator Author

will update it

Comment on lines +197 to +222
```python
if dataset_info:
    dataset_info_dicts = json.loads(dataset_info)
try:
    ctx = TaskContext(session)
    config = RunConfig.from_task_context(ctx)
    dataset_info_dicts = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
except SnowparkSQLException:
    print("there is no predecessor return value, fallback to local mode")

datasets = {
    key: DatasetInfo(**info_dict) for key, info_dict in dataset_info_dicts.items()
}
train_ds = load_dataset(
    session,
    datasets["full"].fully_qualified_name,
    datasets["full"].version,
)
model_obj = modeling.train_model(session, datasets["train"])
train_metrics = modeling.evaluate_model(
    session, model_obj, train_ds.read.data_sources[0], prefix="train"
)
version = f"v{uuid.uuid4().hex}"
mv = modeling.register_model(session, model_obj, config.model_name if config and config.model_name else "mortgage_model", version, train_ds, metrics={}) if config else modeling.register_model(session, model_obj, "mortgage_model", version, train_ds, metrics=train_metrics)
if ctx and config:
    ctx.set_return_value(json.dumps({"model_name": mv.fully_qualified_model_name, "version_name": mv.version_name}))
return json.dumps({"model_name": mv.fully_qualified_model_name, "version_name": mv.version_name})
```
Collaborator

Add comments and whitespace for readability please. Remember this is a public sample/tutorial
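For instance, the first few lines of this block might read something like the following (illustrative only; the comments just describe the behavior already visible in the snippet above):

```python
# Figure out whether we were called directly or as a DAG task.
if dataset_info:
    # Called directly (e.g. from the local pipeline): dataset info is passed in as a JSON string.
    dataset_info_dicts = json.loads(dataset_info)
try:
    # Called as a DAG task: read the dataset info produced by the upstream PREPARE_DATA task.
    ctx = TaskContext(session)
    config = RunConfig.from_task_context(ctx)
    dataset_info_dicts = json.loads(ctx.get_predecessor_return_value("PREPARE_DATA"))
except SnowparkSQLException:
    print("there is no predecessor return value, fallback to local mode")
```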
