Add file_extension, enable_legacy_filename fields to BlobType#7009

Open
ddl-rliu wants to merge 2 commits into flyteorg:master from ddl-rliu:rliu.DOM-75010.file-ext-copilot
Conversation

@ddl-rliu ddl-rliu commented Mar 10, 2026

Tracking issue

Closes #7024

Why are the changes needed?

We'll assume the behavior described in the tracking issue (file extensions are missing when copilot downloads Blob/FlyteFile inputs) is not a bug: it is long-standing, and a "bugfix" would likely break existing workflows. Instead, this PR proposes enhancements to FlyteFile/Blob that support writing workflow inputs with a file extension. The changes span flyteidl and flytecopilot.

The enhancements make blob writing configurable. Today flytecopilot writes blobs during the download phase without a file extension (e.g. "inputs/data"); with this change the extension can be included (e.g. "inputs/data.csv").

What changes were proposed in this pull request?

[flyteidl]

Add a new file_extension string field to the BlobType protobuf message, allowing FlyteFile to optionally specify a file extension (e.g. "csv") that flytecopilot appends when writing blobs during the download phase (e.g. "data.csv"). When empty (the default), behavior is unchanged.

Add a new enable_legacy_filename bool field to the BlobType protobuf message, allowing FlyteFile to optionally specify whether to preserve backward compatibility for tasks that read from the extensionless path.

Regenerated all protobuf bindings (Go, Python, JS, Rust, ES, Swagger).

[flytecopilot]

The copilot download phase infers the desired download behavior(s) from the input interface.
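
As a rough illustration of that inference, the sketch below maps the two new BlobType fields to the list of local filenames written for one input. It is written in Python for brevity (the actual change is in Go, in flytecopilot/data/download.go), and `resolve_filenames` is a hypothetical name, not a function from this PR. The ordering mirrors the varFilenames debug line in the test logs; ignoring enable_legacy_filename when file_extension is empty is an assumption.

```python
from typing import List


def resolve_filenames(var_name: str, file_extension: str = "",
                      enable_legacy_filename: bool = False) -> List[str]:
    """Return the local filenames to write for one blob input (hypothetical helper)."""
    if not file_extension:
        # Default: unchanged behavior, only the extensionless path is written.
        return [var_name]
    names = [f"{var_name}.{file_extension}"]
    if enable_legacy_filename:
        # Also keep the extensionless path for tasks that still read from it.
        names.append(var_name)
    return names


print(resolve_filenames("datasetA"))               # ['datasetA']
print(resolve_filenames("datasetB", "csv", True))  # ['datasetB.csv', 'datasetB']
```

Returning a list keeps the "write once or twice" decision in one place, so the download loop only has to iterate over the result.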

[flytekit] PR: flyteorg/flytekit#3406

Alternatives considered

The PR's approach configures the file download behaviors at the BlobType level (e.g. per FlyteFile). This has several pros (granularity, explicitness).

One con: unlike BlobType.format, which can be inferred from the output filename (e.g. "data.csv" -> format: "csv"), the new fields BlobType.file_extension and BlobType.enable_legacy_filename cannot be inferred from it (does "data.csv" imply file_extension: "csv"?). This introduces a minor inconsistency, which seems acceptable given the benefits. The concern may also be moot in practice, since copilot upload does not actually infer the format from the filename.

Here are the other approaches I considered:

1. New flyte-copilot CLI flag --file_extension_config ENUM=(disabled, enabled, legacy)

file_extension_config = disabled (default) - Same behavior as today (data: FlyteFile[csv] is written to outputs/data)
file_extension_config = enabled - New behavior (data: FlyteFile[csv] is written to outputs/data.csv)
file_extension_config = legacy - New behavior, but backwards compatible (data: FlyteFile[csv] is written to both outputs/data and outputs/data.csv)

Cons:

  • The download behavior is inherently per-file, but a CLI flag applies (more or less) per task. This makes the configuration less flexible and starts to overload copilot's CLI-level configuration.

2. No changes; user code is modified to read from the existing path (e.g. data) and add the extension itself

Cons:

  • Migrating large projects to Flyte becomes burdensome: every task's user code must be modified to adjust the file extensions on its inputs.
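
To make the cost of alternative 2 concrete, here is a minimal sketch of the extra step each task would need. `with_extension` is a hypothetical helper, and the sketch assumes the task is allowed to write next to the downloaded input; neither is part of this PR.

```python
import os
import shutil
import tempfile


def with_extension(path: str, extension: str) -> str:
    """Copy an extensionless input to `path.extension` and return the new path."""
    target = f"{path}.{extension}"
    if not os.path.exists(target):
        shutil.copyfile(path, target)
    return target


# Simulate an input that copilot wrote without an extension (e.g. inputs/data).
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "data")
with open(raw_path, "w") as f:
    f.write("a,b\n1,2\n")

csv_path = with_extension(raw_path, "csv")
print(os.path.basename(csv_path))  # data.csv
```

Every task that relies on the extension (for example, a library that picks a parser from the filename) would need a call like this before reading its inputs.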

How was this patch tested?

# Container task python code
# datasetB needs a type annotation (":"), not a default value ("=").
def t1(datasetA: FlyteFile[TypeVar('csv')], datasetB: Annotated[FlyteFile[TypeVar('csv')], FileDownloadConfig(file_extension="csv", enable_legacy_filename=True)]): …

# flyte-copilot logs
{"json":{},"level":"debug","msg":"inputInterface: [variables:{key:\"datasetA\"  value:{type:{blob:{format:\"csv\"}}  description:\"datasetA\"}}  variables:{key:\"datasetB\"  value:{type:{blob:{format:\"csv\"  file_extension:\"csv\"  enable_legacy_filename:true}}  description:\"datasetB\"}}]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"debug","msg":"varFilenames: map[datasetA:[datasetA] datasetB:[datasetB.csv datasetB]]","ts":"2026-03-10T23:20:55Z"}
...
{"json":{},"level":"info","msg":"Successfully copied [4119] bytes remote data from [s3://rliucp108882-flyte-data//a0c46ac7-eea3-427d-91c1-a3c6b4a27e83/datasetB] to local [/execution-vol/flows/workflow/inputs/datasetB]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"info","msg":"Successfully copied [4119] bytes remote data from [s3://rliucp108882-flyte-data//a0c46ac7-eea3-427d-91c1-a3c6b4a27e83/datasetB] to local [/execution-vol/flows/workflow/inputs/datasetB.csv]","ts":"2026-03-10T23:20:55Z"}
{"json":{},"level":"info","msg":"Successfully copied [3904] bytes remote data from [s3://rliucp108882-flyte-data//af859c71-811d-4c26-9845-b3fd8830ab6b/datasetA] to local [/execution-vol/flows/workflow/inputs/datasetA]","ts":"2026-03-10T23:20:55Z"}

Labels

Please add one or more of the following labels to categorize your PR:

  • added: For new features.
  • changed: For changes in existing functionality.
  • deprecated: For soon-to-be-removed features.
  • removed: For features being removed.
  • fixed: For any bug fixed.
  • security: In case of vulnerabilities

This is important to improve the readability of release notes.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 40.42553% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.91%. Comparing base (5c23907) to head (d687cd4).
⚠️ Report is 3 commits behind head on master.

Files with missing lines         Patch %   Lines
flytecopilot/data/download.go    30.76%    25 Missing and 2 partials ⚠️
flytecopilot/data/upload.go      50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7009      +/-   ##
==========================================
- Coverage   58.55%   56.91%   -1.65%     
==========================================
  Files         701      931     +230     
  Lines       41100    58220   +17120     
==========================================
+ Hits        24068    33138    +9070     
- Misses      14911    22039    +7128     
- Partials     2121     3043     +922     
Flag                       Coverage Δ
unittests-datacatalog      53.51% <ø> (ø)
unittests-flyteadmin       53.07% <ø> (?)
unittests-flytecopilot     42.06% <31.70%> (-1.00%) ⬇️
unittests-flytectl         64.02% <ø> (ø)
unittests-flyteidl         75.74% <100.00%> (+0.03%) ⬆️
unittests-flyteplugins     60.15% <ø> (ø)
unittests-flytepropeller   53.65% <ø> (ø)
unittests-flytestdlib      63.02% <ø> (-0.25%) ⬇️


var resultLit *core.Literal
var resultV interface{}
var err error
// TODO: Refactor handleLiteral to accept a list of file paths
@ddl-rliu ddl-rliu Mar 12, 2026

Assuming this PR's general approach is acceptable, I can do this refactor in this PR or in a follow-up PR. A further improvement would be to use symlinks when the output is written to two or more paths.

The refactor isn't needed for correctness; it's for clarity. The returned resultV and resultLit are identical across calls to handleLiteral that differ only in the value of varPath.
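
The symlink idea mentioned above could look roughly like the following sketch: fetch the blob once to the first path, then link the remaining paths to it. It is in Python for brevity (the copilot code is Go); `materialize` and `fake_download` are hypothetical names, and the sketch assumes all paths for one input live in the same directory.

```python
import os
import tempfile
from typing import Callable, List


def materialize(paths: List[str], download: Callable[[str], None]) -> None:
    """Download to paths[0] once, then symlink every other path to it."""
    if not paths:
        return
    primary, *aliases = paths
    download(primary)
    for alias in aliases:
        if os.path.lexists(alias):
            os.remove(alias)
        # Relative link target; valid because all paths share one directory.
        os.symlink(os.path.basename(primary), alias)


# Usage with a stand-in for the real remote copy.
inputs_dir = tempfile.mkdtemp()
calls: List[str] = []


def fake_download(dest: str) -> None:
    calls.append(dest)
    with open(dest, "w") as f:
        f.write("x,y\n")


materialize(
    [os.path.join(inputs_dir, "datasetB.csv"), os.path.join(inputs_dir, "datasetB")],
    fake_download,
)
print(len(calls))  # 1
```

With this shape the remote object is copied once, whereas the test logs in the description show datasetB being copied twice.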

Add a new `file_extension` string field to the BlobType protobuf message,
allowing FlyteFile to optionally specify a file extension (e.g. "csv") that
flytecopilot appends when writing blobs to local disk during the download
phase (e.g. "data.csv"). When empty (the default), behavior is unchanged.

Add a new `enable_legacy_filename` bool field to the BlobType protobuf message,
allowing FlyteFile to optionally specify whether to preserve backward
compatibility for tasks that read from the extensionless path.

Regenerated all protobuf bindings (Go, Python, JS, Rust, ES, Swagger).

Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
@ddl-rliu ddl-rliu force-pushed the rliu.DOM-75010.file-ext-copilot branch from 592e227 to 22ff989 on March 12, 2026 23:23
Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
@ddl-rliu ddl-rliu marked this pull request as ready for review March 12, 2026 23:27
Development

Successfully merging this pull request may close these issues.

[BUG] [copilot] File extensions are missing when copilot downloads Blob/FlyteFile inputs