feat: allow passing multiple functions to function builders #107

tokoko · 2025-10-18T12:52:30Z

changes builder api for all functions so one can easily pass multiple function refs to the builder.

github-actions · 2025-10-18T12:52:46Z

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

tokoko · 2025-10-18T12:59:28Z

the builder will try looking up each function in the registry and throw an exception if none of them matches. this is practical because substrait often splits logically related functions in separate extensions, for example ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"] while the user often doesn't care which one is used as long as the one with correct input types can be located.
this should also make transition to urns a bit simpler as uri arg will no longer be part of the api.

it's a breaking change, but easy enough to change even if someone already actually depends on this stuff.
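
The lookup described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TOY_REGISTRY`, `resolve_first_match`, and the signatures listed are all hypothetical stand-ins for the real ExtensionRegistry and extension files.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical registry: (extension file, function name) -> accepted signatures.
# Real extension files declare many more overloads; these are illustrative only.
ToyRegistry = Dict[Tuple[str, str], List[Tuple[str, ...]]]

TOY_REGISTRY: ToyRegistry = {
    ("functions_arithmetic.yaml", "add"): [("i32", "i32"), ("fp64", "fp64")],
    ("functions_arithmetic_decimal.yaml", "add"): [("decimal", "decimal")],
}


def resolve_first_match(
    candidates: Iterable[str],
    arg_types: Tuple[str, ...],
    registry: ToyRegistry = TOY_REGISTRY,
) -> Tuple[str, str]:
    """Return the first candidate whose signatures include arg_types."""
    tried = []
    for ref in candidates:
        # Split on the last ':' so the extension part may itself contain colons.
        extension, _, name = ref.rpartition(":")
        tried.append(ref)
        if arg_types in registry.get((extension, name), []):
            return extension, name
    # No candidate accepted the argument types: fail loudly.
    raise LookupError(f"no candidate in {tried} accepts {arg_types}")


match = resolve_first_match(
    ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"],
    ("decimal", "decimal"),
)
print(match)
```

Because candidates are tried in the order given, the caller's list order acts as a per-call priority.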

nielspardon · 2025-10-28T09:13:52Z

I'm wondering whether we should have a definition on the Substrait spec level on how to handle function extension merging / prioritization / resolution. @benbellick recently introduced a change in substrait-java/isthmus that uses a priority order for extension YAML files in order to resolve duplicate function signatures across files but only when mapping to/from Calcite. I'm not sure how all the other Substrait implementations handle this. I assume that it would be good to have some consistency on this aspect. What do you think?

tokoko · 2025-10-28T11:55:17Z

as far as I understand, prioritization is an issue only if you try to look for functions by name only, right? The way ExtensionRegistry in substrait-python is implemented right now, it always asks for uri/urn and name to locate the function, so there is not really a place for duplicates as long as extension files themselves are valid and urns are in fact unique.

The design proposed in this PR expects input to look something like this -> ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"] (or urn equivalent once the other PR is merged). We can of course also have an option of searching by name only (add), at which point prioritization would come into the picture, but we currently don't have anything like that.

nielspardon · 2025-10-28T13:22:07Z

right, these are different yet similar approaches.

since Isthmus may start from a SQL string it might only have the name and it would need to find the right function across a set of YAML files by identifying the function with the matching function signature. Then if you had multiple functions with the same signature the YAML file priority comes in.

In substrait-python you identify the function by URN/URI and function name and you try to allow defining which ones to pick from. Which also allows one to define priority on the function level since I guess it would pick the first one that matches the data types.
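
To make the contrast concrete, a name-only lookup with file-level priority might look something like the sketch below. It is an assumption-laden toy, not the Isthmus implementation: `EXTENSIONS`, `resolve_by_name`, and the file `custom_arithmetic.yaml` are invented for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical data: extension file -> {function name -> accepted signatures}.
# "custom_arithmetic.yaml" is made up to show a duplicate signature.
EXTENSIONS: Dict[str, Dict[str, List[Tuple[str, ...]]]] = {
    "functions_arithmetic.yaml": {"add": [("i32", "i32")]},
    "custom_arithmetic.yaml": {"add": [("i32", "i32")]},
}


def resolve_by_name(
    name: str,
    arg_types: Tuple[str, ...],
    file_priority: Sequence[str],
) -> Optional[str]:
    """Return the highest-priority file declaring name with arg_types."""
    for file in file_priority:
        if arg_types in EXTENSIONS.get(file, {}).get(name, []):
            return file
    return None  # nothing matched in any file


# Both files declare add(i32, i32); the priority order breaks the tie.
print(resolve_by_name("add", ("i32", "i32"),
                      ["custom_arithmetic.yaml", "functions_arithmetic.yaml"]))
```

Here the priority is a property of the whole lookup, whereas passing an ordered list of candidates to a builder sets the priority per call.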

tokoko · 2025-10-30T20:34:37Z

This is back on the market after incorporating urn changes. The api for single function now looks like this -> scalar_function("extension:io.substrait:functions_comparison:gte", expressions=[...]). Although the initial intent was simply to concatenate extension urn and function name, looking at it now.. it feels like the resulting string is almost like a function urn of sorts. In that case, would something like extension_function:io.substrait:functions_comparison:gte make more sense? just an idea, curious what you think @benbellick
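
For reference, recovering the two parts from such a concatenated string could be done as in this sketch. `split_function_ref` is a hypothetical helper, and it assumes the function name itself never contains a colon; if the name part ever carried a colon (for example a signature suffix), this split would pick the wrong boundary.

```python
from typing import Tuple


def split_function_ref(ref: str) -> Tuple[str, str]:
    """Split '<extension urn>:<function name>' on the last colon."""
    urn, sep, name = ref.rpartition(":")
    # rpartition returns ('', '', ref) when there is no colon at all.
    if not sep or not urn or not name:
        raise ValueError(f"expected '<urn>:<function>', got {ref!r}")
    return urn, name


urn, name = split_function_ref("extension:io.substrait:functions_comparison:gte")
print(urn)
print(name)
```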

benbellick · 2025-10-31T18:35:31Z

@nielspardon I felt that this change in substrait-java was necessary specifically because it is indeed valid substrait to have multiple functions with the same implementation but different urns. However, the calcite conversion caused an unexpected exception to be thrown if that was the case. It seemed like a questionable API to have valid substrait YAML files inadvertently cause a crash in a specific usage of the library. That being said, I am in no way wed to the resolution strategy I used there. I honestly chose it because it was simplest to implement, not because it was "best".

benbellick

See comment

benbellick · 2025-10-31T18:40:10Z

src/substrait/builders/extended_expression.py

 def scalar_function(
-    urn: str,
-    function: str,
+    function: Union[str, Iterable[str]],


@tokoko I am open minded to this approach, but I would strongly prefer not introducing string parsing if it is possible. Instead, maybe the API could be something like:

Suggested change

-    function: Union[str, Iterable[str]],
+    function: Union[ExtensionID, Iterable[ExtensionID]],

where

    from dataclasses import dataclass

    @dataclass
    class ExtensionID:
        urn: str
        function: str

What do you think?

I would also be open minded to just having a separate function for the individual and the list case.

tokoko

I'm not a fan of parsing strings either. I simply wanted not to clutter the api too much. How about using NamedTuple instead of a dataclass? It would allow the user to pass plain tuples as well.

    from typing import Iterable, NamedTuple, Union


    class ExtensionFunction(NamedTuple):
        urn: str
        function: str


    def process_func(func: Union[ExtensionFunction, Iterable[ExtensionFunction]]):
        # A single (urn, function) pair has a str as its first element;
        # anything else is treated as a sequence of pairs.
        functions = [func] if isinstance(func[0], str) else func
        for f in functions:
            urn, name = f
            print(urn)
            print(name)


    process_func(ExtensionFunction("sample_urn", "sample_func"))
    process_func(("sample_urn", "sample_func"))
    process_func([("sample_urn", "sample_func1"), ("sample_urn", "sample_func2")])

benbellick

I am okay with that approach as an API for allowing multiple parameters, though I still have my hesitations about the PR as a whole as expressed in the below comment.

benbellick

Upon further thought, I am not sure I see the value of this. IIUC, what is happening is we find the first function with matching implementation in the list and use that one for extended_expression construction?

To me this seems to be a lot less explicit than the current implementation.

You write:

> while the user often doesn't care which one is used as long as the one with correct input types can be located.

Is that really true in general? The point of having different functions is to have different semantics after all, and I think the user should be expected to be as clear as possible which implementation they intend on using, no?

If we wanted to offer a facility for looking up the first matching function by name and signature irrespective of uri/urn, then I could imagine something like that in the ExtensionRegistry.

If you feel strongly that this feature would be useful, then to me it makes more sense to exist as a separate helper function, rather than the canonical builder function. What do you think?
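
One way to picture that split is sketched below: a strict builder that takes exactly one explicit function, plus an opt-in helper that tries several. Everything here is hypothetical (`scalar_function_strict`, `scalar_function_first_match`, the `SIGNATURES` table, and the dict it returns); it is not the substrait-python builder API, and the URNs only follow the shape used earlier in this thread.

```python
from typing import Dict, Iterable, List, Tuple

ARITH = "extension:io.substrait:functions_arithmetic"
DECIMAL = "extension:io.substrait:functions_arithmetic_decimal"

# Toy signature table standing in for a real extension registry.
SIGNATURES: Dict[Tuple[str, str], List[Tuple[str, ...]]] = {
    (ARITH, "add"): [("i32", "i32")],
    (DECIMAL, "add"): [("decimal", "decimal")],
}


def scalar_function_strict(urn: str, function: str,
                           arg_types: Tuple[str, ...]) -> dict:
    """Stand-in for a canonical builder: one explicit function, no fallback."""
    if arg_types not in SIGNATURES.get((urn, function), []):
        raise LookupError(f"{urn}:{function} does not accept {arg_types}")
    return {"urn": urn, "function": function, "arg_types": arg_types}


def scalar_function_first_match(candidates: Iterable[Tuple[str, str]],
                                arg_types: Tuple[str, ...]) -> dict:
    """Opt-in helper: try each candidate with the strict builder, in order."""
    errors = []
    for urn, function in candidates:
        try:
            return scalar_function_strict(urn, function, arg_types)
        except LookupError as exc:
            errors.append(str(exc))
    raise LookupError("; ".join(errors) or "no candidates given")


expr = scalar_function_first_match([(ARITH, "add"), (DECIMAL, "add")],
                                   ("decimal", "decimal"))
print(expr["urn"])
```

With this layout the canonical path stays explicit, and the "first match wins" behavior only applies where the caller asks for it.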

tokoko · 2025-10-31T19:39:30Z

> Is that really true in general? The point of having different functions is to have different semantics after all, and I think the user should be expected to be as clear as possible which implementation they intend on using, no?

The immediate use case for this is sql conversion. + from sql can map to either an add in functions_arithmetic or functions_arithmetic_decimal. I think using these sorts of arithmetic functions regardless of the types involved will be a pretty common occurrence. It's not just sql. If I remember correctly, the same was true in ibis to substrait conversion code as well. ibis didn't distinguish between them, while substrait did. My point is that this sort of "go through multiple functions and see which one fits" is bound to be implemented somewhere regardless.

> If you feel strongly that this feature would be useful, then to me it makes more sense to exist as a separate helper function, rather than the canonical builder function. What do you think?

I don't disagree that it's a "helper", but to be honest I sort of treated builders as "helpers" rather than strictly canonical in the first place. For example there are ways to provide columns by name (rather than an index) to a projection which is hardly canonical for substrait. I'm not trying to use that as an argument, though... maybe we should in fact have a clearer distinction between canonical and helper builders.

tokoko marked this pull request as draft October 30, 2025 15:00

tokoko closed this Oct 30, 2025

tokoko force-pushed the multiple-functions branch from f2a4550 to 890f84b Compare October 30, 2025 15:01

feat: allow passing multiple functions to function builders

4094d14

tokoko reopened this Oct 30, 2025

fix examples

5c3e317

tokoko marked this pull request as ready for review October 30, 2025 20:15

benbellick reviewed Oct 31, 2025

benbellick self-requested a review October 31, 2025 18:50

benbellick reviewed Oct 31, 2025


3 participants

tokoko commented Oct 18, 2025 •

edited

Loading

nielspardon commented Oct 28, 2025 •

edited

Loading