@tokoko
Contributor

@tokoko tokoko commented Oct 18, 2025

Changes the builder API for all functions so that one can easily pass multiple function refs to the builder.

@github-actions

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@tokoko
Contributor Author

tokoko commented Oct 18, 2025

  • the builder will try looking up each function in the registry and throw an exception if none of them matches. this is practical because substrait often splits logically related functions in separate extensions, for example ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"] while the user often doesn't care which one is used as long as the one with correct input types can be located.
  • this should also make transition to urns a bit simpler as uri arg will no longer be part of the api.

it's a breaking change, but easy enough to change even if someone already actually depends on this stuff.
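The lookup described in the bullets above can be sketched roughly as follows. This is only an illustration, not the code from this PR: `REGISTRY`, `resolve_function`, and the type names are hypothetical stand-ins, and the real ExtensionRegistry performs proper signature matching rather than a tuple comparison.

```python
from typing import Dict, Iterable, List, Tuple, Union

# Hypothetical stand-in for an extension registry: maps
# (extension, function name) to the argument-type tuples it accepts.
REGISTRY: Dict[Tuple[str, str], List[Tuple[str, ...]]] = {
    ("functions_arithmetic.yaml", "add"): [("i32", "i32"), ("fp64", "fp64")],
    ("functions_arithmetic_decimal.yaml", "add"): [("decimal", "decimal")],
}


def resolve_function(
    refs: Union[str, Iterable[str]], arg_types: Tuple[str, ...]
) -> Tuple[str, str]:
    """Return the first (extension, name) whose signature accepts arg_types."""
    # Accept either a single "extension:name" string or several of them.
    candidates = [refs] if isinstance(refs, str) else list(refs)
    for ref in candidates:
        # Assumes every ref has the form "extension:name".
        extension, name = ref.rsplit(":", 1)
        if arg_types in REGISTRY.get((extension, name), []):
            return extension, name
    # None of the candidates matched: fail loudly, as the comment describes.
    raise LookupError(f"no function in {candidates} accepts {arg_types}")


refs = ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"]
decimal_add = resolve_function(refs, ("decimal", "decimal"))
```

Because candidates are tried in order, the position of a ref in the list acts as its priority when more than one of them accepts the given types.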

@nielspardon
Member

I'm wondering whether we should have a definition on the Substrait spec level on how to handle function extension merging / prioritization / resolution. @benbellick recently introduced a change in substrait-java/isthmus that uses a priority order for extension YAML files in order to resolve duplicate function signatures across files but only when mapping to/from Calcite. I'm not sure how all the other Substrait implementations handle this. I assume that it would be good to have some consistency on this aspect. What do you think?

@tokoko
Contributor Author

tokoko commented Oct 28, 2025

as far as I understand, prioritization is an issue only if you try to look for functions by name only, right? The way ExtensionRegistry in substrait-python is implemented right now, it always asks for uri/urn and name to locate the function, so there is not really a place for duplicates as long as extension files themselves are valid and urns are in fact unique.

The design proposed in this PR expects input to look something like this -> ["functions_arithmetic.yaml:add", "functions_arithmetic_decimal.yaml:add"] (or urn equivalent once the other PR is merged). We can of course also have an option of searching by name only (add) and the prioritization will come into picture, but we currently don't have anything like that.

@nielspardon
Member

nielspardon commented Oct 28, 2025

right, these are different yet similar approaches.

since Isthmus may start from a SQL string it might only have the name and it would need to find the right function across a set of YAML files by identifying the function with the matching function signature. Then if you had multiple functions with the same signature the YAML file priority comes in.

In substrait-python you identify the function by URN/URI and function name and you try to allow defining which ones to pick from. Which also allows one to define priority on the function level since I guess it would pick the first one that matches the data types.
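For contrast, a name-only lookup with file-level priority, as described for Isthmus, might look something like the sketch below. This is not the substrait-java implementation: `EXTENSIONS`, `resolve_by_name`, and the file `custom_functions.yaml` are hypothetical and exist only to show where a priority order becomes necessary.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Signature = Tuple[str, Tuple[str, ...]]  # (function name, argument types)

# Hypothetical contents of two extension files that both declare add(i32, i32).
EXTENSIONS: Dict[str, List[Signature]] = {
    "custom_functions.yaml": [("add", ("i32", "i32"))],
    "functions_arithmetic.yaml": [("add", ("i32", "i32")), ("add", ("fp64", "fp64"))],
}


def resolve_by_name(
    name: str, arg_types: Tuple[str, ...], file_priority: Sequence[str]
) -> Optional[str]:
    """Return the highest-priority file that declares name(arg_types), if any."""
    for file_name in file_priority:
        if (name, arg_types) in EXTENSIONS.get(file_name, []):
            return file_name
    return None


# The same call resolves differently depending on the configured priority.
standard_first = resolve_by_name(
    "add", ("i32", "i32"), ["functions_arithmetic.yaml", "custom_functions.yaml"]
)
custom_first = resolve_by_name(
    "add", ("i32", "i32"), ["custom_functions.yaml", "functions_arithmetic.yaml"]
)
```

Here the priority is attached to the files, whereas in the list-of-refs approach it is attached to each individual function reference.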

@tokoko tokoko marked this pull request as draft October 30, 2025 15:00
@tokoko tokoko closed this Oct 30, 2025
@tokoko tokoko force-pushed the multiple-functions branch from f2a4550 to 890f84b Compare October 30, 2025 15:01
@tokoko tokoko reopened this Oct 30, 2025
@tokoko tokoko marked this pull request as ready for review October 30, 2025 20:15
@tokoko
Contributor Author

tokoko commented Oct 30, 2025

This is back on the market after incorporating urn changes. The api for single function now looks like this -> scalar_function("extension:io.substrait:functions_comparison:gte", expressions=[...]). Although the initial intent was simply to concatenate extension urn and function name, looking at it now.. it feels like the resulting string is almost like a function urn of sorts. In that case, would something like extension_function:io.substrait:functions_comparison:gte make more sense? just an idea, curious what you think @benbellick
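If the combined string really is the extension urn followed by the function name, taking it apart again could be as small as the sketch below. The helper name `split_function_ref` is hypothetical, and the sketch assumes that function names themselves never contain a colon.

```python
from typing import Tuple


def split_function_ref(ref: str) -> Tuple[str, str]:
    """Split "<extension urn>:<function name>" at the last colon."""
    urn, sep, name = ref.rpartition(":")
    if not sep or not urn or not name:
        raise ValueError(f"expected '<urn>:<function>', got {ref!r}")
    return urn, name


urn, name = split_function_ref("extension:io.substrait:functions_comparison:gte")
```

Splitting at the last colon keeps the urn intact even though it contains colons of its own.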

@benbellick
Member

@nielspardon I felt that this change in substrait-java was necessary specifically because it is indeed valid substrait to have multiple functions with the same implementation but different urns. However, the calcite conversion caused an expected exception to be thrown if that was the case. It seemed like a questionable API to have valid substrait YAML files inadvertently cause a crash in a specific usage of the library. That being said, I am in no way wed to the resolution strategy I used there. I honestly chose it because it was simplest to implement, not because it was "best".

Member

@benbellick benbellick left a comment

See comment

def scalar_function(
urn: str,
function: str,
function: Union[str, Iterable[str]],
Member

@tokoko I am open minded to this approach, but I would strongly prefer not introducing string parsing if it is possible. Instead, maybe the API could be something like:

Suggested change
- function: Union[str, Iterable[str]],
+ function: Union[ExtensionID, Iterable[ExtensionID]],

where

@dataclass
class ExtensionID:
  urn: str
  function: str

What do you think?

Member

I would also be open minded to just having a separate function for the individual and the list case.

Contributor Author

I'm not a fan of parsing strings either. I simply wanted not to clutter the api too much. How about using NamedTuple instead of a dataclass. It would allow the user to pass plain tuples as well.

from typing import NamedTuple, Union, Iterable

class ExtensionFunction(NamedTuple):
    urn: str
    function: str

def process_func(func: Union[ExtensionFunction, Iterable[ExtensionFunction]]):
    functions = [func] if isinstance(func[0], str) else func
    for f in functions:
        urn, name = f
        print(urn)
        print(name)

process_func(ExtensionFunction("sample_urn", "sample_func"))
process_func(("sample_urn", "sample_func"))
process_func([("sample_urn", "sample_func1"), ("sample_urn", "sample_func2")])
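One caveat with the `func[0]` check in the sketch above is that it raises on an empty list and on non-indexable iterables such as generators. A slightly more defensive normalization might look like this; `normalize_functions` and the `FunctionRef` alias are hypothetical names, not part of the PR.

```python
from typing import Iterable, List, NamedTuple, Tuple, Union


class ExtensionFunction(NamedTuple):
    urn: str
    function: str


FunctionRef = Union[ExtensionFunction, Tuple[str, str]]


def normalize_functions(
    func: Union[FunctionRef, Iterable[FunctionRef]]
) -> List[ExtensionFunction]:
    """Turn a single reference or an iterable of references into a list."""
    # A single reference is a 2-tuple made of two strings (urn, function).
    if (
        isinstance(func, tuple)
        and len(func) == 2
        and all(isinstance(part, str) for part in func)
    ):
        return [ExtensionFunction(*func)]
    # Anything else is treated as an iterable of references; this also
    # covers empty lists and generators, which cannot be indexed with [0].
    return [ExtensionFunction(*f) for f in func]


single = normalize_functions(("sample_urn", "sample_func"))
several = normalize_functions(
    f for f in [("sample_urn", "sample_func1"), ("sample_urn", "sample_func2")]
)
```

Plain tuples are still accepted, so callers keep the convenience that motivated using a NamedTuple in the first place.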

Member

I am okay with that approach as an API for allowing multiple parameters, though I still have my hesitations about the PR as a whole as expressed in the below comment.

@benbellick benbellick self-requested a review October 31, 2025 18:50
Member

@benbellick benbellick left a comment

Upon further thought, I am not sure I see the value of this. IIUC, what is happening is we find the first function with matching implementation in the list and use that one for extended_expression construction?

To me this seems to be a lot less explicit than the current implementation.

You write:

> while the user often doesn't care which one is used as long as the one with correct input types can be located.

Is that really true in general? The point of having different functions is to have different semantics after all, and I think the user should be expected to be as clear as possible which implementation they intend on using, no?

If we wanted to offer a facility for looking up the first matching function by name and signature irrespective of uri/urn, then I could imagine something like that in the ExtensionRegistry.

If you feel strongly that this feature would be useful, then to me it makes more sense to exist as a separate helper function, rather than the canonical builder function. What do you think?
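The split suggested here, an explicit canonical builder plus a separate convenience helper, could be shaped roughly like the toy sketch below. Both functions are simplified stand-ins rather than the actual substrait-python builders: they return only the chosen (urn, function) pair, and `matches` is a hypothetical predicate standing in for the registry's signature check.

```python
from typing import Callable, Iterable, Tuple

FunctionRef = Tuple[str, str]  # (urn, function name)
Matcher = Callable[[str, str], bool]


def scalar_function(urn: str, function: str, matches: Matcher) -> FunctionRef:
    """Canonical form: exactly one explicit reference, no searching."""
    if not matches(urn, function):
        raise LookupError(f"{urn}:{function} does not match the given inputs")
    return (urn, function)


def first_matching_scalar_function(
    refs: Iterable[FunctionRef], matches: Matcher
) -> FunctionRef:
    """Helper: try each reference in order and delegate to the canonical form."""
    for urn, function in refs:
        if matches(urn, function):
            return scalar_function(urn, function, matches)
    raise LookupError("none of the candidate functions match the given inputs")


# Toy predicate: pretend only the decimal variant fits the inputs.
known = {("urn_decimal", "add")}
chosen = first_matching_scalar_function(
    [("urn_arithmetic", "add"), ("urn_decimal", "add")],
    lambda urn, function: (urn, function) in known,
)
```

With this shape the canonical builder stays fully explicit, and the "pick whichever fits" behavior is opt-in.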

@tokoko
Copy link
Contributor Author

tokoko commented Oct 31, 2025

> Is that really true in general? The point of having different functions is to have different semantics after all, and I think the user should be expected to be as clear as possible which implementation they intend on using, no?

The immediate use case for this is sql conversion. The + operator from sql can map to either an add in functions_arithmetic or functions_arithmetic_decimal. I think using these sorts of arithmetic functions regardless of the types involved will be a pretty common occurrence. It's not just sql. If I remember correctly, the same was true in ibis to substrait conversion code as well. ibis didn't distinguish between them, while substrait did. My point is that this sort of "go through multiple functions and see which one fits" is bound to be implemented somewhere regardless.

> If you feel strongly that this feature would be useful, then to me it makes more sense to exist as a separate helper function, rather than the canonical builder function. What do you think?

I don't disagree that it's a "helper", but to be honest I sort of treated builders as "helpers" rather than strictly canonical in the first place. For example there are ways to provide columns by name (rather than an index) to a projection which is hardly canonical for substrait. I'm not trying to use that as an argument, though... maybe we should in fact have a clearer distinction between canonical and helper builders.
