-
Notifications
You must be signed in to change notification settings - Fork 15
feat: allow passing multiple functions to function builders #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
ACTION NEEDED Substrait follows the Conventional Commits The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
it's a breaking change, but easy enough to change even if someone already actually depends on this stuff. |
|
I'm wondering whether we should have a definition on the Substrait spec level on how to handle function extension merging / prioritization / resolution. @benbellick recently introduced a change in substrait-java/isthmus that uses a priority order for extension YAML files in order to resolve duplicate function signatures across files but only when mapping to/from Calcite. I'm not sure how all the other Substrait implementations handle this. I assume that it would be good to have some consistency on this aspect. What do you think? |
|
as far as I understand, prioritization is an issue only if you try to look for functions by name only, right? The way ExtensionRegistry in substait-python is implemented right now, it always asks for uri/urn and name to locate the function, so there is not really a place for duplicates as long as extension files themselves are valid and urns are in fact unique. The design proposed in this PR expects input to look something like this -> |
|
right, these are different yet similar approaches. since Isthmus may start from a SQL string it might only have the name and it would need to find the right function across a set of YAML file by identifying the function with the matching function signature. Then if you had multiple functions with the same signature the YAML file priority comes in. In substrait-python you identify the function by URN/URI and function name and you try to allow defining which ones to pick from. Which also allows one to define priority on the function level since I guess it would pick the first one that matches the data types. |
f2a4550 to
890f84b
Compare
|
This is back on the market after incorporating urn changes. The api for single function now looks like this -> |
|
@nielspardon I felt that this change in substrait-java was necessary specifically because it is indeed valid substrait to have multiple functions with the same implementation but different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment
| def scalar_function( | ||
| urn: str, | ||
| function: str, | ||
| function: Union[str, Iterable[str]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tokoko I am open minded to this approach, but I would strongly prefer not introducing string parsing if it is possible. Instead, maybe the API could be something like:
| function: Union[str, Iterable[str]], | |
| function: Union[ExtensionID, Iterable[ExtensionID]], |
where
@dataclass
class ExtensionID:
urn: str
function: strWhat do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also be open minded to just having a separate function for the individual and the list case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of parsing strings either. I simply wanted not to clutter the api too much. How about using NamedTuple instead of a dataclass. It would allow the user to pass plain tuples as well.
from typing import NamedTuple, Union, Iterable
class ExtensionFunction(NamedTuple):
urn: str
function: str
def process_func(func: Union[ExtensionFunction, Iterable[ExtensionFunction]]):
functions = [func] if isinstance(func[0], str) else func
for f in functions:
urn, name = f
print(urn)
print(name)
process_func(ExtensionFunction("sample_urn", "sample_func"))
process_func(("sample_urn", "sample_func"))
process_func([("sample_urn", "sample_func1"), ("sample_urn", "sample_func2")])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am okay with that approach as an API for allowing multiple parameters, though I still have my hesitations about the PR as a whole as expressed in the below comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon further thought, I am not sure I see the value of this. IIUC, what is happening is we find the first function with matching implementation in the list and use that one for extended_expression construction?
To be this seems to be a lot less explicit than the current implementation.
You write:
while the user often doesn't care which one is used as long as the one with correct input types can be located.
Is that really true in general? The point of having different functions is to have different semantics after all, and I think the user should be expected to be as clear as possible which implementation they intend on using, no?
If we wanted to offer a facility for looking up the first matching function by name and signature irrespective of uri/urn, then I could imagine something like that in the ExtensionRegistry.
If you feel strongly that this feature would be useful, then to me it makes more sense to exist as a separate helper function, rather than the canonical builder function. What do you think?
The immediate use case for this is sql conversion.
I don't disagree that it's a "helper", but to be honest I sort of treated builders as "helpers" rather than strictly canonical in the first place. For example there are ways to provide columns by name (rather than an index) to a projection which is hardly canonical for substrait. I'm not trying to use that as an argument, though... maybe we should in fact have a clearer distinction between canonical and helper builders. |
changes builder api for all functions so one can easily pass multiple function refs to the builder.