-
-
Notifications
You must be signed in to change notification settings - Fork 11.4k
[CI/Build] Add common tool call parser test suite #27599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This adds a common test suite for tool call parsers and wires all of the existing tool call parsers that had no tests into the common suite. It doesn't yet adapt existing tool call parser tests to fit into the common suite nor augment tool call parsers that already had tests with the new set of common tests. Those tasks can come later, as this PR is already quite large. Not all of the existing tool call parsers can pass every test in the common test suite. The ones that are not passing today are marked as xfail, and those represent opportunities to identify and fix gaps in all of these tool call parsers in the future until we get down to zero expected to fail tests within the common suite for each parser. Given how many tests are here, the default_tokenizer fixture used by tests when tokenizing strings was also adjusted to be module-scoped, so we don't create a new version of that for every single test. Signed-off-by: Ben Browning <bbrownin@redhat.com>
bc9a4d0 to
3dfa4d8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable common test suite for tool call parsers, which will greatly improve testing consistency and help identify gaps in different parser implementations. The structure with a configuration dataclass and a test mixin is well-designed. My review focuses on strengthening some of the new common tests to make them more robust and comprehensive. Specifically, I've suggested improvements to test_various_data_types to validate parsed values and to test_streaming_reconstruction for a more complete comparison between streaming and non-streaming outputs.
|
I had tests for the mistral tool parser as part of this as well locally, but decided to wait on adding tests for that parser until #19425 lands since that PR adds an initial mistral tool parser test suite and I didn't want to cause a rebase headache there since that other PR is already quite large. |
|
I created #27661 to track the overall arc I'm working towards here for broader context as to why I'm adding a common test suite and expanding the tests across all parsers. To briefly recap, these tests serve double duty of identifying existing bugs across parsers and de-risking a future refactor of tool call parsers by ensuring we have comprehensive test coverage. |
This tightens up the data type checking in the common tool call parser test suite to ensure parsers are not only parsing various data types of function arguments, but also that they are parsed into the expected Python type. The XML-based parsers do not support parsing into any data type but string, so there's flag added to control this stricter behavior so that tool call parsers that cannot deal with parsing different data types into their non-string native types are excluded from this checking. Signed-off-by: Ben Browning <bbrownin@redhat.com>
|
The precommit failed here due to #27811 . I confirmed the ruff failures there were unrelated to this change. |
|
Looks like this CI run failed with the same flake I previously reported as #27576 . |
|
Adding a note to myself and any future reviewers that if #27747 lands before this PR, this PR needs to update the location of the new tests it's adding to align with the test reorganization in 27747. |
Purpose
This adds a common test suite for tool call parsers and wires all of the existing tool call parsers that had no tests into the common suite. It doesn't yet adapt existing tool call parser tests to fit into the common suite nor augment tool call parsers that already had tests with the new set of common tests. Those tasks can come later, as this PR is already quite large.
Not all of the existing tool call parsers can pass every test in the common test suite. The ones that are not passing today are marked as xfail, and those represent opportunities to identify and fix gaps in all of these tool call parsers in the future until we get down to zero expected to fail tests within the common suite for each parser.
Given how many tests are here, the default_tokenizer fixture used by tests when tokenizing strings was also adjusted to be module-scoped, so we don't create a new version of that for every single test. That keeps test execution fast, and avoids the need to instantiate a new identical tokenizer for every individual test function.
I used Claude Code to help me write the example model outputs for every added test and to help write the initial version of the common set of tests based on existing patterns in our other tool call parser tests.
Test Plan
Run all the newly added tool parser tests:
Test Result
Each of those
xfailedtests is a bug in one of the tool call parsers we'll want to track down. The expected failures are marked asstrict, so that the test will fail if one of them unexpectedly passes so that we can keep the list of expected failures accurate with the real state of things.