
Conversation

@bbrowning bbrowning commented Oct 27, 2025

Purpose

This adds a common test suite for tool call parsers and wires all of the existing tool call parsers that had no tests into the common suite. It doesn't yet adapt existing tool call parser tests to fit into the common suite nor augment tool call parsers that already had tests with the new set of common tests. Those tasks can come later, as this PR is already quite large.

Not all of the existing tool call parsers can pass every test in the common test suite. The ones that do not pass today are marked as xfail; each of those represents a gap to identify and fix until every parser reaches zero expected failures within the common suite.

Given how many tests are here, the default_tokenizer fixture used by tests when tokenizing strings was also adjusted to be module-scoped, so an identical tokenizer is no longer instantiated for every individual test function. That keeps test execution fast.

I used Claude Code to help me write the example model outputs for every added test and to help write the initial version of the common set of tests based on existing patterns in our other tool call parser tests.

Test Plan

Run all the newly added tool parser tests:

pytest \
  tests/entrypoints/openai/tool_parsers/test_deepseekv3_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_granite_20b_fc_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_granite_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_internlm2_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_longcat_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_phi4mini_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_qwen3xml_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_step3_tool_parser.py

Test Result

tests/entrypoints/openai/tool_parsers/test_deepseekv3_tool_parser.py ...............x.                                                                                                                                                                                                   [ 12%]
tests/entrypoints/openai/tool_parsers/test_granite_20b_fc_tool_parser.py ..........x......                                                                                                                                                                                               [ 24%]
tests/entrypoints/openai/tool_parsers/test_granite_tool_parser.py ..........xx..x.x....                                                                                                                                                                                                  [ 39%]
tests/entrypoints/openai/tool_parsers/test_internlm2_tool_parser.py ..x.x.x.x.x.x..xx                                                                                                                                                                                                    [ 51%]
tests/entrypoints/openai/tool_parsers/test_longcat_tool_parser.py ..............x..                                                                                                                                                                                                      [ 63%]
tests/entrypoints/openai/tool_parsers/test_phi4mini_tool_parser.py x.x.x.xxx.x.x..xx                                                                                                                                                                                                     [ 75%]
tests/entrypoints/openai/tool_parsers/test_qwen3xml_tool_parser.py ..x.x.x.x.x.x.x.x                                                                                                                                                                                                     [ 87%]
tests/entrypoints/openai/tool_parsers/test_step3_tool_parser.py ...xxx.x.x.x.x..x                                                                                                                                                                                                        [100%]

===================== 99 passed, 41 xfailed, 2 warnings in 9.91s =====================

Each of those xfailed tests is a bug in one of the tool call parsers that we'll want to track down. The expected failures are marked as strict, so a test that unexpectedly passes is reported as a failure; that keeps the list of expected failures in sync with the real state of things.
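The strict-xfail pattern looks roughly like this; the parser stub and test below are invented for illustration and are not the PR's actual tests.

```python
import pytest

def parse_tool_call(text: str):
    # Hypothetical buggy parser behavior that the xfail documents:
    # it drops the text surrounding the tool call.
    return None

@pytest.mark.xfail(strict=True, reason="parser drops text around the tool call")
def test_surrounding_text():
    # With strict=True, an unexpected pass (XPASS) is reported as a
    # failure, so the expected-failure list cannot silently go stale.
    assert parse_tool_call("hello <tool/>") == "hello"
```

If the parser bug is later fixed, this test starts passing and pytest flags it, prompting removal of the xfail marker.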

Signed-off-by: Ben Browning <bbrownin@redhat.com>
@bbrowning bbrowning force-pushed the 20251006-tool-parser-tests branch from bc9a4d0 to 3dfa4d8 Compare October 27, 2025 18:58
@mergify mergify bot added deepseek Related to DeepSeek models qwen Related to Qwen models tool-calling labels Oct 27, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable common test suite for tool call parsers, which will greatly improve testing consistency and help identify gaps in different parser implementations. The structure with a configuration dataclass and a test mixin is well-designed. My review focuses on strengthening some of the new common tests to make them more robust and comprehensive. Specifically, I've suggested improvements to test_various_data_types to validate parsed values and to test_streaming_reconstruction for a more complete comparison between streaming and non-streaming outputs.
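The "configuration dataclass plus test mixin" shape the review describes could look something like the following sketch. The class and field names here are guesses for illustration, not the PR's actual API.

```python
from dataclasses import dataclass

@dataclass
class ToolParserTestConfig:
    parser_name: str
    single_call_output: str
    supports_parallel_calls: bool = True

class CommonToolParserTests:
    # The mixin holds the shared test logic; each parser's test module
    # subclasses it and supplies its own config.
    config: ToolParserTestConfig

    def test_single_tool_call(self):
        assert self.config.single_call_output.strip() != ""

class TestExampleParser(CommonToolParserTests):
    config = ToolParserTestConfig(
        parser_name="example_parser",
        single_call_output='<tool_call>{"name": "get_weather"}</tool_call>',
    )
```

Each parser then pays only the cost of describing its output format; the assertions live in one place.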

@DarkLight1337

cc @aarnphm @chaunceyjiang

@bbrowning

I had tests for the mistral tool parser as part of this as well locally, but decided to wait on adding tests for that parser until #19425 lands since that PR adds an initial mistral tool parser test suite and I didn't want to cause a rebase headache there since that other PR is already quite large.

@bbrowning

I created #27661 to track the overall arc I'm working towards here for broader context as to why I'm adding a common test suite and expanding the tests across all parsers. To briefly recap, these tests serve double duty of identifying existing bugs across parsers and de-risking a future refactor of tool call parsers by ensuring we have comprehensive test coverage.

This tightens up the data type checking in the common tool call parser
test suite to ensure parsers are not only parsing various data types of
function arguments, but also that they are parsed into the expected
Python type.

The XML-based parsers do not support parsing into any data type but
string, so a flag was added to control this stricter behavior; tool
call parsers that cannot parse different data types into their
non-string native types are excluded from this checking.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
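A flag like the one this commit describes could be sketched as below. The helper name and signature are assumptions for illustration, not the PR's actual code.

```python
import json

def arguments_match(raw_args: str, expected: dict, strict_types: bool) -> bool:
    """Check parsed function arguments against expected values."""
    parsed = json.loads(raw_args)
    for key, value in expected.items():
        candidate = parsed.get(key)
        if strict_types:
            # Strict mode: value and native Python type must both match.
            if candidate != value or type(candidate) is not type(value):
                return False
        # Lenient mode (e.g. string-only XML parsers): compare as strings.
        elif str(candidate) != str(value):
            return False
    return True
```

For example, `arguments_match('{"n": "5"}', {"n": 5}, strict_types=False)` accepts the string form, while strict mode rejects it because `"5"` is not an `int`.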
@bbrowning

The pre-commit check failed here due to #27811. I confirmed the ruff failures there were unrelated to this change.

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 30, 2025
@bbrowning

Looks like this CI run failed with the same flake I previously reported as #27576.

@bbrowning

Adding a note to myself and any future reviewers that if #27747 lands before this PR, this PR needs to update the location of the new tests it's adding to align with the test reorganization in #27747.
