
Conversation

@bbrowning bbrowning commented Oct 27, 2025

Purpose

This adds a common test suite for tool call parsers and wires all of the existing tool call parsers that had no tests into the common suite. It doesn't yet adapt existing tool call parser tests to fit into the common suite nor augment tool call parsers that already had tests with the new set of common tests. Those tasks can come later, as this PR is already quite large.

Not all of the existing tool call parsers can pass every test in the common test suite. The ones that do not pass today are marked as xfail; each of those represents a gap to identify and fix until every parser reaches zero expected failures within the common suite.

Given how many tests are here, the default_tokenizer fixture used by tests when tokenizing strings was also adjusted to be module-scoped, so an identical tokenizer is no longer instantiated for every individual test function. That keeps test execution fast.

I used Claude Code to help me write the example model outputs for every added test and to help write the initial version of the common set of tests based on existing patterns in our other tool call parser tests.

Test Plan

Run all the newly added tool parser tests:

pytest \
  tests/entrypoints/openai/tool_parsers/test_deepseekv3_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_granite_20b_fc_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_granite_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_internlm2_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_longcat_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_phi4mini_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_qwen3xml_tool_parser.py \
  tests/entrypoints/openai/tool_parsers/test_step3_tool_parser.py

Test Result

tests/entrypoints/openai/tool_parsers/test_deepseekv3_tool_parser.py ...............x.                                                                                                                                                                                                   [ 12%]
tests/entrypoints/openai/tool_parsers/test_granite_20b_fc_tool_parser.py ..........x......                                                                                                                                                                                               [ 24%]
tests/entrypoints/openai/tool_parsers/test_granite_tool_parser.py ..........xx..x.x....                                                                                                                                                                                                  [ 39%]
tests/entrypoints/openai/tool_parsers/test_internlm2_tool_parser.py ..x.x.x.x.x.x..xx                                                                                                                                                                                                    [ 51%]
tests/entrypoints/openai/tool_parsers/test_longcat_tool_parser.py ..............x..                                                                                                                                                                                                      [ 63%]
tests/entrypoints/openai/tool_parsers/test_phi4mini_tool_parser.py x.x.x.xxx.x.x..xx                                                                                                                                                                                                     [ 75%]
tests/entrypoints/openai/tool_parsers/test_qwen3xml_tool_parser.py ..x.x.x.x.x.x.x.x                                                                                                                                                                                                     [ 87%]
tests/entrypoints/openai/tool_parsers/test_step3_tool_parser.py ...xxx.x.x.x.x..x                                                                                                                                                                                                        [100%]

===================== 99 passed, 41 xfailed, 2 warnings in 9.91s =====================

Each of those xfailed tests is a bug in one of the tool call parsers that we'll want to track down. The expected failures are marked as strict, so a test that unexpectedly passes is reported as a failure; that keeps the list of expected failures in sync with the real state of things.
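The strict-xfail pattern looks roughly like this; the parser stub and test below are invented for illustration and are not the PR's actual tests.

```python
import pytest

def parse_tool_call(text: str):
    # Hypothetical buggy parser behavior that the xfail documents:
    # it drops the text surrounding the tool call.
    return None

@pytest.mark.xfail(strict=True, reason="parser drops text around the tool call")
def test_surrounding_text():
    # With strict=True, an unexpected pass (XPASS) is reported as a
    # failure, so the expected-failure list cannot silently go stale.
    assert parse_tool_call("hello <tool/>") == "hello"
```

If the parser bug is later fixed, this test starts passing and pytest flags it, prompting removal of the xfail marker.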

Signed-off-by: Ben Browning <bbrownin@redhat.com>
@bbrowning bbrowning force-pushed the 20251006-tool-parser-tests branch from bc9a4d0 to 3dfa4d8 Compare October 27, 2025 18:58
@mergify mergify bot added deepseek Related to DeepSeek models qwen Related to Qwen models tool-calling labels Oct 27, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable common test suite for tool call parsers, which will greatly improve testing consistency and help identify gaps in different parser implementations. The structure with a configuration dataclass and a test mixin is well-designed. My review focuses on strengthening some of the new common tests to make them more robust and comprehensive. Specifically, I've suggested improvements to test_various_data_types to validate parsed values and to test_streaming_reconstruction for a more complete comparison between streaming and non-streaming outputs.
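The "configuration dataclass plus test mixin" shape the review describes could look something like the following sketch. The class and field names here are guesses for illustration, not the PR's actual API.

```python
from dataclasses import dataclass

@dataclass
class ToolParserTestConfig:
    parser_name: str
    single_call_output: str
    supports_parallel_calls: bool = True

class CommonToolParserTests:
    # The mixin holds the shared test logic; each parser's test module
    # subclasses it and supplies its own config.
    config: ToolParserTestConfig

    def test_single_tool_call(self):
        assert self.config.single_call_output.strip() != ""

class TestExampleParser(CommonToolParserTests):
    config = ToolParserTestConfig(
        parser_name="example_parser",
        single_call_output='<tool_call>{"name": "get_weather"}</tool_call>',
    )
```

Each parser then pays only the cost of describing its output format; the assertions live in one place.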

@DarkLight1337

cc @aarnphm @chaunceyjiang

@bbrowning

I had tests for the mistral tool parser as part of this as well locally, but decided to wait on adding tests for that parser until #19425 lands since that PR adds an initial mistral tool parser test suite and I didn't want to cause a rebase headache there since that other PR is already quite large.

@bbrowning

I created #27661 to track the overall arc I'm working towards here for broader context as to why I'm adding a common test suite and expanding the tests across all parsers. To briefly recap, these tests serve double duty of identifying existing bugs across parsers and de-risking a future refactor of tool call parsers by ensuring we have comprehensive test coverage.

This tightens up the data type checking in the common tool call parser
test suite to ensure parsers are not only parsing various data types of
function arguments, but also that they are parsed into the expected
Python type.

The XML-based parsers do not support parsing into any data type but
string, so a flag was added to control this stricter behavior; tool
call parsers that cannot parse different data types into their
non-string native types are excluded from this checking.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
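A flag like the one this commit describes could be sketched as below. The helper name and signature are assumptions for illustration, not the PR's actual code.

```python
import json

def arguments_match(raw_args: str, expected: dict, strict_types: bool) -> bool:
    """Check parsed function arguments against expected values."""
    parsed = json.loads(raw_args)
    for key, value in expected.items():
        candidate = parsed.get(key)
        if strict_types:
            # Strict mode: value and native Python type must both match.
            if candidate != value or type(candidate) is not type(value):
                return False
        # Lenient mode (e.g. string-only XML parsers): compare as strings.
        elif str(candidate) != str(value):
            return False
    return True
```

For example, `arguments_match('{"n": "5"}', {"n": 5}, strict_types=False)` accepts the string form, while strict mode rejects it because `"5"` is not an `int`.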
@bbrowning

The pre-commit check failed here due to #27811. I confirmed the ruff failures there were unrelated to this change.

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 30, 2025
@bbrowning

Looks like this CI run failed with the same flake I previously reported as #27576.

@bbrowning

Adding a note to myself and any future reviewers that if #27747 lands before this PR, this PR needs to update the location of the new tests it's adding to align with the test reorganization in #27747.
