Add FMAPI tool calling contract tests for DatabricksOpenAI #348

Open
dhruv0811 wants to merge 4 commits into main from fmapi-tool-calling-contract-tests

@dhruv0811 dhruv0811 commented Feb 24, 2026

Summary

End-to-end FMAPI tool calling integration tests for DatabricksOpenAI and ChatDatabricks (LangGraph), mirroring user code patterns from app-templates. Both #269 (strict field) and #333 (empty assistant content) were caught by customers; these tests ensure those critical user journeys (CUJs) don't regress.

Also includes a bug fix for Gemini models on FMAPI (details below).

Tests

Models are dynamically discovered via workspace_client.serving_endpoints.list(), filtered to databricks-* + llm/v1/chat, then probed for tool calling support. Tests retry up to 3 times.
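
For readers who want a concrete picture of the discovery and retry steps, here is a minimal sketch. It is not the code in this PR: `Endpoint`, `filter_chat_endpoints`, and `retry` are hypothetical names, `Endpoint` is a stand-in for whatever `serving_endpoints.list()` returns (assumed here to expose `name` and `task`), "up to 3 times" is read as three attempts in total, and the tool-calling probe is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Endpoint:
    """Hypothetical stand-in for an item returned by serving_endpoints.list()."""

    name: str
    task: Optional[str] = None


def filter_chat_endpoints(endpoints: Iterable[Endpoint]) -> List[str]:
    """Keep the names of databricks-* endpoints whose task is llm/v1/chat."""
    return sorted(
        ep.name
        for ep in endpoints
        if ep.name.startswith("databricks-") and ep.task == "llm/v1/chat"
    )


def retry(fn: Callable[[], T], attempts: int = 3) -> T:
    """Call fn up to `attempts` times and re-raise the last failure."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: model output is flaky
            last_error = exc
    assert last_error is not None, "attempts must be at least 1"
    raise last_error
```

With the real SDK, the input would presumably be `workspace_client.serving_endpoints.list()` rather than hand-built `Endpoint` objects.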

OpenAI (Agents SDK + McpServer):

  • Single-turn, multi-turn, streaming via Runner.run / Runner.run_streamed

LangChain (LangGraph create_react_agent):

  • Single-turn, multi-turn, streaming via agent.invoke / agent.ainvoke / agent.stream / agent.astream

Gated behind RUN_FMAPI_TOOL_CALLING_TESTS=1.
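
A gate like this is usually a small environment check. The sketch below is illustrative only; `fmapi_tests_enabled` is a hypothetical helper, and treating any value other than exactly `"1"` as disabled is an assumption.

```python
import os
from typing import Mapping, Optional


def fmapi_tests_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when RUN_FMAPI_TOOL_CALLING_TESTS is exactly "1"."""
    env = os.environ if env is None else env
    return env.get("RUN_FMAPI_TOOL_CALLING_TESTS") == "1"


# In a pytest module this would typically feed a module-level marker:
# pytestmark = pytest.mark.skipif(
#     not fmapi_tests_enabled(),
#     reason="set RUN_FMAPI_TOOL_CALLING_TESTS=1 to run",
# )
```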

Bug fix: Gemini FMAPI tool calling compatibility

Gemini FMAPI doesn't conform to the OpenAI API spec in two ways during tool calling:

1. Request side — Rejects tool messages where content is a list of content blocks (e.g. [{"type": "text", "text": "hello"}]). The Agents SDK always produces this format when using MCP tools. Fix: _flatten_list_content_in_messages() flattens to a plain string before sending.

2. Response side (streaming) — Returns delta.content as a list instead of a string in streaming responses. The Agents SDK crashes with ValidationError: Input should be a valid string, input_type=list. Fix: _GeminiStreamWrapper / _AsyncGeminiStreamWrapper intercept stream chunks and flatten list content.
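
The two fixes above share one idea: collapse a list of text blocks into a plain string. The sketch below shows that idea on plain dicts. It is not the PR's `_flatten_list_content_in_messages()` or `_GeminiStreamWrapper`; the function names are hypothetical, chunks are modeled as dicts rather than SDK objects, and dropping non-text blocks is an assumption.

```python
from typing import Any, Dict, Iterable, Iterator, List


def flatten_content(content: Any) -> Any:
    """Join the text blocks of a list-valued content into one string.

    Non-list values (str, None) are returned unchanged. Blocks that are
    neither strings nor {"type": "text"} dicts are dropped (an assumption).
    """
    if not isinstance(content, list):
        return content
    parts: List[str] = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
    return "".join(parts)


def flatten_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Request side: return shallow copies with list content flattened."""
    flattened = []
    for message in messages:
        copy = dict(message)
        if "content" in copy:
            copy["content"] = flatten_content(copy["content"])
        flattened.append(copy)
    return flattened


def flatten_stream(chunks: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Response side: ensure choices[].delta.content is a string per chunk."""
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if isinstance(delta.get("content"), list):
                delta["content"] = flatten_content(delta["content"])
        yield chunk
```

The request-side helper copies each message so the caller's history is not mutated, while the stream helper edits chunks in place because they are consumed once.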

Reproduce request-side issue:

# Tool result with content as LIST — FAILS on Gemini FMAPI
curl -s -X POST "$HOST/serving-endpoints/databricks-gemini-2-5-flash/invocations" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"model":"databricks-gemini-2-5-flash","messages":[
    {"role":"user","content":"Echo hello"},
    {"role":"assistant","content":null,"tool_calls":[{"id":"echo","type":"function","function":{"name":"echo","arguments":"{\"msg\":\"hello\"}"}}]},
    {"role":"tool","tool_call_id":"echo","content":[{"type":"text","text":"hello"}]}
  ],"tools":[{"type":"function","function":{"name":"echo","description":"Echo","parameters":{"type":"object","properties":{"msg":{"type":"string"}},"required":["msg"]}}}],"max_tokens":100}'
# → 400: "Expecting 'content' to be a String"

# Same request with content as STRING — WORKS
# (change "content":[{"type":"text","text":"hello"}] to "content":"hello")

Note: Gemini 2.5 Pro + LangChain

Gemini 2.5 Pro is a reasoning model that consumes 200-600 reasoning tokens from the max_tokens budget before producing output. LangChain tests use max_tokens=1000 for this model (vs 200 for others) to accommodate the reasoning overhead.
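
One way to express this per-model budget is a small lookup, sketched below. The names are hypothetical, and the endpoint name `databricks-gemini-2-5-pro` is an assumption extrapolated from the `databricks-gemini-2-5-flash` name used in the curl example.

```python
DEFAULT_MAX_TOKENS = 200
REASONING_MAX_TOKENS = 1000

# Assumed endpoint name; adjust to whatever discovery actually returns.
REASONING_MODELS = {"databricks-gemini-2-5-pro"}


def max_tokens_for(model: str) -> int:
    """Give reasoning models extra headroom for their hidden reasoning tokens."""
    if model in REASONING_MODELS:
        return REASONING_MAX_TOKENS
    return DEFAULT_MAX_TOKENS
```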

Known model issues (skipped)

| Model | Issue |
| --- | --- |
| gpt-5-nano | Too small for reliable tool calling |
| gpt-oss-20b, gpt-oss-120b, llama-4-maverick | Hallucinates tool names |
| gemini-3-flash, gemini-3-pro, gemini-3-1-pro | Requires thought_signature (Gemini 3.x) |
| gemma-3-12b (LangChain only) | Outputs raw tool call text instead of executing tools |

Test plan

@dhruv0811 dhruv0811 force-pushed the fmapi-tool-calling-contract-tests branch 6 times, most recently from ab0acbd to 01579ee on February 24, 2026 23:56
@dhruv0811 dhruv0811 requested a review from bbqiu February 25, 2026 00:02

@bbqiu bbqiu left a comment

this looks great! left two comments

will there be a separate PR for the payloads that langgraph agents will generate?

@dhruv0811 dhruv0811 force-pushed the fmapi-tool-calling-contract-tests branch 4 times, most recently from c269d7d to 888f3e9 on February 26, 2026 20:59
@dhruv0811 dhruv0811 force-pushed the fmapi-tool-calling-contract-tests branch 15 times, most recently from aaf77fe to 80a7a4e on February 27, 2026 00:51
@dhruv0811 dhruv0811 force-pushed the fmapi-tool-calling-contract-tests branch 5 times, most recently from ab54296 to bc97991 on February 27, 2026 01:03
@dhruv0811 dhruv0811 requested a review from bbqiu February 27, 2026 03:43
@dhruv0811 dhruv0811 force-pushed the fmapi-tool-calling-contract-tests branch 8 times, most recently from be72c8c to 2c2cb38 on March 2, 2026 21:31
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dhruv0811 dhruv0811 force-pushed the fmapi-tool-calling-contract-tests branch from e768ccb to aa23d97 on March 2, 2026 21:42
dhruv0811 and others added 3 commits March 2, 2026 13:56
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move import logging and log = logging.getLogger(__name__) to module level
- Remove inline import logging from retry functions and _discover_foundation_models
- Fix stale _XFAIL_MODELS references in docstrings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>