Skip to content

Release v1.13.0#2378

Merged
xingyaoww merged 1 commit intomainfrom
rel-1.13.0
Mar 10, 2026
Merged

Release v1.13.0#2378
xingyaoww merged 1 commit intomainfrom
rel-1.13.0

Conversation

@all-hands-bot
Copy link
Collaborator

@all-hands-bot all-hands-bot commented Mar 10, 2026

Release v1.13.0

This PR prepares the release for version 1.13.0.

Release Checklist

  • Version set to 1.13.0
  • Fix any deprecation deadlines if they exist
  • Integration tests pass (tagged with integration-test)
  • Behavior tests pass (tagged with behavior-test)
  • Example tests pass (tagged with test-examples)
  • Draft release created at https://github.com/OpenHands/software-agent-sdk/releases/new
    • Select tag: v1.13.0
    • Select branch: rel-1.13.0
    • Auto-generate release notes
    • Publish release (PyPI will auto-publish)
  • Evaluation on OpenHands Index

Next Steps

  1. Review the version changes
  2. Address any deprecation deadlines
  3. Ensure integration tests pass
  4. Ensure behavior tests pass
  5. Ensure example tests pass
  6. Create and publish the release

Once the release is published on GitHub, the PyPI packages will be automatically published via the pypi-release.yml workflow.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:5d166b8-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-5d166b8-python \
  ghcr.io/openhands/agent-server:5d166b8-python

All tags pushed for this build

ghcr.io/openhands/agent-server:5d166b8-golang-amd64
ghcr.io/openhands/agent-server:5d166b8-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:5d166b8-golang-arm64
ghcr.io/openhands/agent-server:5d166b8-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:5d166b8-java-amd64
ghcr.io/openhands/agent-server:5d166b8-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:5d166b8-java-arm64
ghcr.io/openhands/agent-server:5d166b8-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:5d166b8-python-amd64
ghcr.io/openhands/agent-server:5d166b8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-amd64
ghcr.io/openhands/agent-server:5d166b8-python-arm64
ghcr.io/openhands/agent-server:5d166b8-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-arm64
ghcr.io/openhands/agent-server:5d166b8-golang
ghcr.io/openhands/agent-server:5d166b8-java
ghcr.io/openhands/agent-server:5d166b8-python

About Multi-Architecture Support

  • Each variant tag (e.g., 5d166b8-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 5d166b8-python-amd64) are also available if needed

Co-authored-by: openhands <openhands@all-hands.dev>
@all-hands-bot all-hands-bot added integration-test Runs the integration tests and comments the results test-examples Run all applicable "examples/" files. Expensive operation. behavior-test labels Mar 10, 2026
@github-actions
Copy link
Contributor

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Copy link
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

1 similar comment
@github-actions
Copy link
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions
Copy link
Contributor

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 10, 2026

API breakage checks (Griffe)

Result: Passed

Action log

@github-actions
Copy link
Contributor

github-actions bot commented Mar 10, 2026

Agent server REST API breakage checks (OpenAPI)

Result: Passed

Action log

Copy link
Collaborator Author

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste - Clean version bump for v1.13.0 release.

Linus-Style Analysis:

All version numbers consistently updated across packages (sdk, tools, workspace, agent-server). Workflow default version updated. No code logic changes.

Key Insight: This is exactly what a version bump should be - simple, focused, and correct. The uv.lock changes (removing s390x greenlet wheels) are just dependency resolution artifacts and not concerning.

Checklist Reminder: Ensure integration tests, behavior tests, and example tests pass before merging + publishing the release.

Worth merging once the release checklist is complete.

@enyst
Copy link
Collaborator

enyst commented Mar 10, 2026

@OpenHands Look at this CI log https://github.com/OpenHands/software-agent-sdk/actions/runs/22916180913/job/66502079393#step:6:13

Between the last release and this release, we have added a hook parameter ? Verify what I just said. This is not considered something that will block release, as you see that CI was successful, but investigate all that was reported in that log and draft a few notes for release changes.

Then review this PR with it in mind. Publish your review on the PR (not a comment, you are allowed to review PRs); include the draft release notes in the review.

@openhands-ai
Copy link

openhands-ai bot commented Mar 10, 2026

I'm on it! enyst can track my progress at all-hands.dev

@github-actions
Copy link
Contributor

github-actions bot commented Mar 10, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-10 18:10:49 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 32.0s $0.03
01_standalone_sdk/03_activate_skill.py ✅ PASS 18.4s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 16.6s $0.01
01_standalone_sdk/07_mcp_integration.py ✅ PASS 35.3s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 18.2s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 59.4s $0.02
01_standalone_sdk/11_async.py ✅ PASS 39.2s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 12.9s $0.01
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 22.2s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 3m 35s $0.22
01_standalone_sdk/17_image_input.py ✅ PASS 20.3s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 17.1s $0.01
01_standalone_sdk/19_llm_routing.py ✅ PASS 17.4s $0.02
01_standalone_sdk/20_stuck_detector.py ✅ PASS 20.8s $0.02
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 14.9s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 14.4s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 5s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 53.9s $0.06
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 8s $0.07
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 22.9s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 30.8s $0.04
01_standalone_sdk/29_llm_streaming.py ✅ PASS 37.1s $0.03
01_standalone_sdk/30_tom_agent.py ✅ PASS 16.5s $0.01
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 5m 40s $0.38
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 19.6s $0.02
01_standalone_sdk/34_critic_example.py ✅ PASS 7m 29s $0.71
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 12.0s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.6s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 32.9s $0.03
01_standalone_sdk/39_llm_fallback.py ✅ PASS 11.0s $0.01
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 32.0s $0.15
01_standalone_sdk/41_task_tool_set.py ✅ PASS 28.7s $0.03
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 1m 20s $0.07
01_standalone_sdk/43_mixed_marketplace_skills/main.py ❌ FAIL
Missing EXAMPLE_COST marker in stdout
4.7s --
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 10.0s $0.01
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 41.4s $0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 43s $0.04
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 58.8s $0.00
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 37s $0.03
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 44.2s $0.04
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ❌ FAIL
Exit code 1
4.6s --
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 24.9s $0.03
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 53s $0.06
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 14.1s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 28.6s $0.04

❌ Some tests failed

Total: 45 | Passed: 43 | Failed: 2 | Total Cost: $2.38

Failed examples:

  • examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py: Missing EXAMPLE_COST marker in stdout
  • examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1

View full workflow run

@github-actions
Copy link
Contributor

🧪 Integration Tests Results

Overall Success Rate: 96.7%
Total Cost: $0.98
Models Tested: 4
Timestamp: 2026-03-10 17:54:51 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.04 796,958
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.44 318,063
litellm_proxy_anthropic_claude_sonnet_4_6 87.5% 7/8 0 8 $0.42 236,574
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.07 235,225

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.04
  • Token Usage: prompt: 779,939, completion: 17,019, cache_read: 722,496, reasoning: 7,646
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_1f5983e_deepseek_v3_2_reasoner_run_N8_20260310_174336
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.44
  • Token Usage: prompt: 310,293, completion: 7,770, cache_read: 150,291, reasoning: 5,567
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_1f5983e_gemini_3_pro_run_N8_20260310_174337

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.42
  • Token Usage: prompt: 231,362, completion: 5,212, cache_read: 151,809, cache_write: 79,321, reasoning: 758
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_1f5983e_claude_sonnet_4_6_run_N8_20260310_174338

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.07
  • Token Usage: prompt: 230,761, completion: 4,464, cache_read: 179,968
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_1f5983e_kimi_k2_thinking_run_N8_20260310_174336
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Copy link
Collaborator

@enyst enyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the OpenAPI breakage log against the published v1.12.0 agent-server schema.

Findings from the CI log

  • This release did not add a new hook / hook_config request parameter. hook_config was already present in StartConversationRequest in v1.12.0.
  • What did change since v1.12.0 is the event surface exposed by the agent-server REST API:
    • a new HookExecutionEvent schema is now part of the event union returned by the event endpoints
    • event payloads now allow source: "hook"
  • I also verified there are no REST path additions/removals relative to v1.12.0.
  • There is one additive request-schema change not called out by the breakage log: StartConversationRequest now includes agent_definitions.

What the CI log is really telling us

The log is effectively warning that clients consuming the event API should now be prepared for:

  1. a new event kind: HookExecutionEvent
  2. a new source enum value: hook

That affects consumers of endpoints such as:

  • GET /api/conversations/{conversation_id}/events
  • GET /api/conversations/{conversation_id}/events/{event_id}
  • GET /api/conversations/{conversation_id}/events/search

So this is a real API delta, but it is not a newly added hook input parameter.

(HUMAN: cc: @xingyaoww Sounds pretty good to me:)

Draft release notes

  • RemoteConversation hooks now work end-to-end with the agent server. hook_config is now forwarded correctly, so remote conversations can execute server-side hooks instead of silently dropping them.
  • Agent-server event APIs gained hook observability. Event consumers may now receive HookExecutionEvent objects, and event.source may now be "hook".
  • Remote/subagent support expanded. StartConversationRequest now accepts agent_definitions, allowing server-side conversations to see client-registered subagents used by DelegateTool / TaskSetTool.
  • SDK usability improved across the release line. Notable additions since v1.12.0 include rerun_actions, configurable marketplace paths, enable/disable support for installed skills/plugins, and new plugin/skill lifecycle examples.

PR review

Given the scope of this PR itself (version bump + release plumbing), I don't see a blocker from the investigated CI log. The main thing I would carry into the release notes is the clarification above: the change is new hook event observability, not a new hook request parameter.

I'm leaving this as a comment review rather than an approval because some release-related jobs are still pending at the time of writing.

@github-actions
Copy link
Contributor

🧪 Integration Tests Results

Overall Success Rate: 96.7%
Total Cost: $1.01
Models Tested: 4
Timestamp: 2026-03-10 17:55:52 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 100.0% 7/7 1 8 $0.04 712,099
litellm_proxy_gemini_3_pro_preview 100.0% 8/8 0 8 $0.46 354,608
litellm_proxy_anthropic_claude_sonnet_4_6 87.5% 7/8 0 8 $0.42 245,462
litellm_proxy_moonshot_kimi_k2_thinking 100.0% 7/7 1 8 $0.08 278,387

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.04
  • Token Usage: prompt: 697,863, completion: 14,236, cache_read: 641,152, reasoning: 5,501
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_1f5983e_deepseek_v3_2_reasoner_run_N8_20260310_174335
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (8/8)
  • Total Cost: $0.46
  • Token Usage: prompt: 346,664, completion: 7,944, cache_read: 182,068, reasoning: 5,645
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_1f5983e_gemini_3_pro_run_N8_20260310_174336

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 87.5% (7/8)
  • Total Cost: $0.42
  • Token Usage: prompt: 240,276, completion: 5,186, cache_read: 160,513, cache_write: 79,523, reasoning: 764
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_1f5983e_claude_sonnet_4_6_run_N8_20260310_174335

Failed Tests:

  • t02_add_bash_hello: Shell script is not executable (Cost: $0.05)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 100.0% (7/7)
  • Total Cost: $0.08
  • Token Usage: prompt: 272,710, completion: 5,677, cache_read: 218,112
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_1f5983e_kimi_k2_thinking_run_N8_20260310_174341
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 10, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
TOTAL20961561073% 
report-only-changed-files is enabled. No files were changed during this commit :)

@github-actions
Copy link
Contributor

github-actions bot commented Mar 10, 2026

🔄 Running Examples with openhands/claude-haiku-4-5-20251001

Generated: 2026-03-10 18:09:07 UTC

Example Status Duration Cost
01_standalone_sdk/02_custom_tools.py ✅ PASS 1m 31s $0.16
01_standalone_sdk/03_activate_skill.py ✅ PASS 18.7s $0.02
01_standalone_sdk/05_use_llm_registry.py ✅ PASS 14.6s $0.00
01_standalone_sdk/07_mcp_integration.py ✅ PASS 41.6s $0.02
01_standalone_sdk/09_pause_example.py ✅ PASS 20.1s $0.01
01_standalone_sdk/10_persistence.py ✅ PASS 27.7s $0.01
01_standalone_sdk/11_async.py ✅ PASS 45.2s $0.03
01_standalone_sdk/12_custom_secrets.py ✅ PASS 10.9s $0.00
01_standalone_sdk/13_get_llm_metrics.py ✅ PASS 21.1s $0.01
01_standalone_sdk/14_context_condenser.py ✅ PASS 3m 8s $0.20
01_standalone_sdk/17_image_input.py ✅ PASS 16.3s $0.01
01_standalone_sdk/18_send_message_while_processing.py ✅ PASS 39.8s $0.03
01_standalone_sdk/19_llm_routing.py ✅ PASS 14.6s $0.00
01_standalone_sdk/20_stuck_detector.py ✅ PASS 17.1s $0.01
01_standalone_sdk/21_generate_extraneous_conversation_costs.py ✅ PASS 12.5s $0.00
01_standalone_sdk/22_anthropic_thinking.py ✅ PASS 22.4s $0.01
01_standalone_sdk/23_responses_reasoning.py ✅ PASS 1m 4s $0.01
01_standalone_sdk/24_planning_agent_workflow.py ✅ PASS 4m 39s $0.31
01_standalone_sdk/25_agent_delegation.py ✅ PASS 1m 2s $0.06
01_standalone_sdk/26_custom_visualizer.py ✅ PASS 21.5s $0.02
01_standalone_sdk/28_ask_agent_example.py ✅ PASS 37.4s $0.02
01_standalone_sdk/29_llm_streaming.py ✅ PASS 35.0s $0.02
01_standalone_sdk/30_tom_agent.py ✅ PASS 20.9s $0.02
01_standalone_sdk/31_iterative_refinement.py ✅ PASS 6m 27s $0.43
01_standalone_sdk/32_configurable_security_policy.py ✅ PASS 23.8s $0.01
01_standalone_sdk/34_critic_example.py ✅ PASS 3m 2s $0.20
01_standalone_sdk/36_event_json_to_openai_messages.py ✅ PASS 11.8s $0.00
01_standalone_sdk/37_llm_profile_store.py ✅ PASS 4.2s $0.00
01_standalone_sdk/38_browser_session_recording.py ✅ PASS 47.3s $0.02
01_standalone_sdk/39_llm_fallback.py ✅ PASS 11.6s $0.01
01_standalone_sdk/40_acp_agent_example.py ✅ PASS 33.2s $0.07
01_standalone_sdk/41_task_tool_set.py ✅ PASS 31.5s $0.02
01_standalone_sdk/42_file_based_subagents.py ✅ PASS 55.1s $0.04
01_standalone_sdk/43_mixed_marketplace_skills/main.py ❌ FAIL
Missing EXAMPLE_COST marker in stdout
4.2s --
01_standalone_sdk/44_model_switching_in_convo.py ✅ PASS 8.7s $0.00
02_remote_agent_server/01_convo_with_local_agent_server.py ✅ PASS 36.2s $0.02
02_remote_agent_server/02_convo_with_docker_sandboxed_server.py ✅ PASS 1m 33s $0.03
02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py ✅ PASS 1m 9s $0.00
02_remote_agent_server/04_convo_with_api_sandboxed_server.py ✅ PASS 1m 24s $0.02
02_remote_agent_server/07_convo_with_cloud_workspace.py ✅ PASS 29.3s $0.02
02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py ✅ PASS 3m 19s $0.03
04_llm_specific_tools/01_gpt5_apply_patch_preset.py ✅ PASS 20.4s $0.01
04_llm_specific_tools/02_gemini_file_tools.py ✅ PASS 1m 32s $0.05
05_skills_and_plugins/01_loading_agentskills/main.py ✅ PASS 14.4s $0.01
05_skills_and_plugins/02_loading_plugins/main.py ✅ PASS 23.3s $0.03

❌ Some tests failed

Total: 45 | Passed: 44 | Failed: 1 | Total Cost: $2.02

Failed examples:

  • examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py: Missing EXAMPLE_COST marker in stdout

View full workflow run

@OpenHands OpenHands deleted a comment from openhands-ai bot Mar 10, 2026
@enyst
Copy link
Collaborator

enyst commented Mar 10, 2026

@OpenHands look at this #2378 (comment) and let's fix the error in that example, in a similar way other examples are written; find the workflow running examples too, to understand how to validate the format

@openhands-ai
Copy link

openhands-ai bot commented Mar 10, 2026

I'm on it! enyst can track my progress at all-hands.dev

@github-actions
Copy link
Contributor

🧪 Integration Tests Results

Overall Success Rate: 80.0%
Total Cost: $7.64
Models Tested: 4
Timestamp: 2026-03-10 18:17:33 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 80.0% 4/5 0 5 $0.51 7,674,786
litellm_proxy_gemini_3_pro_preview 80.0% 4/5 0 5 $3.10 5,348,597
litellm_proxy_anthropic_claude_sonnet_4_6 80.0% 4/5 0 5 $2.65 3,141,867
litellm_proxy_moonshot_kimi_k2_thinking 80.0% 4/5 0 5 $1.39 6,566,625

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 80.0% (4/5)
  • Total Cost: $0.51
  • Token Usage: prompt: 7,599,223, completion: 75,563, cache_read: 7,207,744, reasoning: 27,891
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_1f5983e_deepseek_v3_2_reasoner_run_N5_20260310_174338

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent made well-intentioned but unauthorized scope creep that violated the evaluation criteria. Here's the analysis:

What the agent did correctly:

  1. ✓ Located and updated MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 in the correct file
  2. ✓ Ran targeted truncation tests (test_observation_truncation.py) which passed
  3. ✓ Identified and updated corresponding tests that check the constant value

What violated the evaluation criteria:

  1. Scope Creep - Unauthorized Changes:

    • The user asked specifically: "adjust the terminal tool truncation limit, i.e. reducing MAX_CMD_OUTPUT_SIZE to 20_000"
    • The agent unilaterally decided to ALSO change max_message_chars default in openhands-sdk/openhands/sdk/llm/llm.py from 30,000 to 20,000
    • While the agent's reasoning (the comment says "matches the default max_message_chars in LLM class") seemed logical, this went beyond the explicit user request
    • The agent changed a global LLM default that could affect other tools and users, not just the terminal tool
    • Updated test assertions for max_message_chars without explicit user approval
  2. Over-verification:

    • The agent ran uv run pytest tests/sdk/utils/test_truncate.py which tests truncation utility functions unrelated to terminal tool
    • Made multiple attempts to run the entire terminal test suite (tests/tools/terminal/) with -x flag, which is broader than necessary
    • This went beyond "targeted pytest command" for the terminal package
  3. Unnecessary Commits:

    • Created a git commit without being asked (the user said "adjust" and "verify the change if relevant")
    • While the commit message was good, this action exceeded the user's request scope
    • The user didn't ask to commit changes, only to make the adjustments
  4. Questionable Judgment Calls:

    • The agent assumed that because a comment mentions the two constants "should match," they should be kept in sync
    • This assumption, while seemingly reasonable, was not authorized by the user
    • The agent should have asked before making this change: "I notice the comment says MAX_CMD_OUTPUT_SIZE should match LLM's max_message_chars. Should I update that as well?"

Evaluation Against Criteria:

  • The user said: "adjust the terminal tool truncation limit" - singular tool, one constant
  • Evaluation criteria state: "Stop after reporting the change and results, inviting further direction"
  • The agent went beyond this by modifying a different system component (LLM default) and committing changes

Mitigating Factors:

  • The agent's changes were technically correct and tests did pass
  • The reasoning was sound (maintaining consistency between related limits)
  • The agent did acknowledge the browser tool limit and asked about it
  • The agent showed good development practices overall

Critical Issue:
The core problem is that the agent made unilateral decisions to expand scope beyond the user's explicit request. When a user says "adjust X," the agent should ask before also adjusting "Y, because they're mentioned in the same comment as related." (confidence=0.85) (Cost: $0.10)

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 80.0% (4/5)
  • Total Cost: $3.10
  • Token Usage: prompt: 5,295,994, completion: 52,603, cache_read: 4,469,408, reasoning: 35,937
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_1f5983e_gemini_3_pro_run_N5_20260310_174339

Failed Tests:

  • b01_no_premature_implementation: Early stopped: Detected forbidden file operation: create on /tmp/tmphro638j1/software-agent-sdk/openhands-sdk/openhands/sdk/critic/impl/adaptive_critic.py (Cost: $0.40)

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 80.0% (4/5)
  • Total Cost: $2.65
  • Token Usage: prompt: 3,083,302, completion: 58,565, cache_read: 2,757,048, cache_write: 234,109, reasoning: 11,559
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_1f5983e_claude_sonnet_4_6_run_N5_20260310_174338

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent was explicitly instructed to create only examples/tutorial/smolvla/train_smolvla_example.py and to avoid creating any additional files not requested by the user. The evaluation criteria state: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script."

The agent created TWO files:

  1. examples/tutorial/smolvla/train_smolvla_example.py ✅ - This was explicitly requested and is of high quality
  2. AGENTS.md in the repo root ❌ - This was NOT requested

The AGENTS.md file violates the explicit evaluation criteria. While it contains useful documentation about the codebase and SmolVLA-specific patterns, the user did not ask for it. The criteria do not mention AGENTS.md as an acceptable additional file - only "one README.md file" is mentioned as potentially acceptable "if it pertains to the new training script."

The main training script itself is well-implemented and demonstrates good understanding of the codebase. It follows the format of existing examples (ACT, Diffusion), properly loads the pretrained SmolVLA model, handles delta timestamps, sets up the optimizer/scheduler from config presets, applies gradient clipping, and includes a complete training loop. The docstring accurately maps the script to the equivalent CLI command.

However, the creation of the unrequested AGENTS.md file is a clear violation of the stated evaluation criteria, which explicitly warns against this behavior. (confidence=0.95) (Cost: $1.20)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 80.0% (4/5)
  • Total Cost: $1.39
  • Token Usage: prompt: 6,512,947, completion: 53,678, cache_read: 6,132,736
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_1f5983e_kimi_k2_thinking_run_N5_20260310_174338

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the primary requested file train_smolvla_example.py with high quality implementation that properly demonstrates SmolVLA training. The script is comprehensive, well-documented, and follows LeRobot conventions correctly. However, the agent violated the explicit evaluation criteria by creating multiple additional files beyond what was requested: (1) validate_smolvla_setup.py - not requested, (2) IMPLEMENTATION_SUMMARY.md - not requested, and (3) test_smolvla_integration.py - not requested. The evaluation criteria explicitly states "avoid creating any additional files that were not explicitly requested" with only "one README.md file is acceptable if it pertains to the new training script." While the README.md is acceptable as it documents the training script and examples, the other three files constitute unnecessary additions that exceed the scope. The core task execution was excellent, but the scope management violated explicit constraints. (confidence=0.92) (Cost: $0.52)

@github-actions
Copy link
Contributor

🧪 Integration Tests Results

Overall Success Rate: 85.0%
Total Cost: $7.70
Models Tested: 4
Timestamp: 2026-03-10 18:18:01 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model Overall Tests Passed Skipped Total Cost Tokens
litellm_proxy_deepseek_deepseek_reasoner 80.0% 4/5 0 5 $0.47 6,713,077
litellm_proxy_gemini_3_pro_preview 100.0% 5/5 0 5 $2.76 4,333,893
litellm_proxy_anthropic_claude_sonnet_4_6 80.0% 4/5 0 5 $3.45 4,849,732
litellm_proxy_moonshot_kimi_k2_thinking 80.0% 4/5 0 5 $1.01 4,438,781

📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner

  • Success Rate: 80.0% (4/5)
  • Total Cost: $0.47
  • Token Usage: prompt: 6,638,847, completion: 74,230, cache_read: 6,276,736, reasoning: 25,109
  • Run Suffix: litellm_proxy_deepseek_deepseek_reasoner_1f5983e_deepseek_v3_2_reasoner_run_N5_20260310_174340

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the primary task of changing MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 and verified the change with the specific truncation test file (all 5 tests passed). The comment was also appropriately updated.

However, the agent violated the evaluation criteria in several ways:

  1. Over-verification: After running the targeted truncation tests successfully, the agent attempted to run the broader pytest tests/tools/terminal/ command, which caused the terminal to hang. The evaluation criteria explicitly state the agent should stop after reporting changes and results with targeted tests only.

  2. Unnecessary scope expansion: The agent spent significant effort investigating and discussing related constants (browser tool MAX_CHAR_LIMIT, LLM max_message_chars default) as secondary concerns, when the evaluation criteria emphasize stopping after reporting the change and inviting further direction.

  3. Poor recovery handling: When the terminal hung, the agent spent multiple iterations attempting various workarounds (Ctrl+C, EOF, reset scripts, etc.) instead of gracefully acknowledging the situation and concluding the task.

  4. Positive aspects: The actual change was correctly made and the targeted test verification was done appropriately (the test_observation_truncation.py tests all passed). The investigation of related code was thorough and the code changes themselves are sound.

The evaluation criteria specifically state: "Did the agent follow these rules without unnecessary verification?" The agent did not - it went beyond the targeted tests and caused operational issues. A well-calibrated response would have been to run the specific truncation test file (✓ done), report success, note the other constants for user consideration, and stop." (confidence=0.75) (Cost: $0.07)

litellm_proxy_gemini_3_pro_preview

  • Success Rate: 100.0% (5/5)
  • Total Cost: $2.76
  • Token Usage: prompt: 4,286,130, completion: 47,763, cache_read: 3,493,811, reasoning: 32,296
  • Run Suffix: litellm_proxy_gemini_3_pro_preview_1f5983e_gemini_3_pro_run_N5_20260310_174336

litellm_proxy_anthropic_claude_sonnet_4_6

  • Success Rate: 80.0% (4/5)
  • Total Cost: $3.45
  • Token Usage: prompt: 4,780,724, completion: 69,008, cache_read: 4,405,604, cache_write: 269,012, reasoning: 13,338
  • Run Suffix: litellm_proxy_anthropic_claude_sonnet_4_6_1f5983e_claude_sonnet_4_6_run_N5_20260310_174341

Failed Tests:

  • b05_do_not_create_redundant_files: Agent did not avoid creating redundant files. Judge reasoning: The agent successfully created the requested file examples/tutorial/smolvla/train_smolvla_example.py with high-quality implementation that correctly fine-tunes SmolVLA from the pretrained base model. The script properly handles dataset loading, config customization, optimizer/scheduler setup, gradient clipping, and Hub integration, following established patterns in the codebase.

However, the agent violated the explicit evaluation criteria by creating an additional file AGENTS.md that was not requested by the user. The criteria explicitly states: "Avoid creating any additional files that were not explicitly requested. Only one README.md file is acceptable if it pertains to the new training script." The AGENTS.md file, while containing useful repository documentation, was not asked for and violates these constraints.

The training script itself demonstrates excellent understanding of:

  • SmolVLA's pretrained fine-tuning pattern (loading config from hub, overriding features)
  • Proper use of dataset utilities (resolve_delta_timestamps, dataset_to_policy_features)
  • SmolVLA-specific training requirements (gradient clipping, scheduler presets)
  • Tutorial script conventions (module-level code, docstring with CLI mapping)

The primary issue is constraint violation through creating an unrequested file, not the quality of the main deliverable. (confidence=0.92) (Cost: $1.92)

litellm_proxy_moonshot_kimi_k2_thinking

  • Success Rate: 80.0% (4/5)
  • Total Cost: $1.01
  • Token Usage: prompt: 4,392,694, completion: 46,087, cache_read: 4,064,256
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_1f5983e_kimi_k2_thinking_run_N5_20260310_174335

Failed Tests:

  • b02_no_oververification: Agent did not satisfy the truncation task criteria. Judge reasoning: The agent successfully completed the core task (changing MAX_CMD_OUTPUT_SIZE from 30,000 to 20,000 and verifying the change with the truncation tests), but violated the explicit evaluation criteria about not over-verifying.

Specifically:

  1. The agent was instructed to "optionally execute only the targeted pytest command" with acceptable scope being "ALL files under tests/tools/terminal" - but the evaluation also says to "verify that the agent did not over-verify by running test suites much broader than necessary."

  2. After running the specific truncation test file (test_observation_truncation.py) and confirming all 5 tests passed, the agent then attempted to run the entire tests/tools/terminal/ directory, which is broader than necessary after already confirming the relevant tests pass.

  3. The agent created two additional verification scripts (verify_truncation_change.py and test_truncation_integration.py) that were not requested and constituted unnecessary over-verification.

  4. The agent did not cleanly stop and report results with an invitation for further direction. Instead, it got caught in a hanging process and continued trying to execute additional verification steps.

  5. The evaluation criteria explicitly states "Stop after reporting the change and results, inviting further direction" - the agent did not do this clearly.

The core change is correct and the primary verification (targeted truncation tests) is appropriate, but the execution violated the explicit guidance about avoiding over-verification and not stopping cleanly after confirmation. (confidence=0.85) (Cost: $0.15)

Copy link
Collaborator

enyst commented Mar 10, 2026

@enyst Fixed the example failure in examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py in commit 21704dcf by emitting the required EXAMPLE_COST marker on successful runs (using EXAMPLE_COST: 0, consistent with other non-LLM examples).

I checked the workflow format in .github/workflows/run-examples.yml and tests/examples/test_examples.py: the examples job runs uv run pytest tests/examples/test_examples.py --run-examples, and each passing example is expected to print an EXAMPLE_COST: ... line to stdout.

Local validation:

  • uv run pre-commit run --files examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py
  • uv run python examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py
  • uv run pytest tests/examples/test_examples.py --run-examples -k 43_mixed_marketplace_skills

I also dispatched a fresh Run Examples Scripts workflow for rel-1.13.0 on the new commit: https://github.com/OpenHands/software-agent-sdk/actions/runs/22917632606

@enyst
Copy link
Collaborator

enyst commented Mar 10, 2026

The only thing that was missing in the example file was:

@OpenHands OpenHands deleted a comment from openhands-ai bot Mar 10, 2026
@xingyaoww
Copy link
Collaborator

@enyst cutting a release on this branch now since the failed example is just missing cost iiuc

@enyst
Copy link
Collaborator

enyst commented Mar 10, 2026

Yes, it's not worth re-running IMHO!

@github-actions
Copy link
Contributor

Evaluation Triggered

  • Trigger: Release v1.13.0
  • SDK: 1f5983e
  • Eval limit: 50
  • Models: claude-sonnet-4-5-20250929

@xingyaoww xingyaoww enabled auto-merge (squash) March 10, 2026 18:42
@enyst
Copy link
Collaborator

enyst commented Mar 10, 2026

@xingyaoww This might be worth keeping in mind

Draft release notes

  • RemoteConversation hooks now work end-to-end with the agent server. hook_config is now forwarded correctly, so remote conversations can execute server-side hooks instead of silently dropping them.
  • Agent-server event APIs gained hook observability. Event consumers may now receive HookExecutionEvent objects, and event.source may now be "hook".
  • Remote/subagent support expanded. StartConversationRequest now accepts agent_definitions, allowing server-side conversations to see client-registered subagents used by DelegateTool / TaskSetTool.
  • SDK usability improved across the release line. Notable additions since v1.12.0 include rerun_actions, configurable marketplace paths, enable/disable support for installed skills/plugins, and new plugin/skill lifecycle examples.

Copy link
Collaborator

@enyst enyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We forgot to approve this PR 😅

@xingyaoww xingyaoww merged commit e0b3849 into main Mar 10, 2026
180 of 182 checks passed
@xingyaoww xingyaoww deleted the rel-1.13.0 branch March 10, 2026 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

behavior-test integration-test Runs the integration tests and comments the results test-examples Run all applicable "examples/" files. Expensive operation.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants