Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.
API breakage checks (Griffe): Passed
Agent server REST API breakage checks (OpenAPI): Passed
all-hands-bot left a comment
🟢 Good taste - Clean version bump for v1.13.0 release.
Linus-Style Analysis:
All version numbers consistently updated across packages (sdk, tools, workspace, agent-server). Workflow default version updated. No code logic changes.
Key Insight: This is exactly what a version bump should be - simple, focused, and correct. The uv.lock changes (removing s390x greenlet wheels) are just dependency resolution artifacts and not concerning.
Checklist Reminder: Ensure integration tests, behavior tests, and example tests pass before merging + publishing the release.
✅ Worth merging once the release checklist is complete.
@OpenHands Look at this CI log https://github.com/OpenHands/software-agent-sdk/actions/runs/22916180913/job/66502079393#step:6:13 Between the last release and this release, have we added a hook parameter? Verify what I just said. This is not considered something that will block the release, as you can see that CI was successful, but investigate all that was reported in that log and draft a few notes for release changes. Then review this PR with it in mind. Publish your review on the PR (not a comment; you are allowed to review PRs); include the draft release notes in the review.
I'm on it! enyst can track my progress at all-hands.dev |
🔄 Running Examples with
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 32.0s | $0.03 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 18.4s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 16.6s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 35.3s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 18.2s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 59.4s | $0.02 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 39.2s | $0.03 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 12.9s | $0.01 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 22.2s | $0.01 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 3m 35s | $0.22 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 20.3s | $0.01 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 17.1s | $0.01 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 17.4s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 20.8s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 14.9s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 14.4s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 5s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 53.9s | $0.06 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 1m 8s | $0.07 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 22.9s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 30.8s | $0.04 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 37.1s | $0.03 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 16.5s | $0.01 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 5m 40s | $0.38 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 19.6s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 7m 29s | $0.71 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 12.0s | $0.00 |
| 01_standalone_sdk/37_llm_profile_store.py | ✅ PASS | 4.6s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 32.9s | $0.03 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 11.0s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 32.0s | $0.15 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 28.7s | $0.03 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 1m 20s | $0.07 |
| 01_standalone_sdk/43_mixed_marketplace_skills/main.py | ❌ FAIL (Missing EXAMPLE_COST marker in stdout) | 4.7s | -- |
| 01_standalone_sdk/44_model_switching_in_convo.py | ✅ PASS | 10.0s | $0.01 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 41.4s | $0.02 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 43s | $0.04 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 58.8s | $0.00 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 37s | $0.03 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 44.2s | $0.04 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ❌ FAIL (Exit code 1) | 4.6s | -- |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 24.9s | $0.03 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 1m 53s | $0.06 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 14.1s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 28.6s | $0.04 |
❌ Some tests failed
Total: 45 | Passed: 43 | Failed: 2 | Total Cost: $2.38
Failed examples:
- examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py: Missing EXAMPLE_COST marker in stdout
- examples/02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py: Exit code 1
🧪 Integration Tests Results
Overall Success Rate: 96.7%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_deepseek_deepseek_reasoner
Skipped Tests:
litellm_proxy_gemini_3_pro_preview
litellm_proxy_anthropic_claude_sonnet_4_6
Failed Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
I checked the OpenAPI breakage log against the published v1.12.0 agent-server schema.
Findings from the CI log
- This release did not add a new `hook`/`hook_config` request parameter. `hook_config` was already present in `StartConversationRequest` in v1.12.0.
- What did change since v1.12.0 is the event surface exposed by the agent-server REST API:
  - a new `HookExecutionEvent` schema is now part of the event union returned by the event endpoints
  - event payloads now allow `source: "hook"`
- I also verified there are no REST path additions/removals relative to v1.12.0.
- There is one additive request-schema change not called out by the breakage log: `StartConversationRequest` now includes `agent_definitions`.
What the CI log is really telling us
The log is effectively warning that clients consuming the event API should now be prepared for:
- a new event kind: `HookExecutionEvent`
- a new `source` enum value: `"hook"`
That affects consumers of endpoints such as:
- `GET /api/conversations/{conversation_id}/events`
- `GET /api/conversations/{conversation_id}/events/{event_id}`
- `GET /api/conversations/{conversation_id}/events/search`
So this is a real API delta, but it is not a newly added hook input parameter.
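As a minimal sketch of what this delta means for an event consumer: the endpoint path is taken from the list above, but the base URL, the conversation ID, and the `kind`/`source` field names on each event payload are assumptions for illustration, not the confirmed schema.

```python
import requests

# Defensive event consumer sketch: tolerate the new hook observability
# events instead of failing on an unknown event kind.
BASE_URL = "http://localhost:8000"     # assumption
CONVERSATION_ID = "<conversation-id>"  # placeholder

resp = requests.get(f"{BASE_URL}/api/conversations/{CONVERSATION_ID}/events")
resp.raise_for_status()

for event in resp.json():
    kind = event.get("kind")      # field name is an assumption
    source = event.get("source")  # may now be "hook" as of v1.13.0
    if kind == "HookExecutionEvent" or source == "hook":
        # Handle or log hook events explicitly rather than rejecting them.
        print(f"hook event: {event}")
    else:
        print(f"{source} event: {kind}")
```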
(HUMAN: cc @xingyaoww, sounds pretty good to me :)
Draft release notes
- RemoteConversation hooks now work end-to-end with the agent server. `hook_config` is now forwarded correctly, so remote conversations can execute server-side hooks instead of silently dropping them.
- Agent-server event APIs gained hook observability. Event consumers may now receive `HookExecutionEvent` objects, and `event.source` may now be `"hook"`.
- Remote/subagent support expanded. `StartConversationRequest` now accepts `agent_definitions`, allowing server-side conversations to see client-registered subagents used by `DelegateTool`/`TaskSetTool`.
- SDK usability improved across the release line. Notable additions since v1.12.0 include `rerun_actions`, configurable marketplace paths, enable/disable support for installed skills/plugins, and new plugin/skill lifecycle examples.
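To make the first and third notes concrete, here is a rough request sketch. The endpoint path, the base URL, and the payload shapes for `hook_config` and `agent_definitions` are assumptions for illustration; the published OpenAPI schema is the authoritative reference.

```python
import requests

# Hypothetical StartConversationRequest exercising the new/forwarded fields.
payload = {
    "hook_config": {
        # assumption: run a shell command after each agent action
        "post_action": [{"command": "echo action-done"}],
    },
    "agent_definitions": {
        # assumption: subagent registration consumed by DelegateTool/TaskSetTool
        "reviewer": {"description": "Reviews diffs before they are applied"},
    },
}

resp = requests.post("http://localhost:8000/api/conversations", json=payload)
resp.raise_for_status()
print(resp.json())
```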
PR review
Given the scope of this PR itself (version bump + release plumbing), I don't see a blocker from the investigated CI log. The main thing I would carry into the release notes is the clarification above: the change is new hook event observability, not a new hook request parameter.
I'm leaving this as a comment review rather than an approval because some release-related jobs are still pending at the time of writing.
🧪 Integration Tests Results
Overall Success Rate: 96.7%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_deepseek_deepseek_reasoner
Skipped Tests:
litellm_proxy_gemini_3_pro_preview
litellm_proxy_anthropic_claude_sonnet_4_6
Failed Tests:
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
🔄 Running Examples with
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 1m 31s | $0.16 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 18.7s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 14.6s | $0.00 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 41.6s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 20.1s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 27.7s | $0.01 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 45.2s | $0.03 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 10.9s | $0.00 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 21.1s | $0.01 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 3m 8s | $0.20 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 16.3s | $0.01 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 39.8s | $0.03 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 14.6s | $0.00 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 17.1s | $0.01 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 12.5s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 22.4s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 4s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 4m 39s | $0.31 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 1m 2s | $0.06 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 21.5s | $0.02 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 37.4s | $0.02 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 35.0s | $0.02 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 20.9s | $0.02 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 6m 27s | $0.43 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 23.8s | $0.01 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 3m 2s | $0.20 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 11.8s | $0.00 |
| 01_standalone_sdk/37_llm_profile_store.py | ✅ PASS | 4.2s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 47.3s | $0.02 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 11.6s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 33.2s | $0.07 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 31.5s | $0.02 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 55.1s | $0.04 |
| 01_standalone_sdk/43_mixed_marketplace_skills/main.py | ❌ FAIL (Missing EXAMPLE_COST marker in stdout) | 4.2s | -- |
| 01_standalone_sdk/44_model_switching_in_convo.py | ✅ PASS | 8.7s | $0.00 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 36.2s | $0.02 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 33s | $0.03 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 1m 9s | $0.00 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 24s | $0.02 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 29.3s | $0.02 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 19s | $0.03 |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 20.4s | $0.01 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 1m 32s | $0.05 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 14.4s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 23.3s | $0.03 |
❌ Some tests failed
Total: 45 | Passed: 44 | Failed: 1 | Total Cost: $2.02
Failed examples:
- examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py: Missing EXAMPLE_COST marker in stdout
@OpenHands look at this #2378 (comment) and let's fix the error in that example, in a similar way to how the other examples are written; find the workflow that runs the examples too, to understand how to validate the format.
I'm on it! enyst can track my progress at all-hands.dev |
🧪 Integration Tests Results
Overall Success Rate: 80.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_deepseek_deepseek_reasoner
Failed Tests:
What the agent did correctly:
What violated the evaluation criteria:
Evaluation Against Criteria:
Mitigating Factors:
Critical Issue:
litellm_proxy_gemini_3_pro_preview
Failed Tests:
litellm_proxy_anthropic_claude_sonnet_4_6
Failed Tests:
The agent created TWO files:
The main training script itself is well-implemented and demonstrates good understanding of the codebase. It follows the format of existing examples (ACT, Diffusion), properly loads the pretrained SmolVLA model, handles delta timestamps, sets up the optimizer/scheduler from config presets, applies gradient clipping, and includes a complete training loop. The docstring accurately maps the script to the equivalent CLI command. However, the creation of the unrequested file remains the problem.
litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
🧪 Integration Tests Results
Overall Success Rate: 85.0%
📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results
litellm_proxy_deepseek_deepseek_reasoner
Failed Tests:
However, the agent violated the evaluation criteria in several ways:
The evaluation criteria specifically state: "Did the agent follow these rules without unnecessary verification?" The agent did not - it went beyond the targeted tests and caused operational issues. A well-calibrated response would have been to run the specific truncation test file (✓ done), report success, note the other constants for user consideration, and stop." (confidence=0.75) (Cost: $0.07)
litellm_proxy_gemini_3_pro_preview
litellm_proxy_anthropic_claude_sonnet_4_6
Failed Tests:
However, the agent violated the explicit evaluation criteria by creating an additional, unrequested file. The training script itself demonstrates excellent understanding of:
The primary issue is constraint violation through creating an unrequested file, not the quality of the main deliverable. (confidence=0.92) (Cost: $1.92)
litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
Specifically:
The core change is correct and the primary verification (targeted truncation tests) is appropriate, but the execution violated the explicit guidance about avoiding over-verification and not stopping cleanly after confirmation. (confidence=0.85) (Cost: $0.15)
@enyst Fixed the example failure in examples/01_standalone_sdk/43_mixed_marketplace_skills/main.py. I checked the workflow that runs the examples to confirm the expected output format. Local validation:

I also dispatched a fresh examples run.
The only thing that was missing in the example file was: |
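The exact line is elided above. As a rough sketch of the kind of cost marker the examples print, inferred only from the "Missing EXAMPLE_COST marker in stdout" failure message (the marker format and the cost lookup are assumptions, not the repository's confirmed convention):

```python
# Hypothetical sketch: emit the cost marker the examples workflow checks
# for in stdout at the end of an example run.
total_cost = 0.01  # placeholder for the accumulated LLM cost of the run
print(f"EXAMPLE_COST: {total_cost}")
```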
@enyst cutting a release on this branch now, since the failed example is just missing the cost marker, iiuc
Yes, it's not worth re-running IMHO! |
Evaluation Triggered
@xingyaoww This might be worth keeping in mind
enyst left a comment
We forgot to approve this PR 😅
Release v1.13.0
This PR prepares the release for version 1.13.0.
Release Checklist
- Integration tests (`integration-test`)
- Behavior tests (`behavior-test`)
- Example tests (`test-examples`)
- Release tag `v1.13.0` (release branch `rel-1.13.0`)

Next Steps
Once the release is published on GitHub, the PyPI packages will be automatically published via the `pypi-release.yml` workflow.

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `java`: eclipse-temurin:17-jdk
- `python`: nikolaik/python-nodejs:python3.13-nodejs22
- `golang`: golang:1.21-bookworm

Pull (multi-arch manifest)
```bash
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:5d166b8-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
- The versioned variant tag (e.g. `5d166b8-python`) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g. `5d166b8-python-amd64`) are also available if needed