DRAFT: refactor(llm): extract LLMCapabilities class from LLM #2279
VascoSch92 wants to merge 4 commits into main
Conversation
Extract capability detection logic from the LLM class into a new LLMCapabilities class to reduce complexity and improve maintainability. The LLM class was identified as a "God Class" with 1,472 lines, 37 methods, and 10+ mixed responsibilities. This refactoring addresses part of issue #2274 by extracting the capability detection responsibility.

Changes:
- Create new openhands/sdk/llm/capabilities.py with LLMCapabilities class
- LLMCapabilities handles:
  - Model information lookup from litellm
  - Context window validation
  - Vision support detection
  - Prompt caching support detection
  - Responses API support detection
  - Auto-detection of max_input_tokens and max_output_tokens
- Update LLM class to delegate capability methods to LLMCapabilities
  - LLM.vision_is_active(), is_caching_prompt_active(), uses_responses_api() now delegate to the internal _capabilities instance
- Move constants MIN_CONTEXT_WINDOW_TOKENS, ENV_ALLOW_SHORT_CONTEXT_WINDOWS, DEFAULT_MAX_OUTPUT_TOKENS_CAP to capabilities.py
- Add comprehensive unit tests for LLMCapabilities
- Update existing tests to patch the capabilities module instead of the llm module

This is Phase 1 of the LLM class decomposition plan, reducing the LLM class by ~100 lines while maintaining backward compatibility.

Co-authored-by: openhands <openhands@all-hands.dev>
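The test-patching change above follows the usual `unittest.mock` rule: patch a name in the module where it is looked up, which after this refactor is the capabilities module. A toy sketch with simulated modules (all names here are illustrative, not the real SDK API):

```python
import types
from unittest import mock

# Simulate the refactor: a constant now lives in `capabilities`,
# and `llm` reads it through that module at call time.
capabilities = types.ModuleType("capabilities")
capabilities.MIN_CONTEXT_WINDOW_TOKENS = 4096

llm = types.ModuleType("llm")
llm.context_ok = lambda tokens: tokens >= capabilities.MIN_CONTEXT_WINDOW_TOKENS

# Tests must now patch the capabilities module, not the llm module.
with mock.patch.object(capabilities, "MIN_CONTEXT_WINDOW_TOKENS", 1024):
    print(llm.context_ok(2048))  # True: the patched 1024 threshold applies

print(llm.context_ok(2048))  # False: the original 4096 is restored
```

Patching the old location would leave the constant in `capabilities` untouched, which is why the existing tests had to be updated.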
API breakage checks (Griffe): Result: Failed. Log excerpt (first 1000 characters)
Agent server REST API breakage checks (OpenAPI): Result: Passed
all-hands-bot
left a comment
🟡 Code Review: Acceptable Direction, Some Design Concerns
Taste Rating: 🟡 Acceptable - The refactoring achieves its goal of extracting capabilities from the God Class, but introduces some awkward ownership patterns and code smells.
Verdict: ✅ Worth merging after addressing design concerns (note: this is a DRAFT, so take time to refine)
Key Insight: The core refactoring is sound, but the dual ownership of token limits and parameter explosion suggest the abstraction boundary could be cleaner. Consider having LLMCapabilities fully own these values instead of syncing them back.
See inline comments for specific issues.
all-hands-bot
left a comment
🟡 Code Review: Solid Extraction, Maintenance Concerns
Taste Rating: 🟡 Acceptable - The refactoring achieves its goal of extracting capabilities from the God Class and addresses previous review concerns (CapabilitiesConfig, proper typing, ownership clarity). The extraction is real and tested.
Verdict: 💬 Discuss before merging - This changes capability detection logic (vision, caching, token limits) which could affect eval performance. The extraction is sound, but hardcoded model-specific limits will become a maintenance burden.
Key Insight: The fundamental refactoring is good, but you're moving hardcoded model-specific magic numbers from one class to another. That's not actually solving the problem - it's just relocating it.
Note: This PR modifies LLM capability detection which could affect benchmark behavior. A human maintainer should verify with lightweight evals before merging.
all-hands-bot
left a comment
🟡 Solid Extraction, One Critical Behavior Change
Taste Rating: 🟡 Acceptable - The refactoring achieves clean separation and the CapabilitiesConfig dataclass solves the parameter explosion issue from previous reviews. However, there's a subtle behavior change in o3 token limit handling that contradicts the PR's "no behavior changes" claim.
The Core Issue: The comment says limits are "upper caps" but the implementation unconditionally overrides model_info. For o3 models specifically, the old logic capped at 100k AFTER checking model_info, so a model with max_output_tokens=50k in model_info would keep 50k. The new logic returns early and always sets 100k, ignoring model_info entirely.
See inline comments for details.
```python
MODEL_OUTPUT_TOKEN_LIMITS: Final[dict[str, int]] = {
    "claude-3-7-sonnet": 64000,
    "claude-sonnet-4": 64000,
    "kimi-k2-thinking": 64000,
    "o3": 100000,
```
🟠 Important: Comment Doesn't Match Implementation
This comment claims the limit is an "upper cap" that clamps down higher values from litellm, but the implementation unconditionally overrides model_info (see line 165 early return).
The behavior change: For o3 models, the old code applied the limit as a cap AFTER model_info detection:
```python
# Old: Applied after model_info check
if "o3" in self.model:
    if self.max_output_tokens is None or self.max_output_tokens > 100000:
        self.max_output_tokens = 100000
```

This meant if model_info said max_output_tokens=50k, it would keep 50k. If model_info said 150k, it would clamp to 100k.
The new code checks MODEL_OUTPUT_TOKEN_LIMITS FIRST and returns early, so model_info is never consulted for o3 models. This changes behavior for o3 models where model_info might have a value < 100k.
This contradicts the PR's claim of "No external behavior changes".
Either fix the comment to say "these override model_info" OR fix the implementation to actually apply limits as caps:
```python
# Get base value from model_info first
base_value = None
if self._model_info is not None:
    if isinstance(self._model_info.get("max_output_tokens"), int):
        base_value = self._model_info.get("max_output_tokens")
    elif isinstance(max_tokens_value := self._model_info.get("max_tokens"), int):
        base_value = min(max_tokens_value, DEFAULT_MAX_OUTPUT_TOKENS_CAP)

# Apply model-specific caps
for model_prefix, limit in MODEL_OUTPUT_TOKEN_LIMITS.items():
    if model_prefix in model:
        self.detected_max_output_tokens = min(base_value, limit) if base_value else limit
        return
```

```python
model = self._config.model

# 1. Check model-specific overrides (from MODEL_OUTPUT_TOKEN_LIMITS)
for model_prefix, limit in MODEL_OUTPUT_TOKEN_LIMITS.items():
```
🔴 Critical: Early Return Changes o3 Behavior
This early return means model_info is never checked for models matching MODEL_OUTPUT_TOKEN_LIMITS.
For Claude models: Old behavior also skipped model_info (they were in an if/elif), so no change.
For o3 models: Old behavior checked model_info FIRST, then applied 100k as a cap. If model_info said max_output_tokens=50k, the old code kept 50k. New code unconditionally sets 100k.
Test gap: Your test test_o3_output_tokens_clamped only checks the case where model_info > 100k. Add a test for model_info < 100k to verify intended behavior.
Is the new behavior (always 100k for o3) intentional? If so, update the comment on lines 68-72 and the PR description. If not, move the model_info check before the MODEL_OUTPUT_TOKEN_LIMITS loop.
Summary

This PR extracts capability detection logic from the LLM class into a new LLMCapabilities class to reduce complexity and improve maintainability.

Fixes #2274 (Phase 1: LLMCapabilities extraction)

Changes

New file: openhands/sdk/llm/capabilities.py
- Created the LLMCapabilities class that encapsulates capability detection

Updated: openhands/sdk/llm/llm.py
- Replaced the _model_info private attribute with _capabilities: LLMCapabilities | None
- Updated the _set_env_side_effects validator to initialize LLMCapabilities
- vision_is_active(), is_caching_prompt_active(), uses_responses_api() now delegate to _capabilities
- The model_info property now delegates to _capabilities.model_info
- Removed _init_model_info_and_caps(), _validate_context_window_size(), _supports_vision()
- Moved the constants MIN_CONTEXT_WINDOW_TOKENS, ENV_ALLOW_SHORT_CONTEXT_WINDOWS, DEFAULT_MAX_OUTPUT_TOKENS_CAP
- No longer imports get_litellm_model_info, supports_vision, LLMContextWindowTooSmallError

Tests
- Added unit tests for LLMCapabilities in tests/sdk/llm/test_capabilities.py (19 tests)
- Updated existing tests to patch openhands.sdk.llm.capabilities instead of openhands.sdk.llm.llm

Impact
- ~100 lines removed from llm.py (moved to capabilities.py)
- LLMCapabilities handles only capability detection

Design Rationale

The LLM class was identified as a "God Class" with 1,472 lines, 37 methods, and 10+ mixed responsibilities. This refactoring follows the issue's proposed solution to extract an LLMCapabilities class that handles model capability detection, which is now isolated and independently testable.

Verification

Next Steps (from issue #2274)

This PR completes Phase 1: Low-Risk Extractions for the LLM class. Future work includes:
- MessageFormatter class (Phase 1 continuation)

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)

```shell
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:c07144c-python
```

Run

All tags pushed for this build

About Multi-Architecture Support
- The versioned tag (e.g. c07144c-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g. c07144c-python-amd64) are also available if needed