Skip to content

DRAFT: refactor(llm): extract LLMCapabilities class from LLM#2279

Draft
VascoSch92 wants to merge 4 commits intomainfrom
openhands/extract-llm-capabilities
Draft

DRAFT: refactor(llm): extract LLMCapabilities class from LLM#2279
VascoSch92 wants to merge 4 commits intomainfrom
openhands/extract-llm-capabilities

Conversation

@VascoSch92
Copy link
Contributor

@VascoSch92 VascoSch92 commented Mar 3, 2026

Summary

This PR extracts capability detection logic from the LLM class into a new LLMCapabilities class to reduce complexity and improve maintainability.

Fixes #2274 (Phase 1: LLMCapabilities extraction)

Changes

New File: openhands/sdk/llm/capabilities.py

Created LLMCapabilities class that encapsulates:

  • Model information lookup from litellm
  • Context window validation (MIN_CONTEXT_WINDOW_TOKENS)
  • Vision support detection
  • Prompt caching support detection
  • Responses API support detection
  • Auto-detection of max_input_tokens and max_output_tokens

Updated: openhands/sdk/llm/llm.py

  • Replaced _model_info private attribute with _capabilities: LLMCapabilities | None
  • Updated _set_env_side_effects validator to initialize LLMCapabilities
  • vision_is_active(), is_caching_prompt_active(), uses_responses_api() now delegate to _capabilities
  • model_info property now delegates to _capabilities.model_info
  • Removed now-unused methods: _init_model_info_and_caps(), _validate_context_window_size(), _supports_vision()
  • Moved constants to capabilities.py: MIN_CONTEXT_WINDOW_TOKENS, ENV_ALLOW_SHORT_CONTEXT_WINDOWS, DEFAULT_MAX_OUTPUT_TOKENS_CAP
  • Removed unused imports: get_litellm_model_info, supports_vision, LLMContextWindowTooSmallError

Tests

  • Added comprehensive unit tests for LLMCapabilities in tests/sdk/llm/test_capabilities.py (19 tests)
  • Updated existing tests to patch openhands.sdk.llm.capabilities instead of openhands.sdk.llm.llm

Impact

  • No external behavior changes: All public APIs remain unchanged
  • All existing tests pass: 628 tests pass
  • Line count reduction: Removed ~100 lines from llm.py (moved to capabilities.py)
  • Single responsibility: LLMCapabilities handles only capability detection

Design Rationale

The LLM class was identified as a "God Class" with:

  • 1,472 lines
  • 37 methods
  • 10+ mixed responsibilities

This refactoring follows the issue's proposed solution to extract an LLMCapabilities class that handles model capability detection, which is now isolated and independently testable.

Verification

# All tests pass
uv run pytest tests/sdk/llm/ --timeout=300 -q
# 628 passed, 6 warnings

# Pre-commit checks pass
uv run pre-commit run --files openhands-sdk/openhands/sdk/llm/llm.py openhands-sdk/openhands/sdk/llm/capabilities.py
# All checks pass

Next Steps (from issue #2274)

This PR completes Phase 1: Low-Risk Extractions for the LLM class. Future work includes:

  • Extract MessageFormatter class (Phase 1 continuation)
  • Add factory parameters for Metrics/Telemetry (Phase 2)

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.12-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:c07144c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-c07144c-python \
  ghcr.io/openhands/agent-server:c07144c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:c07144c-golang-amd64
ghcr.io/openhands/agent-server:c07144c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:c07144c-golang-arm64
ghcr.io/openhands/agent-server:c07144c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:c07144c-java-amd64
ghcr.io/openhands/agent-server:c07144c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:c07144c-java-arm64
ghcr.io/openhands/agent-server:c07144c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:c07144c-python-amd64
ghcr.io/openhands/agent-server:c07144c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:c07144c-python-arm64
ghcr.io/openhands/agent-server:c07144c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:c07144c-golang
ghcr.io/openhands/agent-server:c07144c-java
ghcr.io/openhands/agent-server:c07144c-python

About Multi-Architecture Support

  • Each variant tag (e.g., c07144c-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., c07144c-python-amd64) are also available if needed

Extract capability detection logic from the LLM class into a new
LLMCapabilities class to reduce complexity and improve maintainability.

The LLM class was identified as a "God Class" with 1,472 lines, 37 methods,
and 10+ mixed responsibilities. This refactoring addresses part of
issue #2274 by extracting the capability detection responsibility.

Changes:
- Create new openhands/sdk/llm/capabilities.py with LLMCapabilities class
- LLMCapabilities handles:
  - Model information lookup from litellm
  - Context window validation
  - Vision support detection
  - Prompt caching support detection
  - Responses API support detection
  - Auto-detection of max_input_tokens and max_output_tokens
- Update LLM class to delegate capability methods to LLMCapabilities
- LLM.vision_is_active(), is_caching_prompt_active(), uses_responses_api()
  now delegate to the internal _capabilities instance
- Move constants MIN_CONTEXT_WINDOW_TOKENS, ENV_ALLOW_SHORT_CONTEXT_WINDOWS,
  DEFAULT_MAX_OUTPUT_TOKENS_CAP to capabilities.py
- Add comprehensive unit tests for LLMCapabilities
- Update existing tests to patch capabilities module instead of llm module

This is Phase 1 of the LLM class decomposition plan, reducing the LLM
class by ~100 lines while maintaining backward compatibility.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

API breakage checks (Griffe)

Result: Failed

Log excerpt (first 1000 characters)

============================================================
Checking openhands-sdk (openhands.sdk)
============================================================
Comparing openhands-sdk 1.11.5 against 1.11.4
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): load_public_skills
::notice title=openhands-sdk API::Ignoring Field metadata-only change (non-breaking): temperature
::warning file=openhands-sdk/openhands/sdk/llm/llm.py,line=184,title=LLM.top_p::Attribute value was changed: `Field(default=1.0, ge=0, le=1)` -> `Field(default=None, ge=0, le=1, description='Nucleus sampling parameter. Defaults to None (uses provider default). Set to a value between 0 and 1 to control diversity of outputs.')`
::error title=SemVer::Breaking changes detected (1); require at least minor version bump from 1.11.x, but new is 1.11.5

============================================================
Checking openhands-workspace (openhands.workspace)
============================

Action log

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

Agent server REST API breakage checks (OpenAPI)

Result: Passed

Action log

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-sdk/openhands/sdk/llm
   llm.py4447583%419, 472, 710, 816, 818–819, 847, 893, 904–906, 910–914, 922–924, 934–936, 939–940, 944, 946–947, 949, 1234–1235, 1244, 1257, 1259–1264, 1266–1283, 1286–1290, 1292–1293, 1299–1308, 1359, 1361
TOTAL19131570470% 

@VascoSch92 VascoSch92 requested a review from all-hands-bot March 3, 2026 16:13
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Code Review: Acceptable Direction, Some Design Concerns

Taste Rating: 🟡 Acceptable - The refactoring achieves its goal of extracting capabilities from the God Class, but introduces some awkward ownership patterns and code smells.

Verdict:Worth merging after addressing design concerns (note: this is a DRAFT, so take time to refine)

Key Insight: The core refactoring is sound, but the dual ownership of token limits and parameter explosion suggest the abstraction boundary could be cleaner. Consider having LLMCapabilities fully own these values instead of syncing them back.


See inline comments for specific issues.

@VascoSch92 VascoSch92 requested a review from all-hands-bot March 3, 2026 18:00
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Code Review: Solid Extraction, Maintenance Concerns

Taste Rating: 🟡 Acceptable - The refactoring achieves its goal of extracting capabilities from the God Class and addresses previous review concerns (CapabilitiesConfig, proper typing, ownership clarity). The extraction is real and tested.

Verdict: 💬 Discuss before merging - This changes capability detection logic (vision, caching, token limits) which could affect eval performance. The extraction is sound, but hardcoded model-specific limits will become a maintenance burden.

Key Insight: The fundamental refactoring is good, but you're moving hardcoded model-specific magic numbers from one class to another. That's not actually solving the problem - it's just relocating it.


Note: This PR modifies LLM capability detection which could affect benchmark behavior. A human maintainer should verify with lightweight evals before merging.

@VascoSch92 VascoSch92 requested a review from all-hands-bot March 3, 2026 18:15
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Solid Extraction, One Critical Behavior Change

Taste Rating: 🟡 Acceptable - The refactoring achieves clean separation and the CapabilitiesConfig dataclass solves the parameter explosion issue from previous reviews. However, there's a subtle behavior change in o3 token limit handling that contradicts the PR's "no behavior changes" claim.

The Core Issue: The comment says limits are "upper caps" but the implementation unconditionally overrides model_info. For o3 models specifically, the old logic capped at 100k AFTER checking model_info, so a model with max_output_tokens=50k in model_info would keep 50k. The new logic returns early and always sets 100k, ignoring model_info entirely.

See inline comments for details.

Comment on lines +68 to +72
MODEL_OUTPUT_TOKEN_LIMITS: Final[dict[str, int]] = {
"claude-3-7-sonnet": 64000,
"claude-sonnet-4": 64000,
"kimi-k2-thinking": 64000,
"o3": 100000,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 Important: Comment Doesn't Match Implementation

This comment claims the limit is an "upper cap" that clamps down higher values from litellm, but the implementation unconditionally overrides model_info (see line 165 early return).

The behavior change: For o3 models, the old code applied the limit as a cap AFTER model_info detection:

# Old: Applied after model_info check
if "o3" in self.model:
    if self.max_output_tokens is None or self.max_output_tokens > 100000:
        self.max_output_tokens = 100000

This meant if model_info said max_output_tokens=50k, it would keep 50k. If model_info said 150k, it would clamp to 100k.

The new code checks MODEL_OUTPUT_TOKEN_LIMITS FIRST and returns early, so model_info is never consulted for o3 models. This changes behavior for o3 models where model_info might have a value < 100k.

This contradicts the PR's claim of "No external behavior changes".

Either fix the comment to say "these override model_info" OR fix the implementation to actually apply limits as caps:

# Get base value from model_info first
base_value = None
if self._model_info is not None:
    if isinstance(self._model_info.get("max_output_tokens"), int):
        base_value = self._model_info.get("max_output_tokens")
    elif isinstance(max_tokens_value := self._model_info.get("max_tokens"), int):
        base_value = min(max_tokens_value, DEFAULT_MAX_OUTPUT_TOKENS_CAP)

# Apply model-specific caps
for model_prefix, limit in MODEL_OUTPUT_TOKEN_LIMITS.items():
    if model_prefix in model:
        self.detected_max_output_tokens = min(base_value, limit) if base_value else limit
        return

model = self._config.model

# 1. Check model-specific overrides (from MODEL_OUTPUT_TOKEN_LIMITS)
for model_prefix, limit in MODEL_OUTPUT_TOKEN_LIMITS.items():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Critical: Early Return Changes o3 Behavior

This early return means model_info is never checked for models matching MODEL_OUTPUT_TOKEN_LIMITS.

For Claude models: Old behavior also skipped model_info (they were in an if/elif), so no change.

For o3 models: Old behavior checked model_info FIRST, then applied 100k as a cap. If model_info said max_output_tokens=50k, the old code kept 50k. New code unconditionally sets 100k.

Test gap: Your test test_o3_output_tokens_clamped only checks the case where model_info > 100k. Add a test for model_info < 100k to verify intended behavior.

Is the new behavior (always 100k for o3) intentional? If so, update the comment on lines 68-72 and the PR description. If not, move the model_info check before the MODEL_OUTPUT_TOKEN_LIMITS loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactoring Proposal: Large Classes Identified for Decomposition

3 participants