
LocalEvalSetResultsManager writes double-encoded JSON files #3993

@ftnext

Description



Describe the bug
LocalEvalSetResultsManager.save_eval_set_result() writes JSON results as a JSON-encoded string (double-encoded), so the file contents look like "{\"key\": \"value\"}" instead of a JSON object.
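
For illustration, a self-contained sketch of the double-encoding effect with a toy payload (not ADK code):

import json

payload = {"key": "value"}
encoded_once = json.dumps(payload)  # '{"key": "value"}' -- a JSON object
encoded_twice = json.dumps(encoded_once)  # the object wrapped in a JSON string

print(encoded_twice)  # "{\"key\": \"value\"}"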

To Reproduce

Based on https://github.com/google/adk-python/blob/v1.21.0/tests/integration/test_single_agent.py#L19-L25:

from google.adk.evaluation.agent_evaluator import AgentEvaluator
from google.adk.evaluation.eval_config import get_eval_metrics_from_config
from google.adk.evaluation.local_eval_set_results_manager import (
    LocalEvalSetResultsManager,
)
from google.adk.evaluation.simulation.user_simulator_provider import (
    UserSimulatorProvider,
)
import pytest


@pytest.mark.asyncio
async def test_eval_agent(tmp_path):
  test_file = (
      "tests/integration/fixture/home_automation_agent/simple_test.test.json"
  )
  eval_config = AgentEvaluator.find_config_for_test_file(test_file)
  eval_set = AgentEvaluator._load_eval_set_from_file(
      test_file, eval_config, initial_session={}
  )
  eval_metrics = get_eval_metrics_from_config(eval_config)
  user_simulator_provider = UserSimulatorProvider(
      user_simulator_config=eval_config.user_simulator_config
  )

  agent_for_eval = await AgentEvaluator._get_agent_for_eval(
      module_name="tests.integration.fixture.home_automation_agent",
      agent_name=None,
  )

  eval_results_by_eval_id = (
      await AgentEvaluator._get_eval_results_by_eval_id(
          agent_for_eval=agent_for_eval,
          eval_set=eval_set,
          eval_metrics=eval_metrics,
          num_runs=4,
          user_simulator_provider=user_simulator_provider,
      )
  )

  results_manager = LocalEvalSetResultsManager(agents_dir=str(tmp_path))
  for eval_case_results in eval_results_by_eval_id.values():
    results_manager.save_eval_set_result(
        app_name="test_app",
        eval_set_id=eval_set.eval_set_id,
        eval_case_results=eval_case_results,
    )

  failures = []
  for eval_case_results in eval_results_by_eval_id.values():
    eval_metric_results = (
        AgentEvaluator._get_eval_metric_results_with_invocation(
            eval_case_results
        )
    )
    failures_per_eval_case = AgentEvaluator._process_metrics_and_get_failures(
        eval_metric_results=eval_metric_results,
        print_detailed_results=True,
        agent_module=None,
    )
    failures.extend(failures_per_eval_case)

  failure_message = "Following are all the test failures.\n" + "\n".join(
      failures
  )
  assert not failures, failure_message
"{\"eval_set_result_id\":\"test_app_b305bd06-38c5-4796-b9c7-d9c7454338b9_1766325534.0213041\",

Expected behavior
The saved file should contain a JSON object (e.g. {"eval_set_result_id": "...", ... }), not a JSON string.
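
For illustration, a quick check that would pass with the expected behavior (the file path here is hypothetical):

import json

# Hypothetical path to a saved eval set result file.
with open("eval_history/test_app_xxx.json", encoding="utf-8") as f:
  data = json.load(f)

# Expected: a dict (JSON object). Currently: a str, because the file
# holds a JSON-encoded string.
assert isinstance(data, dict), type(data)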

Screenshots
N/A

Desktop (please complete the following information):

  • OS: macOS
  • Python version (python -V): Python 3.13.8
  • ADK version (pip show google-adk): v1.21.0

Model Information:

  • Are you using LiteLLM: No
  • Which model is being used: gemini-2.0-flash-001

Additional context
Likely caused by double encoding: model_dump_json() already returns a JSON string, which is then passed through json.dumps() again. Using model_dump() with json.dump() (or writing the model_dump_json() output directly, as sketched after the snippet below) would avoid the double encoding.

# Convert to json and write to file.
eval_set_result_json = eval_set_result.model_dump_json()
eval_set_result_file_path = os.path.join(
    app_eval_history_dir,
    eval_set_result.eval_set_result_name + _EVAL_SET_RESULT_FILE_EXTENSION,
)
logger.info("Writing eval result to file: %s", eval_set_result_file_path)
with open(eval_set_result_file_path, "w", encoding="utf-8") as f:
  f.write(json.dumps(eval_set_result_json, indent=2))
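
For illustration, a minimal sketch of one possible fix under the same assumptions as the snippet above: since model_dump_json() already returns a JSON string, write it out directly instead of re-encoding it.

# Convert to json and write to file.
# model_dump_json() already returns a JSON string; write it as-is
# (pydantic v2 accepts indent= for pretty-printing).
eval_set_result_json = eval_set_result.model_dump_json(indent=2)
eval_set_result_file_path = os.path.join(
    app_eval_history_dir,
    eval_set_result.eval_set_result_name + _EVAL_SET_RESULT_FILE_EXTENSION,
)
logger.info("Writing eval result to file: %s", eval_set_result_file_path)
with open(eval_set_result_file_path, "w", encoding="utf-8") as f:
  f.write(eval_set_result_json)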

Metadata

Labels
eval [Component] This issue is related to evaluation