Merged
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -33,7 +33,7 @@ jobs:
- name: Test with pytest - PR
if: github.event_name == 'pull_request'
run: |
DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
- name: Test with pytest
if: github.event_name != 'pull_request'
run: |
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -18,4 +18,8 @@ Added in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysi

### 2025-05-28

Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.

### 2025-06-11

Added support for [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or custom tasks generated with SWE-smith.
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include debug_gym/envs/configs/*.yaml
25 changes: 20 additions & 5 deletions README.md
@@ -98,6 +98,7 @@ We provide the below LLM-based agents, they all have minimal design and serve th
| `debug_agent` | `pdb`, `rewrite`, `view`, `eval` | A minimal agent that dumps all available information into its prompt and queries the LLM to generate a command. |
| `rewrite_agent` | `rewrite`, `view`, `eval` | A `debug_agent` with the `pdb` tool disabled (the agent keeps rewriting). |
| `debug_5_agent` | `pdb`, `rewrite`, `view`, `eval` | A `debug_agent`, but the `pdb` tool is only enabled after a certain number of rewrites. |
| `solution_agent` | `pdb`, `eval` | An oracle agent that applies a gold patch (only works with the `swebench` and `swesmith` benchmarks for now). The agent checks that the tests fail before applying the patch and pass afterward. It also checks that the `pdb` tool can be used as expected. |

---

@@ -109,6 +110,7 @@ To demonstrate how to integrate `debug-gym` with coding tasks and repositories,
| :-: | :----- |
| `aider` | [https://github.com/Aider-AI/aider](https://github.com/Aider-AI/aider) |
| `swebench`| [https://github.com/princeton-nlp/SWE-bench](https://github.com/princeton-nlp/SWE-bench) |
| `swesmith`| [https://github.com/SWE-bench/SWE-smith](https://github.com/SWE-bench/SWE-smith) |
| `mini_nightmare` | A set of 10 hand-crafted minimal buggy code snippets that rewrite-only agents have a harder time tackling. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |

---
@@ -122,28 +124,41 @@ Add `-v`, `--debug` to be verbose, or to enter debug mode.
> [!WARNING]
> When using --debug, you will need to press `c` to continue after each reasoning step.

#### 3.1 Human Mode
#### 3.1 Sanity Checks

You can use the `solution_agent` to validate that your `swebench` and `swesmith` instances work as expected. This agent applies a gold patch to the buggy code and checks that the tests fail before the patch is applied and pass afterward. It also checks that the `pdb` tool can be used as expected.

python scripts/run.py scripts/config_swebench.yaml --agent solution_agent
python scripts/run.py scripts/config_swesmith.yaml --agent solution_agent

#### 3.2 Human Mode

We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, change the `llm_name` field in the `config_*.yaml` to be `"human"`. Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.

#### 3.2. Overriding Values in Config
#### 3.3. Overriding Values in Config

`-p` is a handy way to override values defined in the config. For example, the command below runs the `rewrite_agent` on Aider in human mode (while the config file specifies gpt-4o).

python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"

#### 3.3. Debugging a Custom Repository
#### 3.4. Debugging a Custom Repository

Modify `scripts/config.yaml`, especially the `env_kwargs`, to set the path and entrypoint of the custom repository. We assume the repository contains a `.debugignore` file and a `.debugreadonly` file that label files/folders as hidden or read-only, respectively.

As an example, we provide a buggy pytorch code repository in `data/pytorch`.

python scripts/run.py scripts/config.yaml --agent <agent name>
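For illustration, assuming `.debugignore` uses gitignore-style patterns (an assumption; check the source for the exact matching semantics), such a file might look like:

```
# Hidden from the agent entirely
.git/
__pycache__/
*.pyc
```

A `.debugreadonly` file in the same style could then list paths the agent may read but not edit, e.g. a `tests/` folder.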

#### 3.4. Design Your Own Tool
#### 3.5. Debugging a Custom SWE-Smith Instance

[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), you can pass `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:

python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
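If the local dataset is stored as JSON lines, a quick sanity check of its columns can be sketched with the standard library. The field names below are assumptions based on SWE-bench-style datasets; adjust them to your schema:

```python
import json

# Assumed column names (SWE-bench-style); replace with your dataset's schema.
REQUIRED_FIELDS = {"instance_id", "repo", "patch"}

def missing_fields(jsonl_lines):
    """Return the set of required fields absent from any record."""
    missing = set()
    for line in jsonl_lines:
        record = json.loads(line)
        missing |= REQUIRED_FIELDS - record.keys()
    return missing

sample = ['{"instance_id": "demo__pkg.abc123", "repo": "demo/pkg", "patch": "diff ..."}']
print(missing_fields(sample))  # set() means every record has all required fields
```

An empty result means the records at least carry the expected columns; it does not validate their contents.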

#### 3.6. Design Your Own Tool
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new `debug-gym`-compatible tools, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).

#### 3.5. Analysis and Visualization
#### 3.7. Analysis and Visualization

We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
- In the `analysis` folder, we provide scripts that used to generate the corresponding figures in our technical report.
1 change: 1 addition & 0 deletions debug_gym/agents/__init__.py
@@ -1,2 +1,3 @@
from debug_gym.agents.debug_agent import Debug_5_Agent, DebugAgent
from debug_gym.agents.rewrite_agent import RewriteAgent
from debug_gym.agents.solution_agent import AgentSolution
69 changes: 69 additions & 0 deletions debug_gym/agents/solution_agent.py
@@ -0,0 +1,69 @@
import subprocess

from debug_gym.agents.base_agent import BaseAgent, register_agent
from debug_gym.gym.envs.swe_bench import SWEBenchEnv
from debug_gym.gym.envs.swe_smith import SWESmithEnv
from debug_gym.gym.tools.tool import ToolCall


@register_agent
class AgentSolution(BaseAgent):
name: str = "solution_agent"

def run(self, task_name=None, debug=False):
self.history.reset()

info = self.env.reset(options={"task_name": task_name})
self.history.step(info)

if info.done is True:
return True

self.logger.info(
f"Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%})"
)

# Make a simple pdb call to make sure it is working.
action = ToolCall(name="pdb", id="pdb", arguments={"command": "help help"})
pdb_help_info = self.env.step(action)
assert (
"h(elp)" in pdb_help_info.step_observation.observation
), f"PDB command did not return expected help message.\n{pdb_help_info.step_observation.observation}"

# Send a pdb continue command, and check the output matches the one from env.reset.
action = ToolCall(name="pdb", id="pdb", arguments={"command": "continue"})
pdb_continue_info = self.env.step(action)

assert (
"Reached the end of the program. Restarting the debugging session."
in pdb_continue_info.step_observation.observation
) or (
info.step_observation.observation.splitlines()[-1]
in pdb_continue_info.step_observation.observation
), f"PDB command did not return expected continue message.\n{pdb_continue_info.step_observation.observation}"

try:
self.env.apply_gold_patch()
except NotImplementedError as e:
self.logger.error(
f"The environment {type(self.env)} is not compatible with SolutionAgent. "
"Check the README.md to see which environments are compatible."
)
raise

if debug:
breakpoint()

action = ToolCall(name="eval", id="eval", arguments={})
info = self.env.step(action)

self.history.step(info)

self.logger.info(
f"Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%})"
)
assert (
info.done
), f"The task is not done after applying the gold patch.\n{info.step_observation.observation}"

return info.done
5 changes: 5 additions & 0 deletions debug_gym/agents/utils.py
@@ -108,6 +108,11 @@ def load_config():
action="store_true",
help="Break before sending action to the environment.",
)
parser.add_argument(
"--list",
action="store_true",
help="List available agents and problems.",
)
group = parser.add_mutually_exclusive_group()
group.add_argument(
"-v",
6 changes: 4 additions & 2 deletions debug_gym/gym/envs/__init__.py
@@ -2,18 +2,20 @@
from debug_gym.gym.envs.env import RepoEnv, TooledEnv
from debug_gym.gym.envs.mini_nightmare import MiniNightmareEnv
from debug_gym.gym.envs.swe_bench import SWEBenchEnv
from debug_gym.gym.envs.swe_smith import SWESmithEnv


def select_env(env_type: str = None):
def select_env(env_type: str = None) -> type[RepoEnv]:
match env_type:
case None:
return RepoEnv
case "aider":
return AiderBenchmarkEnv
case "swebench":
return SWEBenchEnv
case "swesmith":
return SWESmithEnv
case "mini_nightmare":
return MiniNightmareEnv
case _:
raise ValueError(f"Unknown benchmark {env_type}")
15 changes: 15 additions & 0 deletions debug_gym/gym/envs/configs/swe_smith.yaml

Large diffs are not rendered by default.

17 changes: 14 additions & 3 deletions debug_gym/gym/envs/env.py
@@ -252,11 +252,17 @@ def set_entrypoints(self, entrypoint, debug_entrypoint):
@staticmethod
def _prepare_entrypoint(entrypoint):
entrypoint_list = entrypoint.split()

if entrypoint_list[0] != "python":
# Handle uv package manager's run command by ensuring the correct interpreter path
# and explicitly adding 'python' to the execution chain for consistency.
if entrypoint_list[0].endswith("uv") and entrypoint_list[1] == "run":
entrypoint_list[2] = f"$(which {entrypoint_list[2]})"
entrypoint_list = entrypoint_list[:2] + ["python"] + entrypoint_list[2:]

# For non-python commands, ensure we have the absolute path to the Python executable
# and explicitly run it through Python for consistent execution behavior.
elif entrypoint_list[0] != "python":
entrypoint_list[0] = f"$(which {entrypoint_list[0]})"
entrypoint_list = ["python"] + entrypoint_list
entrypoint = entrypoint_list

entrypoint = " ".join(entrypoint_list)
return entrypoint
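The rewriting above can be captured as a standalone sketch of `_prepare_entrypoint` (simplified; error handling and edge cases omitted):

```python
def prepare_entrypoint(entrypoint: str) -> str:
    """Rewrite a test entrypoint so it always runs through `python`."""
    parts = entrypoint.split()
    if parts[0].endswith("uv") and parts[1] == "run":
        # `uv run pytest ...` -> `uv run python $(which pytest) ...`
        parts[2] = f"$(which {parts[2]})"
        parts = parts[:2] + ["python"] + parts[2:]
    elif parts[0] != "python":
        # `pytest ...` -> `python $(which pytest) ...`
        parts[0] = f"$(which {parts[0]})"
        parts = ["python"] + parts
    return " ".join(parts)

print(prepare_entrypoint("pytest -x"))      # python $(which pytest) -x
print(prepare_entrypoint("uv run pytest"))  # uv run python $(which pytest)
```

Entrypoints that already start with `python` pass through unchanged; the `$(which ...)` substitution resolves the command to an absolute path inside the shell that eventually runs it.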
@@ -489,6 +495,11 @@ def patch(self):
patch = result.stdout.replace(str(self.working_dir), str(self.path))
return patch

def apply_gold_patch(self):
raise NotImplementedError(
f"apply_gold_patch is not implemented for {self.__class__.__name__}."
)

def step(self, action: ToolCall) -> EnvInfo:
# given action, return new obs, and update infos
# the action space is composed of a few smaller action spaces