Merged
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -33,7 +33,7 @@ jobs:
- name: Test with pytest - PR
if: github.event_name == 'pull_request'
run: |
DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
DEBUG_GYM_DEBUG=1 pytest -vv -n 16 -k "not test_swe_bench and not test_swe_smith" --cov=debug_gym --cov-report=term-missing --cov-fail-under=80 --timeout=300
- name: Test with pytest
if: github.event_name != 'pull_request'
run: |
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -18,4 +18,8 @@ Added in the [analysis](https://github.com/microsoft/debug-gym/tree/main/analysi

### 2025-05-28

Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.
Improved the View tool, added the `start` and `end` arguments so the agent can specify a particular chunk of code to view.

### 2025-06-11

Added support for [SWE-smith](https://swesmith.com/). Users can use the tasks shipped with the official SWE-smith package, or custom tasks generated with SWE-smith.
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -0,0 +1 @@
include debug_gym/envs/configs/*.yaml
25 changes: 20 additions & 5 deletions README.md
@@ -98,6 +98,7 @@ We provide the below LLM-based agents, they all have minimal design and serve th
| `debug_agent` | `pdb`, `rewrite`, `view`, `eval` | A minimal agent that dumps all available information into its prompt and queries the LLM to generate a command. |
| `rewrite_agent` | `rewrite`, `view`, `eval` | A `debug_agent` with the `pdb` tool disabled (the agent keeps rewriting). |
| `debug_5_agent` | `pdb`, `rewrite`, `view`, `eval` | A `debug_agent`, but the `pdb` tool is only enabled after a certain number of rewrites. |
| `solution_agent` | `pdb`, `eval` | An oracle agent that applies a gold patch (only works with the `swebench` and `swesmith` benchmarks for now). The agent checks that the tests fail before applying the patch and pass afterward. It also checks that the `pdb` tool can be used as expected. |

---

@@ -109,6 +110,7 @@ To demonstrate how to integrate `debug-gym` with coding tasks and repositories,
| :-: | :----- |
| `aider` | [https://github.com/Aider-AI/aider](https://github.com/Aider-AI/aider) |
| `swebench`| [https://github.com/princeton-nlp/SWE-bench](https://github.com/princeton-nlp/SWE-bench) |
| `swesmith`| [https://github.com/SWE-bench/SWE-smith](https://github.com/SWE-bench/SWE-smith) |
| `mini_nightmare` | A set of 10 hand-crafted minimal buggy code snippets that rewrite-only agents have a harder time tackling. Read details [here](https://github.com/microsoft/debug-gym/blob/main/data/mini_nightmare/mini_nightmare.md). |

---
@@ -122,28 +124,41 @@ Add `-v`, `--debug` to be verbose, or to enter debug mode.
> [!WARNING]
> When using --debug, you will need to press `c` to continue after each reasoning step.

#### 3.1 Human Mode
#### 3.1 Sanity Checks

You can use the `solution_agent` to validate that your `swebench` and `swesmith` instances work as expected. This agent applies a gold patch to the buggy code and checks that the tests fail before the patch is applied and pass afterward. It also checks that the `pdb` tool can be used as expected.

python scripts/run.py scripts/config_swebench.yaml --agent solution_agent
python scripts/run.py scripts/config_swesmith.yaml --agent solution_agent

#### 3.2 Human Mode

We provide a human mode that enables developers to manually interact with `debug-gym`. To activate this mode, change the `llm_name` field in the `config_*.yaml` to be `"human"`. Once activated, at every step, the environment will expect a command input (in tool calling format). One can use the `Tab` key to get a list of tool calling templates and fill in any necessary arguments.

#### 3.2. Overriding Values in Config
#### 3.3. Overriding Values in Config

`-p` is a handy way to override values defined in the config. For example, the command below runs the `rewrite_agent` on Aider in human mode (while the config file specifies gpt-4o).

python scripts/run.py scripts/config_aider.yaml --agent rewrite_agent -v -p rewrite_agent.llm_name="human"

#### 3.3. Debugging a Custom Repository
#### 3.4. Debugging a Custom Repository

Modify `scripts/config.yaml`, especially the `env_kwargs`, to set the path and entrypoint of the custom repository. We assume the repository contains a `.debugignore` file and a `.debugreadonly` file that label files/folders as hidden or read-only, respectively.

As an example, we provide a buggy pytorch code repository in `data/pytorch`.

python scripts/run.py scripts/config.yaml --agent <agent name>
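For illustration, assuming `.debugignore` uses gitignore-style patterns (an assumption; check the source for the exact matching semantics), such a file might look like:

```
# Hidden from the agent entirely
.git/
__pycache__/
*.pyc
```

A `.debugreadonly` file in the same style could then list paths the agent may read but not edit, e.g. a `tests/` folder.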

#### 3.4. Design Your Own Tool
#### 3.5. Debugging a Custom SWE-Smith Instance

[SWE-Smith](https://github.com/SWE-bench/SWE-smith) allows generating new buggy code instances. Given a custom HuggingFace dataset (either local or remote) with a structure similar to [SWE-bench/SWE-smith](https://huggingface.co/datasets/SWE-bench/SWE-smith), you can pass `-p base.env_kwargs.dataset_id=<dataset_id>` on the command line to run the agent on that dataset. For example, to run on a local dataset:

python scripts/run.py scripts/config_swesmith.yaml --agent <agent name> -p base.env_kwargs.dataset_id="path/to/local/dataset"
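If the local dataset is stored as JSON lines, a quick sanity check of its columns can be sketched with the standard library. The field names below are assumptions based on SWE-bench-style datasets; adjust them to your schema:

```python
import json

# Assumed column names (SWE-bench-style); replace with your dataset's schema.
REQUIRED_FIELDS = {"instance_id", "repo", "patch"}

def missing_fields(jsonl_lines):
    """Return the set of required fields absent from any record."""
    missing = set()
    for line in jsonl_lines:
        record = json.loads(line)
        missing |= REQUIRED_FIELDS - record.keys()
    return missing

sample = ['{"instance_id": "demo__pkg.abc123", "repo": "demo/pkg", "patch": "diff ..."}']
print(missing_fields(sample))  # set() means every record has all required fields
```

An empty result means the records at least carry the expected columns; it does not validate their contents.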

#### 3.6. Design Your Own Tool
`debug-gym`'s modular design makes it extensible. Users are encouraged to extend `debug-gym` to their specific use cases, for example by creating new tools that diversify an agent's action and observation spaces. For detailed instructions on designing new `debug-gym`-compatible tools, please refer to the [Technical Report](https://arxiv.org/abs/2503.21557).

#### 3.5. Analysis and Visualization
#### 3.7. Analysis and Visualization

We provide a set of scripts to help analyze the log files (e.g., the `.jsonl` files) generated by the agent.
- In the `analysis` folder, we provide scripts that used to generate the corresponding figures in our technical report.
1 change: 1 addition & 0 deletions debug_gym/agents/__init__.py
@@ -1,2 +1,3 @@
from debug_gym.agents.debug_agent import Debug_5_Agent, DebugAgent
from debug_gym.agents.rewrite_agent import RewriteAgent
from debug_gym.agents.solution_agent import AgentSolution
69 changes: 69 additions & 0 deletions debug_gym/agents/solution_agent.py
@@ -0,0 +1,69 @@
import subprocess

from debug_gym.agents.base_agent import BaseAgent, register_agent
from debug_gym.gym.envs.swe_bench import SWEBenchEnv
from debug_gym.gym.envs.swe_smith import SWESmithEnv
from debug_gym.gym.tools.tool import ToolCall


@register_agent
class AgentSolution(BaseAgent):
name: str = "solution_agent"

def run(self, task_name=None, debug=False):
self.history.reset()

info = self.env.reset(options={"task_name": task_name})
self.history.step(info)

if info.done is True:
return True

self.logger.info(
f"Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%})"
)

# Make a simple pdb call to make sure it is working.
action = ToolCall(name="pdb", id="pdb", arguments={"command": "help help"})
pdb_help_info = self.env.step(action)
assert (
"h(elp)" in pdb_help_info.step_observation.observation
), f"PDB command did not return expected help message.\n{pdb_help_info.step_observation.observation}"

# Send a pdb continue command, and check the output matches the one from env.reset.
action = ToolCall(name="pdb", id="pdb", arguments={"command": "continue"})
pdb_continue_info = self.env.step(action)

assert (
"Reached the end of the program. Restarting the debugging session."
in pdb_continue_info.step_observation.observation
) or (
info.step_observation.observation.splitlines()[-1]
in pdb_continue_info.step_observation.observation
), f"PDB command did not return expected continue message.\n{pdb_continue_info.step_observation.observation}"

try:
self.env.apply_gold_patch()
except NotImplementedError as e:
self.logger.error(
f"The environment {type(self.env)} is not compatible with SolutionAgent. "
"Check the README.md to see which environments are compatible."
)
raise

if debug:
breakpoint()

action = ToolCall(name="eval", id="eval", arguments={})
info = self.env.step(action)

self.history.step(info)

self.logger.info(
f"Score: {info.score}/{info.max_score} ({info.score/info.max_score:.1%})"
)
assert (
info.done
), f"The task is not done after applying the gold patch.\n{info.step_observation.observation}"

return info.done
5 changes: 5 additions & 0 deletions debug_gym/agents/utils.py
@@ -108,6 +108,11 @@ def load_config():
action="store_true",
help="Break before sending action to the environment.",
)
parser.add_argument(
"--list",
action="store_true",
help="List available agents and problems.",
)
group = parser.add_mutually_exclusive_group()
group.add_argument(
"-v",
6 changes: 4 additions & 2 deletions debug_gym/gym/envs/__init__.py
@@ -2,18 +2,20 @@
from debug_gym.gym.envs.env import RepoEnv, TooledEnv
from debug_gym.gym.envs.mini_nightmare import MiniNightmareEnv
from debug_gym.gym.envs.swe_bench import SWEBenchEnv
from debug_gym.gym.envs.swe_smith import SWESmithEnv


def select_env(env_type: str = None):
def select_env(env_type: str = None) -> type[RepoEnv]:
match env_type:
case None:
return RepoEnv
case "aider":
return AiderBenchmarkEnv
case "swebench":
return SWEBenchEnv
case "swesmith":
return SWESmithEnv
case "mini_nightmare":
return MiniNightmareEnv
case _:
raise ValueError(f"Unknown benchmark {env_type}")
15 changes: 15 additions & 0 deletions debug_gym/gym/envs/configs/swe_smith.yaml

Large diffs are not rendered by default.

17 changes: 14 additions & 3 deletions debug_gym/gym/envs/env.py
@@ -252,11 +252,17 @@ def set_entrypoints(self, entrypoint, debug_entrypoint):
@staticmethod
def _prepare_entrypoint(entrypoint):
entrypoint_list = entrypoint.split()

if entrypoint_list[0] != "python":
# Handle uv package manager's run command by ensuring the correct interpreter path
# and explicitly adding 'python' to the execution chain for consistency.
if entrypoint_list[0].endswith("uv") and entrypoint_list[1] == "run":
entrypoint_list[2] = f"$(which {entrypoint_list[2]})"
entrypoint_list = entrypoint_list[:2] + ["python"] + entrypoint_list[2:]

# For non-python commands, ensure we have the absolute path to the Python executable
# and explicitly run it through Python for consistent execution behavior.
elif entrypoint_list[0] != "python":
entrypoint_list[0] = f"$(which {entrypoint_list[0]})"
entrypoint_list = ["python"] + entrypoint_list
entrypoint = entrypoint_list

entrypoint = " ".join(entrypoint_list)
return entrypoint
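The rewriting above can be captured as a standalone sketch of `_prepare_entrypoint` (simplified; error handling and edge cases omitted):

```python
def prepare_entrypoint(entrypoint: str) -> str:
    """Rewrite a test entrypoint so it always runs through `python`."""
    parts = entrypoint.split()
    if parts[0].endswith("uv") and parts[1] == "run":
        # `uv run pytest ...` -> `uv run python $(which pytest) ...`
        parts[2] = f"$(which {parts[2]})"
        parts = parts[:2] + ["python"] + parts[2:]
    elif parts[0] != "python":
        # `pytest ...` -> `python $(which pytest) ...`
        parts[0] = f"$(which {parts[0]})"
        parts = ["python"] + parts
    return " ".join(parts)

print(prepare_entrypoint("pytest -x"))      # python $(which pytest) -x
print(prepare_entrypoint("uv run pytest"))  # uv run python $(which pytest)
```

Entrypoints that already start with `python` pass through unchanged; the `$(which ...)` substitution resolves the command to an absolute path inside the shell that eventually runs it.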
@@ -489,6 +495,11 @@ def patch(self):
patch = result.stdout.replace(str(self.working_dir), str(self.path))
return patch

def apply_gold_patch(self):
raise NotImplementedError(
f"apply_gold_patch is not implemented for {self.__class__.__name__}."
)

def step(self, action: ToolCall) -> EnvInfo:
# given action, return new obs, and update infos
# the action space is composed of a few smaller action spaces