Resolve judge code smell #102

jgieringer · 2026-02-07T01:27:57Z

Description

Move the judge CLI (get_parser, main) from the root script into the package so tests can import it normally.

Add `judge/cli.py` with `get_parser()` and `async main(args)` (unchanged behavior).
Slim `judge.py` to a thin entrypoint that delegates to `judge.cli`.
Update `tests/unit/judge/test_judge_cli.py` to use `from judge.cli import get_parser, main` and drop the `importlib` + `Path(__file__).parents[3]` hack.
Update `run_pipeline.py` to use `from judge.cli import main as judge_main` instead of loading the script by path.
Update `tests/integration/test_pipeline.py` to patch `judge.cli.main` instead of `importlib.util.spec_from_file_location` / `module_from_spec`.

Resolves the concern that tests were hacking script location; tests now rely on normal imports.
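
For readers who have not opened the diff, the shape of the refactor described above can be sketched roughly as follows. This is a single-file illustration, not the code from this PR: `get_parser` and `main` stand in for the functions now living in `judge/cli.py`, their bodies are placeholders, and `run` is a hypothetical name for what the slimmed-down `judge.py` would do.

```python
import argparse
import asyncio


def get_parser() -> argparse.ArgumentParser:
    """Stand-in for judge.cli.get_parser(); the real parser defines more flags."""
    parser = argparse.ArgumentParser(prog="judge")
    parser.add_argument("--rubrics", default=None)
    return parser


async def main(args: argparse.Namespace) -> int:
    """Stand-in for judge.cli.main(args); the real body runs the judge."""
    await asyncio.sleep(0)  # placeholder for the actual async work
    return 0


def run(argv=None) -> int:
    """Hypothetical thin entrypoint: parse arguments, then delegate to the async main."""
    args = get_parser().parse_args(argv)
    return asyncio.run(main(args))

# In judge.py, run() would be called under the usual __main__ guard.
```

With this layout, a test can use a plain import of `get_parser` and `main` from the package and never needs to locate the root script on disk.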

Copilot

Pull request overview

Refactors the judge CLI entrypoint into the judge package so it can be imported normally (improving testability and removing path-based import hacks).

Changes:

Added judge/cli.py containing get_parser() and async main(args).
Converted root judge.py into a thin entrypoint delegating to judge.cli.
Updated pipeline + tests to import/patch judge.cli directly instead of loading judge.py via importlib + file paths.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `judge/cli.py` | New module that houses the judge CLI parser + async main logic. |
| `judge.py` | Now delegates to `judge.cli` and runs the async entrypoint. |
| `run_pipeline.py` | Imports `judge.cli.main` directly instead of loading `judge.py` by path. |
| `tests/unit/judge/test_judge_cli.py` | Uses normal imports from `judge.cli` and patches module objects directly. |
| `tests/integration/test_pipeline.py` | Patches `judge.cli.main` directly and removes importlib-based mocking. |

Comments suppressed due to low confidence (1)

run_pipeline.py:195

The comment says imports are deferred "to allow --debug flag to be set", but judge.cli is imported before set_debug(True) runs. Either move the generate_main/judge_main imports to after the debug flag handling, or update the comment so it doesn’t claim behavior that isn’t true.

    # Import generate and judge main functions
    # We import here to avoid circular dependencies and to allow --debug flag to be set
    from generate import main as generate_main
    from judge.cli import main as judge_main

    # Set debug mode if flag is provided
    if args.debug:
        from utils.debug import set_debug
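
The first option in this comment (handle the flag, then import) can be illustrated with a small stand-alone sketch. Everything here is a stand-in: `_DEBUG` and this `set_debug` imitate whatever state `utils.debug` keeps, and `import_stage_mains` imitates the deferred imports by recording the flag value visible at that moment. Whether `judge.cli` actually reads debug state at import time is not established in this thread; if it does not, updating the code comment is the simpler fix.

```python
import argparse

_DEBUG = False  # hypothetical stand-in for the state behind utils.debug.set_debug


def set_debug(value: bool) -> None:
    """Stand-in for utils.debug.set_debug."""
    global _DEBUG
    _DEBUG = value


def import_stage_mains() -> dict:
    """Stand-in for the deferred imports; records the flag as seen at import time."""
    return {"debug_seen_at_import": _DEBUG}


def run(argv) -> dict:
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true")
    args = parser.parse_args(argv)

    # Handle the debug flag first...
    set_debug(args.debug)

    # ...and only then perform the deferred imports, so they observe the flag.
    return import_stage_mains()
```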

Copilot · 2026-02-07T01:31:53Z

judge/cli.py

+            "Default: temperature=0 (unless overridden)"
+        ),
+        type=parse_key_value_list,
+        default={},


--judge-model-extra-params uses a mutable dict (default={}) as the argparse default. Because LLMJudge mutates this dict (e.g., sets temperature), subsequent parse_args() calls in the same process can inherit mutated defaults. Use default=None (and normalize to {} in main) or ensure a fresh dict per parse (e.g., copy).

Suggested change

-        default={},
+        default=None,
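
The sharing behavior behind this comment can be reproduced with `argparse` alone. The sketch below is illustrative only: `parse_key_value_list` here is a simplified stand-in for the project's helper, and `build_parser` and `normalize_extra_params` are hypothetical names. It assumes the same parser object is reused for several `parse_args()` calls, which is the situation in which a mutated default carries over.

```python
import argparse


def parse_key_value_list(text: str) -> dict:
    """Simplified stand-in: parse 'a=1,b=2' into a dict of strings."""
    result = {}
    for item in text.split(","):
        if not item.strip():
            continue
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result


def build_parser(default) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--judge-model-extra-params",
        type=parse_key_value_list,
        default=default,
    )
    return parser


def normalize_extra_params(args: argparse.Namespace) -> dict:
    """Return a fresh dict so callers can mutate it without touching the default."""
    params = args.judge_model_extra_params
    return dict(params) if params else {}


# With default={}, every parse without the flag hands out the same dict object.
shared = build_parser(default={})
first = shared.parse_args([])
first.judge_model_extra_params["temperature"] = 0  # mimics a downstream mutation
leaked = shared.parse_args([]).judge_model_extra_params

# With default=None plus normalization, each call gets its own dict.
safe = build_parser(default=None)
a = normalize_extra_params(safe.parse_args([]))
a["temperature"] = 0
b = normalize_extra_params(safe.parse_args([]))
```

Here `leaked` carries the `temperature` key set after the first parse, while `b` stays empty.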

Copilot · 2026-02-07T01:31:54Z

judge/cli.py

+
+    # Load rubric configuration once at startup
+    print("📚 Loading rubric configuration...")
+    rubric_config = await RubricConfig.load(rubric_folder="data")


The CLI exposes --rubrics (and run_pipeline.py passes rubrics=args.rubrics), but main() always loads RubricConfig from the hard-coded rubric_folder="data" and never uses args.rubrics. This makes the flag ineffective and can lead to confusing behavior. Either wire args.rubrics into RubricConfig.load (e.g., derive folder/filename from the provided path) or remove/rename the flag if it’s not supported.

Suggested change

-    rubric_config = await RubricConfig.load(rubric_folder="data")
+    rubric_folder = getattr(args, "rubrics", None) or "data"
+    rubric_config = await RubricConfig.load(rubric_folder=rubric_folder)
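
The fallback expression in this suggestion can be checked in isolation. In the sketch below, `resolve_rubric_folder` is a hypothetical helper name, not something in the PR; it only covers how the folder value is chosen. Whether `--rubrics` is meant to be a folder or a file path, and whether `RubricConfig.load` accepts it as-is, is not settled in this thread.

```python
import argparse


def resolve_rubric_folder(args: argparse.Namespace, default: str = "data") -> str:
    """Prefer args.rubrics when present and non-empty, else fall back to the default."""
    return getattr(args, "rubrics", None) or default


explicit = resolve_rubric_folder(argparse.Namespace(rubrics="custom_rubrics"))
unset = resolve_rubric_folder(argparse.Namespace(rubrics=None))
missing = resolve_rubric_folder(argparse.Namespace())  # attribute absent entirely
```

Using `getattr` with a default keeps `main()` callable with namespaces built by other callers (for example a pipeline script) that may not set `rubrics` at all.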

refactor judge to reduce hacky imports

b489be7

jgieringer changed the base branch from main to jgieringer/unit-testing February 7, 2026 01:28

jgieringer requested review from Copilot and removed request for Copilot February 7, 2026 01:28

Copilot started reviewing on behalf of jgieringer February 7, 2026 01:29

jgieringer mentioned this pull request Feb 7, 2026

Unit Test Audit #100

Merged

Base automatically changed from jgieringer/unit-testing to main February 7, 2026 01:29

jgieringer requested review from Copilot, emily-vanark and sator-labs February 7, 2026 01:30

Copilot started reviewing on behalf of jgieringer February 7, 2026 01:30

Copilot AI reviewed Feb 7, 2026

jgieringer mentioned this pull request Feb 7, 2026

Judge integration test #101

Merged
