-
Notifications
You must be signed in to change notification settings - Fork 6
feat: add end-to-end pipeline script for generation → evaluation → scoring #91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Create run_pipeline.py to orchestrate the complete workflow: generation → evaluation → scoring. The script imports and calls main() functions directly from generate.py and judge.py, then calls score_results() from judge/score.py. Changes: - Add run_pipeline.py: ~270 lines, orchestrates all three stages - Modify generate.py main() to return (results, folder_name) tuple - Modify judge.py main() to return Optional[str] output folder path - Modify judge_conversations() to return (results, output_folder) tuple - Update README.md with pipeline usage documentation Benefits: - Single command replaces three-step manual process - Clean Python code with direct function imports (no subprocess) - Native return values (no stdout parsing or temp files) - Standard async/await patterns - Easy to test and maintain All changes are backwards compatible - CLI scripts work unchanged. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add comprehensive integration tests for the end-to-end pipeline script: - Argument parsing and validation tests - Configuration building from arguments - Data flow between pipeline stages (generation → evaluation → scoring) - Extra parameters handling for all models - Path construction and passing between stages Tests verify that run_pipeline.py correctly orchestrates the three-stage workflow and properly transforms arguments for each stage. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fix 15 test failures caused by breaking API change in judge_conversations(), which now returns tuple (results, output_folder) instead of just results. Changes: - Update 13 tests in test_evaluation_runner.py to unpack tuple - Update 2 tests in test_runner_extra_params.py to unpack tuple - All 510 tests now pass (previously 495 passing, 15 failing) This change aligns tests with the new API introduced to support run_pipeline.py, which needs both results and output folder path for pipeline orchestration. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add three missing arguments to run_pipeline.py to match generate.py and judge.py individual script capabilities: - --run-id: Allow custom run identifiers (was only in generate.py) - --rubrics: Support custom rubric files (was hardcoded) - --judge-output: Control evaluation output folder (was hardcoded) All arguments are optional with sensible defaults for backward compatibility. This makes the pipeline script a proper superset of individual scripts. Changes: - Add argument parsing for --run-id, --rubrics, --judge-output - Pass run_id to generate_main() instead of relying on default - Pass rubrics to judge instead of hardcoded ["data/rubric.tsv"] - Pass judge_output instead of hardcoded "evaluations" - Update test fixture to include new arguments - Add 11 new test cases covering argument parsing and defaults All 520 tests pass. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add test cases to verify that short flags work correctly for extra params and run-id arguments: - test_short_flags_for_extra_params: Verify -uep, -pep, -jep work - test_short_flag_for_run_id: Verify -i works for --run-id These tests ensure compatibility with generate.py and judge.py short flag conventions. All 522 tests pass. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update test_parse_arguments_with_all_optional_arguments to include the newly added arguments: - --run-id - --rubrics (with multiple values) - --judge-output This ensures the test covers all optional arguments available in run_pipeline.py. All 522 tests pass. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update test_judge_args_namespace_structure to use pipeline_args values instead of hardcoded defaults for rubrics and judge_output. This makes the test more maintainable and ensures it reflects the actual default values from the fixture. All 522 tests pass. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a unified pipeline orchestration script (run_pipeline.py) that automates the complete VERA-MH workflow: conversation generation → LLM evaluation → scoring and visualization. The script eliminates the need for users to manually run three separate commands and copy output paths between stages.
Changes:
- Added
run_pipeline.pywith comprehensive argument support matching individual scripts - Modified
generate.py,judge.py, andjudge/runner.pyto return tuples containing both results and output folder paths - Updated 15 existing tests to handle new tuple return types and added comprehensive pipeline integration tests
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| run_pipeline.py | New unified pipeline orchestration script with argument parsing and stage coordination |
| generate.py | Modified main() to return tuple (results, folder_path) for pipeline chaining |
| judge.py | Modified main() to return Optional[str] output folder path for pipeline chaining |
| judge/runner.py | Modified judge_conversations() to return tuple (results, output_folder) |
| tests/integration/test_pipeline.py | Comprehensive integration tests for argument parsing and configuration building (593 lines) |
| tests/integration/test_evaluation_runner.py | Updated 13 tests to unpack tuple return values |
| tests/unit/judge/test_runner_extra_params.py | Updated 2 tests to unpack tuple return values |
| README.md | Added documentation for the new pipeline script with usage examples |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
run_pipeline.py
Outdated
| type=int, | ||
| help="Maximum number of personas to load (for testing)", | ||
| ) | ||
| parser.add_argument("--folder-name", help="Custom folder name for conversations") |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument --folder-name is missing its short flag -f which is available in generate.py. This creates an inconsistency between the individual script and the pipeline script. Users familiar with generate.py might expect to use -f here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed - What where the f! 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking about it, now i do wonder if the runner script should be the ONLY entry point, and then it might call sub parts of it. that would prevent the problem with having to manually sync args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but a discussion for another day
run_pipeline.py
Outdated
| score_results( | ||
| results_csv_path=results_csv, | ||
| personas_tsv_path=args.personas_tsv, | ||
| skip_risk_analysis=args.skip_risk_analysis, | ||
| ) |
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pipeline should handle potential errors from score_results to provide better error messages. If scoring fails, the error should be caught and a user-friendly error message should be displayed. This would help users understand what went wrong in the final stage.
The score_results function only accepts results_csv_path and output_json_path parameters. The pipeline was incorrectly passing personas_tsv_path and skip_risk_analysis arguments that don't exist. Changes: - Import all necessary scoring functions at top of file - Call score_results() for standard analysis and visualization - Conditionally call score_results_by_risk() for risk-level analysis - Properly handle skip_risk_analysis flag This matches the actual judge/score.py API where: - score_results() does standard analysis (no risk levels) - score_results_by_risk() does risk-level analysis (requires personas_tsv) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The issue was that `from judge import main` was trying to import from the judge/ package (__init__.py), but the main() function is in the judge.py module file at the root level. Fixed by using importlib to explicitly load judge.py as a module, avoiding the package/module name collision. This allows the pipeline to correctly import and call judge.main(). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add informative log message explaining that single conversation mode doesn't return an output folder, clarifying the intent behind the existing comment and making the behavior more visible to users. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add comprehensive test suite for run_pipeline.py validation logic that verifies proper error handling when Steps 1 or 2 produce empty output folders. Test Coverage: - Step 1 validation: folder existence, conversation files, log-only files - Step 2 validation: folder return, existence, results.csv presence - Error message verification including file listing - Success path validation messages Implementation: - 8 new test cases in TestPipelineValidation class - Mock-based approach using unittest.mock.patch - SystemExit handling via pytest.raises(SystemExit) - Output capture using capsys fixture for message verification - valid_pipeline_args fixture for test reusability Impact: - run_pipeline.py coverage: 5% → 96% - All 530 tests passing (522 existing + 8 new) - Overall project coverage: 75% Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 11 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| preview = msg.content[:100] | ||
| content_preview = preview + "..." if len(msg.content) > 100 else msg.content | ||
| debug_print(f" {i+1}. {msg_type}: {content_preview}") | ||
| debug_print(f" {i + 1}. {msg_type}: {content_preview}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that feels illegal
Replace all occurrences of the deprecated model name 'claude-3-5-sonnet-20241022' with 'claude-sonnet-4-5-20250929' to fix 404 errors in the pipeline. The old model is no longer available in Anthropic's API. Changes: - Update default model in llm_clients/config.py - Update all documentation examples (README.md, help text) - Update model_config.json with new default and add model entry - Update all test files to use new model name - Update fallback defaults in utils/model_config_loader.py This fixes the pipeline failure where judge evaluations were failing with 404 "model not found" errors. Verified: - Unit tests pass (test_config.py, test_claude_llm.py) - Integration tests updated and pass - Code quality checks pass (ruff format, ruff check) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…to feat/single_script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 17 out of 20 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "assistant": "claude-sonnet-4-5-20250929", | ||
| "philosopher": "claude-3-opus-20240229", | ||
| "debate_starter": "claude-3-sonnet-20240229", | ||
| "creative": "claude-3-haiku-20240307", | ||
| "scientist": "claude-3-5-sonnet-20241022", | ||
| "skeptic": "claude-3-5-sonnet-20241022", | ||
| "scientist": "claude-sonnet-4-5-20250929", | ||
| "skeptic": "claude-sonnet-4-5-20250929", | ||
| "gpt_assistant": "gpt-4", | ||
| "gpt_creative": "gpt-4-turbo", | ||
| "gpt_analyst": "gpt-3.5-turbo", | ||
| "claude-sonnet-4-20250514": "claude-sonnet-4-20250514" | ||
| "claude-sonnet-4-20250514": "claude-sonnet-4-20250514", | ||
| "claude-sonnet-4-5-20250929": "claude-sonnet-4-5-20250929" | ||
| }, | ||
| "default_model": "claude-3-5-sonnet-20241022" | ||
| "default_model": "claude-sonnet-4-5-20250929" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sator-labs should this file be taken out? If so, maybe a different PR
I don't see this model_config.json used anywhere but test_model_config_loader.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. this file needs a little cleaning
like the above roles are from a very old version and not used a ymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch. actually i think the file should be removed entierly. let's do another one after?
jgieringer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Curious though about https://github.com/SpringCare/VERA-MH/pull/91/files#r2722845361 for a later work item
Summary
Adds a unified Python script (
run_pipeline.py) that orchestrates the complete VERA-MH workflow in a single command: conversation generation → LLM evaluation → scoring and visualization.This PR consolidates the entire pipeline into a pure Python implementation with comprehensive argument support matching the individual scripts.
Motivation
Previously, users had to manually run three separate commands and copy output paths between stages:
This was error-prone and tedious. The new pipeline script automates this completely:
# New way - automatic path management python3 run_pipeline.py \ --user-agent claude-3-5-sonnet-20241022 \ --provider-agent gpt-4o \ --runs 2 --turns 10 \ --judge-model claude-3-5-sonnet-20241022 \ --max-personas 5Changes
Core Implementation
run_pipeline.py- Unified Python pipeline script (295 lines)--run-id,--max-personas,--max-total-words, etc.--rubrics,--judge-output,--judge-max-concurrent, etc.-uep,-pep,-jep,-i(matching individual scripts)API Changes for Pipeline Support
generate.py::main()now returnstuple[results, folder_path]judge.py::main()now returnsOptional[str](output folder path)judge/runner.py::judge_conversations()now returnstuple[results, output_folder]These changes enable the pipeline to automatically chain stages without manual path copying.
Testing
tests/integration/test_pipeline.py- Comprehensive integration tests (593 lines)Test Fixes - Updated 15 tests to handle new tuple return types
tests/integration/test_evaluation_runner.pytests/unit/judge/test_runner_extra_params.pyArgument Consistency
The pipeline now supports all arguments from both individual scripts:
From generate.py
--user-agent(-u),--provider-agent(-p)--runs(-r),--turns(-t)--user-agent-extra-params(-uep),--provider-agent-extra-params(-pep)--max-total-words(-w),--max-concurrent,--max-personas--folder-name,--run-id(-i),--debugFrom judge.py
--judge-model(-j),--judge-model-extra-params(-jep)--rubrics,--judge-output--judge-max-concurrent,--judge-per-judge,--judge-limit--judge-verbose-workersAdditional scoring arguments
--skip-risk-analysis,--personas-tsvAll short flags are consistent with the individual scripts for a seamless user experience.
Usage Examples
Basic Pipeline Run
With Custom Run ID and Output Folders
With Multiple Rubrics
With Model Parameters (using short flags)
Testing with Limited Personas
Output
The script provides clear progress tracking and a final summary:
Breaking Changes
generate.py::main():List[Dict]→tuple[List[Dict], str]judge/runner.py::judge_conversations():List[Dict]→tuple[List[Dict], str]All existing tests have been updated to handle these changes. External code calling these functions will need to update to unpack the tuple:
Test Results
Files Changed
run_pipeline.py(295 lines) - Python pipeline orchestrationtests/integration/test_pipeline.py(593 lines) - Comprehensive pipeline testsgenerate.py- Return tuple with folder pathjudge.py- Return output folder pathjudge/runner.py- Return tuple with output folderMigration Notes
No migration needed! All new arguments are optional with sensible defaults. Existing commands continue to work unchanged.
New capabilities available:
--run-id custom_id- Set custom run identifier--rubrics file1.tsv file2.tsv- Use multiple rubric files--judge-output custom_folder- Change evaluation output locationChecklist
Related Issues
Closes #[issue-number] (if applicable)
Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com