This repo provides the extended environments for CoMLRL.
Install CoMLRL:

```bash
pip install comlrl
# Install PyTorch compatible with your device
```

Or via conda-forge:

```bash
conda install -c conda-forge comlrl
# Install PyTorch compatible with your device
```

ClassEval dataset:

- Contents: each sample includes a class skeleton, method stubs (with docstrings or `pass`), and canonical hidden tests.
- Splitting: `train/train_magrpo.py` loads explicit HF slices from `dataset.train_split` and `dataset.eval_split` (e.g., `test[:50]` and `test[50:]`).
- Subsetting: if a split name is missing (e.g., ClassEval only has `test`), the loader falls back to the first available split before slicing.
- Prompting: prompts include the sanitized class skeleton, explicit method names per agent, and any collaboration instructions.
- Testing: reward code merges agent completions back into the skeleton and runs the provided unit tests inside a temporary directory to isolate state.
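The split-and-fallback behaviour described above can be illustrated with a small sketch. It works on plain Python lists so it stays self-contained. The function name `resolve_split` and the regex are hypothetical and are not the loader in `train/train_magrpo.py`. Only `[start:end]` integer slices are handled here.

```python
import re
from typing import Dict, List, Sequence

# Matches specs like "test", "test[:50]", "test[50:]", "train[10:20]".
_SPLIT_SPEC = re.compile(r"^(?P<name>[\w.-]+)(?:\[(?P<start>-?\d+)?:(?P<end>-?\d+)?\])?$")


def resolve_split(splits: Dict[str, Sequence], spec: str) -> List:
    """Return the rows selected by an HF-style split spec such as "test[:50]".

    Hypothetical helper: if the named split is missing, fall back to the
    first available split before applying the slice.
    """
    match = _SPLIT_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"unrecognized split spec: {spec!r}")
    name = match.group("name")
    if name not in splits:
        if not splits:
            raise ValueError("dataset has no splits")
        name = next(iter(splits))  # first available split, in insertion order
    rows = list(splits[name])
    start = int(match.group("start")) if match.group("start") else None
    end = int(match.group("end")) if match.group("end") else None
    return rows[start:end]
```

With a dataset that only has `test`, a config asking for `train[:10]` would still return the first ten `test` rows. That mirrors the fallback described for ClassEval.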
Key sections in `configs/magrpo_classeval_config.yaml`:

- `model`: base checkpoint (`Qwen/Qwen2.5-Coder-3B-Instruct` by default), tokenizer/model kwargs, and device mapping.
- `dataset`: dataset name and split strings (`train_split`, `eval_split`) for ClassEval sub-slices or local mirrors.
- `external`: determines the feedback mode. `token_report` summarizes syntax/tests at each turn; other modes replicate the options documented in the code-generation README (`plain`, `level_feedback`, `group_feedback`, `personal_feedback`, `personal_detailed_feedback`, `passed`, `level_passed`).
- `magrpo`: forwarded to `comlrl.trainers.magrpo.MAGRPOTrainer`. Includes collaboration settings (`num_agents`, TAKE_JOB self-selection), sampling settings (`num_generations`, `num_turns`, temperature/top_p), rollout buffering (`rollout_buffer_size`), optimization hyperparameters, and IO controls.
- `output`: persistence knobs (save final model, keep tmp dirs); environment variables such as `CLASSEVAL_TMP_BASE` are derived from this section to colocate temp files per job.
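To make the section layout concrete, here is a hedged sketch of the config as a Python dict, plus a sanity check on it. The top-level section names, the split keys, the `magrpo` keys, and the feedback modes come from the list above. The nested keys `name` and `mode` and every value are hypothetical placeholders. `check_config` is an illustrative helper, not part of CoMLRL or this repo.

```python
from typing import Any, Dict, List

# Feedback modes listed for the `external` section.
FEEDBACK_MODES = {
    "token_report", "plain", "level_feedback", "group_feedback",
    "personal_feedback", "personal_detailed_feedback", "passed", "level_passed",
}
REQUIRED_SECTIONS = ("model", "dataset", "external", "magrpo", "output")

# Illustrative config. The keys "name" and "mode" and all values are placeholders.
example_config: Dict[str, Any] = {
    "model": {"name": "Qwen/Qwen2.5-Coder-3B-Instruct"},
    "dataset": {"name": "ClassEval", "train_split": "test[:50]", "eval_split": "test[50:]"},
    "external": {"mode": "token_report"},
    "magrpo": {"num_agents": 2, "num_generations": 4, "num_turns": 2, "rollout_buffer_size": 8},
    "output": {},
}


def check_config(cfg: Dict[str, Any]) -> List[str]:
    """Return human-readable problems with a config dict (empty if none found)."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in cfg]
    for key in ("train_split", "eval_split"):
        if key not in cfg.get("dataset", {}):
            problems.append(f"dataset.{key} is not set")
    mode = cfg.get("external", {}).get("mode")
    if mode not in FEEDBACK_MODES:
        problems.append(f"unknown external mode: {mode!r}")
    magrpo = cfg.get("magrpo", {})
    for key in ("num_agents", "num_generations", "num_turns"):
        value = magrpo.get(key)
        if not isinstance(value, int) or value < 1:
            problems.append(f"magrpo.{key} should be a positive integer, got {value!r}")
    return problems
```

Running a check like this before launching a job catches typos in the feedback mode or a missing split string early, rather than partway through training.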
`rewards/CE_reward.py` computes structured rewards:

- `lv1`: coverage of unique methods completed.
- `lv2`: penalizes under/over-allocation of total method picks.
- `lv3`: balance term encouraging an even workload across agents.
- `lv4`/`lv5`: syntax + unit-test bonuses (reported for analysis; syntax/test failures short-circuit the run where applicable).
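The first three reward levels can be pictured with the sketch below. The exact formulas in `rewards/CE_reward.py` are not given here, so each term uses an assumed form. Coverage is a fraction of distinct required methods, the allocation penalty is a normalized absolute gap, and balance is one minus the coefficient of variation of per-agent loads. The syntax and test bonuses (`lv4`/`lv5`) are left out.

```python
from statistics import pstdev
from typing import Dict, Sequence


def structural_rewards(methods: Sequence[str], picks: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Hypothetical lv1-lv3 terms for one rollout.

    methods: method names the class skeleton asks for.
    picks:   picks[i] is the list of methods agent i self-selected (TAKE_JOB).
    """
    target = set(methods)
    n = len(target)
    chosen = [m for agent in picks for m in agent if m in target]
    # lv1 (assumed form): fraction of distinct required methods that some agent took.
    lv1 = len(set(chosen)) / n if n else 0.0
    # lv2 (assumed form): penalty proportional to how far the total pick count is from n.
    total = sum(len(agent) for agent in picks)
    lv2 = -abs(total - n) / n if n else 0.0
    # lv3 (assumed form): 1.0 for identical loads, decreasing as loads diverge, floored at 0.
    loads = [len(agent) for agent in picks]
    mean = sum(loads) / len(loads) if loads else 0.0
    lv3 = max(0.0, 1.0 - pstdev(loads) / mean) if mean > 0 else 0.0
    return {"lv1": lv1, "lv2": lv2, "lv3": lv3}
```

For example, two agents that split four methods two-and-two score 1.0 on coverage and balance with no allocation penalty. A three-and-one split that duplicates a method loses both coverage and balance.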
- Tests execute inside per-sample temporary directories to avoid polluted state and are automatically terminated on timeout.
- Loggers are inherited from CoMLRL. Enable Weights & Biases by filling in `wandb.entity`, or disable it for offline debugging.
