Brief for AI coding agents working on RLinf. For full contribution flow, code style, and PR process see CONTRIBUTING.md.
Quick orientation: RLinf is a distributed RL stack (embodied + reasoning + agent). It uses Ray for process management and Hydra for config. Single-machine runs use cluster.num_nodes: 1; multi-node needs Ray started on every node with RLINF_NODE_RANK set before ray start. Pre-commit runs Ruff (lint + format) and commit-check; use Google-style docstrings and type hints. All user-facing changes need tests and docs. If something is unclear, add a TODO(agent) and note the limitation.
- `.cursor/` – Rules and skills: `rules/agents-md.mdc`, `skills/add-install-docker-ci-e2e`, `skills/add-example-doc-model-env`, `skills/review-pr`.
- `rlinf/` – Main package:
  - `agents/` – Agent logic (reasoning, tools).
  - `algorithms/` – Advantages, losses, registry, rewards (math, code, searchr1, vqa).
  - `config.py` – Hydra config, `SupportedModel`, `SupportedEnvType`, validation.
  - `data/` – Datasets for embodied, reasoning, agent.
  - `envs/` – ManiSkill, LIBERO, IsaacLab, CALVIN, MetaWorld, Behavior, RoboCasa, FrankaSim, RealWorld, RoboTwin, Habitat, OpenSora world model; `get_env_cls()` in `envs/__init__.py`.
  - `hybrid_engines/` – SGLang/vLLM rollout integration.
  - `models/` – Embodiment (OpenVLA, OpenVLA-OFT, OpenPI, GR00T, MLP/CNN/Flow/CMA) and reasoning wiring.
  - `runners/` – Embodied (sync/async), reasoning, coding_online_rl, agent, SFT, eval.
  - `scheduler/` – Cluster, Worker, WorkerGroup, channel, manager, placement, dynamic_scheduler.
  - `utils/` – Logging, placement, data iter, distributed, checkpoint, resharding.
  - `workers/` – Actor (FSDP/Megatron), rollout (HF/server), env (sync/async), reward, replay buffer.
- `examples/` – Entrypoints and YAML: embodiment, reasoning, coding_online_rl, searchr1, sft, wideseek_r1.
- `tests/` – `unit_tests/`, `e2e_tests/` (embodied, agent, reasoning), scheduler tests; e2e configs under `e2e_tests/embodied/*.yaml`.
- `requirements/` – `install.sh` (targets: embodied, reason, docs; `--model`, `--env`), optional deps in subdirs.
- `docker/` – Dockerfile and build targets per model/env.
- `ray_utils/` – `start_ray.sh` (multi-node head/worker), `check_ray.sh`, `realworld/setup_before_ray.sh`.
- `toolkits/` – Checkpoint converters, verifiers, eval scripts, replay buffer, auto-placement.
- `docs/` – Sphinx RST (EN/ZH): start, tutorials, examples, APIs, FAQ.
You launch one entry script (e.g. train_embodied_agent.py, train_async.py). It builds a Cluster (Ray must already be up), works out component placement (actor, rollout, env, reward, agent), and starts Worker groups. A Runner drives the loop: rollout → reward → advantage → actor update (plus any inference/engine lifecycle). Cluster config lives in YAML under `cluster:`: `num_nodes`, `component_placement`, `node_groups` (labels, node_ranks, env_configs, optional hardware such as Franka). Placement classes (e.g. HybridComponentPlacement, ModelParallelComponentPlacement) map components to node groups and hardware ranks. Workers are Ray remote actors with MASTER_*, RANK, etc.; they can send/recv across groups. Training backends: FSDP or Megatron. Rollout engines: SGLang or vLLM. Runners pick loss/advantage functions from config (PPO, GRPO, SAC, etc.).
Single machine: Install via Docker or bash requirements/install.sh embodied --model <model> --env <env> (set REPO_PATH and any asset paths). Ray may auto-start; or run ray start --head. Use a config with cluster.num_nodes: 1 (e.g. from examples/embodiment/config/). Launch with bash examples/embodiment/run_embodiment.sh <config_name> or python examples/embodiment/train_embodied_agent.py --config-name <config_name>, and set env vars the example needs (e.g. MUJOCO_GL=egl, ROBOT_PLATFORM).
Multiple machines: On each node, before running ray start, export RLINF_NODE_RANK=<0..N-1> (unique per node) and optionally RLINF_COMM_NET_DEVICES. Head: ray start --head --port=6379 --node-ip-address=<head_ip>. Workers: ray start --address=<head_ip>:6379. You can use ray_utils/start_ray.sh. Set cluster.num_nodes to the total node count; optionally use node_groups and component_placement (see rlinf/scheduler/cluster/config.py and the heterogeneous cluster tutorial). Run the entry script only on the head; it attaches to the existing Ray cluster and schedules workers by placement.
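For orientation, here is a sketch of the cluster-related keys named above, built as an OmegaConf dict (the config library underneath Hydra). The key names follow this brief; the nested value shapes (rank specs, node-group fields) are placeholders to check against `rlinf/scheduler/cluster/config.py` and the example YAMLs.

```python
# Sketch only: mirrors the cluster keys named in this brief (num_nodes,
# component_placement, node_groups). The nested shapes are illustrative
# placeholders; the authoritative schema is rlinf/scheduler/cluster/config.py.
from omegaconf import OmegaConf

cluster_cfg = OmegaConf.create(
    {
        "cluster": {
            "num_nodes": 2,  # total nodes; 1 for single-machine runs
            # Hypothetical placement spec: which components use which ranks.
            "component_placement": {
                "actor": "0-7",
                "rollout": "0-7",
                "env": "0-7",
            },
            # Optional: label nodes and attach per-group env configs.
            "node_groups": [
                {
                    "label": "train",
                    "node_ranks": [0, 1],
                    "env_configs": {"env_vars": {"MUJOCO_GL": "egl"}},
                }
            ],
        }
    }
)
print(OmegaConf.to_yaml(cluster_cfg))
```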
- Placement and throughput: Configure `cluster.component_placement` (collocated vs disaggregated vs hybrid, node groups, hardware ranks). See the placement tutorial and execution modes.
- OOM: Tune the env (`total_num_envs`, `group_size`), rollout (batch/seq length, `gpu_memory_utilization`, `enable_offload`), and actor (`micro_batch_size`, `global_batch_size`, `gradient_checkpointing`, `enable_offload`). Example configs live in `examples/embodiment/config/`. See the FAQ for SGLang/memory issues.
- Multi-node and hetero: Set `cluster.num_nodes`; set `RLINF_NODE_RANK` (and optionally `RLINF_COMM_NET_DEVICES`) before `ray start` on each node, since Ray captures the environment at start time. Optional `node_groups` and `component_placement` go in YAML; `env_configs` (e.g. `env_vars`, `python_interpreter_path`) are applied at worker allocation. See the heterogeneous cluster tutorial and `rlinf/scheduler/cluster/config.py`.
- Metrics: Runners use `MetricLogger`; set `runner.logger.logger_backends` (e.g. tensorboard, wandb, swanlab). Namespaces include `train/`, `eval/`, `env/`, `rollout/`, `time/`. See the logger tutorial.
- Checkpoints: Saved every `runner.save_interval` steps under `.../checkpoints/global_step_<N>/`. To resume, set `runner.resume_dir` to that path and relaunch; some runners support `resume_dir: auto`. See the checkpoint resume tutorial.
- Evaluation: During training, `runner.val_check_interval` triggers validation. Standalone embodied: `bash examples/embodiment/eval_embodiment.sh <config_name>` with an eval config; see VLA evaluation. For reasoning/LLM, see LLM evaluation. (A sketch of these runner keys follows this list.)
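A compact sketch of the runner-level knobs mentioned above, again as an OmegaConf dict; the key names come from this brief and the tutorials, and the values are placeholders rather than real defaults.

```python
# Sketch only: runner-level keys referenced above. Values are placeholders;
# see the logger / checkpoint-resume / evaluation tutorials for real defaults.
from omegaconf import OmegaConf

runner_cfg = OmegaConf.create(
    {
        "runner": {
            "logger": {"logger_backends": ["tensorboard", "wandb"]},
            "save_interval": 50,       # checkpoint every N steps
            "val_check_interval": 25,  # validate every N steps
            # To resume, point at a saved step (some runners accept "auto"):
            "resume_dir": ".../checkpoints/global_step_<N>",
        }
    }
)
```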
For debugging (breakpoints, rendering/EGL, network, NCCL/CUDA, timeouts), see the FAQ in Further reading.
Config (rlinf/config.py): build_config / validate_cfg produce the full DictConfig. New model or env types go into SupportedModel / SupportedEnvType and validation.
Cluster and placement: ClusterConfig and strategies in rlinf/scheduler/placement/, rlinf/utils/placement.py. Placement controls where actor/rollout/env run (one node vs many, GPU vs CPU, heterogeneous).
Algorithms: Advantage and loss functions are registered in rlinf/algorithms/ (registry + decorators); rewards are registered in rlinf/algorithms/rewards/. Config keys algorithm.adv_type and algorithm.loss_type select them. See Extending RLinf: algorithms, models, envs for step-by-step instructions.
Models (embodied): Register in SupportedModel in config.py, implement under rlinf/models/embodiment/<name>/ (e.g. BasePolicy), wire in config and workers. Use add-install-docker-ci-e2e for install/Docker/CI. Details in the extension section below.
Environments: Register in SupportedEnvType and get_env_cls() in rlinf/envs/__init__.py, implement under rlinf/envs/<name>/. Use add-install-docker-ci-e2e and add-example-doc-model-env for install and docs. Details below.
Workers: Subclass Worker, implement initialize and your API, and launch with create_group(...).launch(...). Use self.log_info / log_warning / log_error; never print (see the sketch below).
Runners: They own the training loop. New task type = new runner + entry script that builds Cluster, placement, worker groups, and calls the runner.
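To make the Workers item concrete, here is a minimal sketch assuming `Worker` is importable from `rlinf.scheduler`; that import path and the exact `create_group()`/`launch()` arguments are assumptions to verify against `rlinf/scheduler/` and the existing workers in `rlinf/workers/`.

```python
# Sketch only: the Worker pattern described above. Import path and
# create_group()/launch() arguments are assumptions; check rlinf/scheduler/
# and rlinf/workers/ for the real signatures.
from rlinf.scheduler import Worker  # assumed import path


class MyScoringWorker(Worker):
    def initialize(self, cfg):
        # One-time setup on the remote actor (load models, open connections, ...).
        self.cfg = cfg
        self.log_info("MyScoringWorker initialized")  # use self.log_*, never print

    def score(self, completions):
        # Worker-specific API, callable across worker groups.
        self.log_info(f"scoring {len(completions)} completions")
        return [0.0 for _ in completions]


# Launched by a runner/entry script, roughly:
# group = MyScoringWorker.create_group(...).launch(...)  # placement-specific args
# group.initialize(cfg)
```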
Advantage function
- Implement a function that takes the same keyword args as existing ones (e.g. `rewards`, `values`, `dones`, `gamma`, `loss_mask`, …) and returns `(advantages, returns)`. See `rlinf/algorithms/advantages.py` (e.g. `compute_gae_advantages_and_returns`) for signatures.
- Register it (sketched below): `from rlinf.algorithms.registry import register_advantage`, then `@register_advantage("my_adv")` on your function. The name is case-normalized to lowercase.
- In config YAML set `algorithm.adv_type: my_adv`. Actor workers call `calculate_adv_and_returns(adv_type=...)`, which dispatches via `get_adv_and_returns(name)`.
- For non-GAE styles (e.g. GRPO, Reinforce++), `rlinf/algorithms/utils.py` may need to compute scores first; check how `adv_type` is used in `calculate_adv_and_returns` and in the actor worker.
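A minimal registration sketch following the steps above; the decorator, kwargs, and return shape come from this section, while the function body is a placeholder rather than a real estimator.

```python
# Sketch only: register a custom advantage function as described above.
# The kwargs mirror existing functions in rlinf/algorithms/advantages.py;
# **kwargs absorbs anything else the dispatcher passes.
from rlinf.algorithms.registry import register_advantage


@register_advantage("my_adv")  # name is case-normalized to lowercase
def compute_my_adv_and_returns(rewards, values, dones, gamma=0.99, loss_mask=None, **kwargs):
    # Placeholder math; a real estimator would discount/bootstrap properly.
    returns = rewards.clone()
    advantages = returns - values
    if loss_mask is not None:
        advantages = advantages * loss_mask
    return advantages, returns
```

Select it with `algorithm.adv_type: my_adv` in the YAML.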
Policy loss
- Implement a function that accepts the kwargs passed by the actor (e.g. `logprobs`, `old_logprobs`, `advantages`, `clip_ratio_low`, `clip_ratio_high`, `loss_mask`, …) and returns `(loss_tensor, metrics_dict)`. See `rlinf/algorithms/losses.py` (e.g. `compute_ppo_actor_loss`, `compute_ppo_actor_critic_loss`, `compute_grpo_actor_loss_fn`).
- Register (sketched below): `from rlinf.algorithms.registry import register_policy_loss`, then `@register_policy_loss("my_loss")`.
- In config set `algorithm.loss_type: my_loss`. For PPO-style actor+critic you need a critic and a value loss; the unified entry is `policy_loss(loss_type=..., **kwargs)` in `registry.py`. Add validation in `rlinf/config.py` if your loss has special requirements (e.g. `validate_cfg` already checks `loss_type == "actor_critic"` for the value head).
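A matching sketch for the loss side; the decorator, kwargs, and `(loss_tensor, metrics_dict)` return come from this section, and the body is a placeholder PPO-style clipped objective for shape only.

```python
# Sketch only: register a custom policy loss as described above.
# Kwargs mirror those listed for existing losses in rlinf/algorithms/losses.py.
import torch

from rlinf.algorithms.registry import register_policy_loss


@register_policy_loss("my_loss")
def compute_my_policy_loss(
    logprobs,
    old_logprobs,
    advantages,
    clip_ratio_low=0.2,
    clip_ratio_high=0.2,
    loss_mask=None,
    **kwargs,
):
    # Placeholder clipped-ratio objective, for illustration only.
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_ratio_low, 1.0 + clip_ratio_high)
    loss = -torch.min(ratio * advantages, clipped * advantages)
    if loss_mask is not None:
        loss = loss * loss_mask
    loss = loss.mean()
    return loss, {"my_loss/ratio_mean": ratio.mean().item()}
```

Enable it with `algorithm.loss_type: my_loss` in the YAML.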
Reward
- Add a reward class (e.g. under `rlinf/algorithms/rewards/<domain>/`) that matches the interface expected by the reward worker (e.g. a callable, or a class with a clear contract for prompts/completions/ids).
- In `rlinf/algorithms/rewards/__init__.py`: import the class, then `register_reward("my_reward", MyRewardClass)` (sketched below). The registry is `reward_registry`; look up via `get_reward_class(name)`.
- Wire the reward name in config and in the runner/reward worker so the correct class is instantiated and used. For reasoning/agent tasks the config path may be under `reward.path` or similar.
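A small sketch of the reward class and its registration. The `register_reward` call form comes from this section, but its import location and the exact `__call__` contract (prompts/completions) are assumptions to verify against the reward worker and `rlinf/algorithms/rewards/__init__.py`.

```python
# Sketch only: a reward class plus registration, following the steps above.
# The import path and the __call__ signature are assumptions; match whatever
# the reward worker actually passes before relying on them.
from rlinf.algorithms.registry import register_reward  # assumed location


class MyReward:
    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, prompts, completions, **kwargs):
        # Placeholder scoring: slightly favor longer completions.
        return [min(len(c) / 1000.0, 1.0) for c in completions]


# In rlinf/algorithms/rewards/__init__.py:
register_reward("my_reward", MyReward)
```

Then wire the name through the runner/reward-worker config as described above.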
Model (embodied)
- Registration: In `rlinf/config.py`, add a new value to the `SupportedModel` enum: `MY_MODEL = ("my_model", "embodied")`. Use `get_supported_model(model_type)` in validation so `model.model_type: my_model` is accepted.
- Implementation: Create a package under `rlinf/models/embodiment/my_model/` (a stub is sketched below). For policies that fit the embodied actor interface, inherit from `rlinf.models.embodiment.base_policy.BasePolicy` and implement `default_forward` and `predict_action_batch`; add other forward types (e.g. `sac_forward`, `crossq_forward`) if the algorithm needs them. For HuggingFace-based VLAs, follow the pattern in the docs: register the config and processor in `rlinf/models/__init__.py` (`get_model_config_and_processor`), then implement an action model that wraps generation and an optional value head.
- Config and workers: Ensure `build_config` / default configs provide the right `model.model_type`, checkpoint paths, and any model-specific options. Actor and rollout workers already branch on `cfg.actor.model.model_type` / `cfg.rollout.model.model_type`; add branches or a factory so your model is instantiated and used. For FSDP+HuggingFace, see the new model (FSDP) tutorial; for Megatron there is a separate new model (Megatron) tutorial.
- Install and CI: If the model needs extra deps or a dedicated venv, add it to `requirements/install.sh` (e.g. `SUPPORTED_MODELS`, plus an `install_my_model()` or a branch in the model switch). For Docker and e2e, use the skill `.cursor/skills/add-install-docker-ci-e2e` (install script, Dockerfile stage, CI job, e2e config under `tests/e2e_tests/embodied/`).
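A stub of the two code-level touch points above. The enum member shape and the `BasePolicy` method names come from this brief; the method signatures are placeholders to fill in by mirroring an existing policy under `rlinf/models/embodiment/`.

```python
# Sketch only: new embodied model touch points.
# 1) rlinf/config.py -- add the enum member (shown as a comment):
# class SupportedModel(Enum):
#     ...
#     MY_MODEL = ("my_model", "embodied")

# 2) rlinf/models/embodiment/my_model/policy.py -- implement the policy.
from rlinf.models.embodiment.base_policy import BasePolicy


class MyModelPolicy(BasePolicy):
    def default_forward(self, batch, **kwargs):
        # Training-time forward (logprobs/values/...); mirror an existing policy.
        raise NotImplementedError

    def predict_action_batch(self, obs_batch, **kwargs):
        # Rollout-time action prediction for the env workers.
        raise NotImplementedError
```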
Environment
- Registration: In `rlinf/envs/__init__.py`, add a member to `SupportedEnvType`, e.g. `MY_ENV = "my_env"`. In `get_env_cls(env_type, env_cfg=None, ...)` add an `elif env_type == SupportedEnvType.MY_ENV:` branch that imports your env class and returns it (lazy import, to avoid loading heavy deps at import time). If the env needs a task id (like IsaacLab), use `env_cfg` and document the expected shape.
- Implementation: Create `rlinf/envs/my_env/` with at least one module defining a gym-style env (e.g. `gymnasium.Env`): `reset`, `step`, and the usual attributes (`observation_space`, `action_space`); a skeleton is sketched below. Follow the new environment tutorial for the expected structure (e.g. vectorized `num_envs`, `group_size`, `ret_device`). If your env uses custom action formatting, add a branch in `rlinf/envs/action_utils.py` in `prepare_actions(env_type, ...)` so rollout/workers pass correctly shaped actions.
- Config: Set `env.train.env_type` and `env.eval.env_type` to the string value of your enum (e.g. `my_env`). Add any env-specific defaults or validation in `rlinf/config.py` (`validate_cfg` already has env-specific checks for ManiSkill, Behavior, etc.; add similar ones if needed).
- Install and docs: For install/Docker/CI, use `.cursor/skills/add-install-docker-ci-e2e` (add the env to `SUPPORTED_ENVS`, install logic, e2e config). For example docs and RST, use `.cursor/skills/add-example-doc-model-env`.
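A skeleton for the steps above: a gymnasium env plus the registration edits, the latter shown as comments because their surrounding code lives in `rlinf/envs/__init__.py`. Spaces, attribute names, and the vectorization fields are placeholders to align with the new environment tutorial.

```python
# Sketch only: gym-style env and the lazy-import branch described above.
import gymnasium as gym


class MyEnv(gym.Env):
    def __init__(self, num_envs=1, group_size=1, **kwargs):
        self.num_envs = num_envs
        self.group_size = group_size
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
        self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,))

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()
        reward, terminated, truncated, info = 0.0, False, False, {}
        return obs, reward, terminated, truncated, info


# In rlinf/envs/__init__.py (illustrative):
# class SupportedEnvType(Enum):
#     ...
#     MY_ENV = "my_env"
#
# def get_env_cls(env_type, env_cfg=None, ...):
#     ...
#     elif env_type == SupportedEnvType.MY_ENV:
#         from rlinf.envs.my_env.my_env import MyEnv  # lazy import
#         return MyEnv
```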
Google Python style; Ruff for lint/format; docstrings and type hints on public APIs. Logging: rlinf.utils.logging.get_logger() or Workers' self.log_*. Config YAML: static values only; no computed fields; don't overwrite user-facing fields in code. Commits: Conventional Commits, ~72-char subject, imperative mood; every commit must carry a Signed-off-by: trailer (e.g. git commit -s). PRs: same title format, fill in the template, link issues; for perf-sensitive changes include test results. New behavior needs tests (unit or e2e); if e2e needs GPUs/hardware, document it and skip appropriately in CI. Full details: CONTRIBUTING.md.
- Docs (English · Chinese)
- Installation · VLA quickstart
- Example gallery · configs in `examples/embodiment/config/`, `examples/reasoning/`, etc.
- Tutorials: placement / cluster / YAML, hybrid / disaggregated, heterogeneous cluster, extend (new env/model), RL algorithms, logger (metrics), checkpoint resume
- Evaluation: VLA evaluation · LLM evaluation
- APIs (actor, channel, cluster, placement, worker, env, data, …) · FAQ