Can evolutionary model merging, applied to LLMs (Akiba et al., 2025), be transferred to physical World Models?
This project tests that hypothesis using two 1D physical phenomena — heat diffusion and the Burgers equation — as elementary World Models, with advection-diffusion as the compound target phenomenon that neither source model can represent alone.
Akiba et al. (2025) demonstrated that merging LLMs specialized in different domains (Japanese language + math reasoning) via evolutionary search over weight combinations produces a model with emergent cross-domain capability, without any additional training. The key enabler was a shared base model (Mistral-7B-v0.1) guaranteeing latent space compatibility.
This project asks: can the same principle apply when the "domains" are physical phenomena governed by different PDEs?
The structural analogy:
| LLM merging (original) | Physics WM merging (this project) |
|---|---|
| Mistral-7B-v0.1 base | JEPA physics encoder (pretrained, frozen) |
| Japanese LLM fine-tune | WM_heat (∂T/∂t = α∂²T/∂x²) |
| Math LLM fine-tune | WM_burgers (∂u/∂t + u∂u/∂x = ν∂²u/∂x²) |
| Japanese Math LLM (merged) | WM_merged (advection-diffusion, compound) |
| MGSM benchmark | Péclet sweep (Pe ∈ {0.1, 1, 10, 100}) |
The JEPA pretraining strategy (no reconstruction loss) is used as a substitute for the shared base: by removing the decoder, the encoder learns latent representations abstracted from phenomenon-specific textures, enabling geometric compatibility across WM_heat and WM_burgers.
Heat equation (parabolic, smooth, diffusion-dominated)
∂T/∂t = α ∂²T/∂x²
x ∈ [0,1], α ∈ [0.01, 0.5], 100 time steps per trajectory
Burgers equation (hyperbolic, shock-forming, advection-dominated)
∂u/∂t + u ∂u/∂x = ν ∂²u/∂x²
x ∈ [0,1], ν ∈ [0.001, 0.1], 100 time steps per trajectory
Advection-diffusion equation
∂u/∂t + U ∂u/∂x = α ∂²u/∂x²
Péclet number Pe = UL/α ∈ {0.1, 1, 10, 100}
- Pe → 0: diffusion dominates; the dynamics approach the heat equation.
- Pe → ∞: advection dominates, the regime WM_burgers specializes in (the linear equation tends to pure advection rather than literally to the nonlinear Burgers equation).
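For reference, a minimal NumPy sketch of an explicit solver for this equation (first-order upwind advection plus central diffusion on a periodic grid; grid spacing, time step, and coefficient values are illustrative, not the project's configuration):

```python
import numpy as np

def advect_diffuse(u0, U=1.0, alpha=0.01, dx=1/128, dt=1e-4, steps=100):
    """Explicit upwind advection + central diffusion on a periodic 1D grid.

    Stability requires dt <= dx**2 / (2*alpha) and dt <= dx / |U| (CFL).
    """
    u = u0.copy()
    for _ in range(steps):
        adv = -U * (u - np.roll(u, 1)) / dx                        # upwind, U > 0
        diff = alpha * (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
        u = u + dt * (adv + diff)
    return u

def peclet(U, alpha, L=1.0):
    """Péclet number Pe = U L / α for a domain of length L."""
    return U * L / alpha
```

With these illustrative values, `peclet(1.0, 0.01)` gives Pe = 100, the most advection-dominated point of the sweep.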
The Péclet sweep is the primary evaluation axis. A successful merge should achieve lower rollout error than either source WM across the full Pe range, with the CMA-ES-recovered mixing coefficients varying monotonically with Pe.
```
Input field u(x, t)   [batch × T_context × N_grid]
          │
          ▼
┌─────────────────────┐
│  Encoder (frozen)   │  1D spatial transformer
│  θ_enc              │  shared, pretrained on mixed data
└────────┬────────────┘
         │  z_t  [batch × d_latent]
         ▼
┌─────────────────────┐
│  Predictor          │  GRU or Transformer
│  θ_pred_heat        │  phenomenon-specific, fine-tuned
│  θ_pred_burgers     │  ← merge target (PS + DFS)
└────────┬────────────┘
         │  ẑ_{t+k}
         ▼
Rollout loss (latent-space MSE only, no decoder)
```
The encoder is pretrained with JEPA temporal causal masking: given context frames u(x, t-n)...u(x, t), predict the latent representation of u(x, t+k) without reconstructing the pixel field. The EMA target encoder prevents representation collapse.
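Schematically, the EMA update and a single JEPA training step look like the following (a minimal PyTorch sketch with hypothetical module and argument names; the actual training loop lives in src/train/):

```python
import torch

@torch.no_grad()
def ema_update(target_enc, online_enc, tau=0.996):
    """Polyak-average the online encoder weights into the EMA target encoder."""
    for p_t, p_o in zip(target_enc.parameters(), online_enc.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1 - tau)

def jepa_step(online_enc, target_enc, predictor, ctx, future):
    """One JEPA step: predict the *latent* of a future frame, no pixel decoder.

    ctx:    [batch, T_context, N_grid] context frames
    future: [batch, N_grid] frame at t+k
    """
    z_ctx = online_enc(ctx)                       # [batch, d_latent]
    with torch.no_grad():
        z_tgt = target_enc(future.unsqueeze(1))   # stop-gradient target
    z_hat = predictor(z_ctx)
    return torch.nn.functional.mse_loss(z_hat, z_tgt)
```

The stop-gradient target plus the slow EMA update are the two ingredients that prevent the latent representation from collapsing to a constant.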
Three merge configurations are evaluated:
PS (Parameter Space) merge
- DARE-TIES sparsification applied to each Predictor's task vector
- CMA-ES (via Optuna) optimizes per-layer mixing coefficients λ_i
- Fitness: rollout MSE on a small advection-diffusion validation split
- Search space: 2 × n_layers scalar parameters
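A minimal sketch of the PS merge step, assuming plain DARE drop-and-rescale on the task vectors (the TIES sign-election step is omitted for brevity) and illustrative state-dict and coefficient names:

```python
import torch

def dare_sparsify(task_vec, drop_p=0.9):
    """DARE: randomly drop task-vector entries, rescale survivors by 1/(1-p)."""
    mask = torch.rand(task_vec.shape) >= drop_p
    return task_vec * mask / (1.0 - drop_p)

def ps_merge(base, heat, burgers, lam, drop_p=0.9):
    """Per-layer weighted combination of sparsified task vectors.

    base/heat/burgers: dicts of layer name -> weight tensor (Predictor state dicts)
    lam: dict of layer name -> (lam_heat, lam_burgers), the CMA-ES search space
    """
    merged = {}
    for name, w0 in base.items():
        tv_h = dare_sparsify(heat[name] - w0, drop_p)
        tv_b = dare_sparsify(burgers[name] - w0, drop_p)
        lh, lb = lam[name]
        merged[name] = w0 + lh * tv_h + lb * tv_b
    return merged
```

CMA-ES proposes the flat vector of (lam_heat, lam_burgers) pairs, the merged Predictor is rolled out on the advection-diffusion validation split, and the rollout MSE is returned as fitness.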
DFS (Data Flow Space) merge
- Predictor layer weights kept intact
- CMA-ES searches for optimal layer sequence across both Predictors
- Indicator array I ∈ {0,1}^T, with T = M × r (M = total layers across both Predictors, r = 3 repetitions)
- Scaling matrix W_ij corrects distribution shift between consecutive layers
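The DFS routing can be sketched as a forward pass through the concatenated layer pool, gated by the indicator array (hypothetical function and argument names; the scaling matrix is simplified here to a per-slot scalar):

```python
def dfs_forward(z, layer_pool, indicator, scale):
    """Route a latent through the layer pool, visiting up to T = M * r slots.

    layer_pool: list of M layers (WM_heat Predictor layers, then WM_burgers)
    indicator:  0/1 sequence of length T; slot i is used iff indicator[i] == 1
    scale:      per-slot scalar correcting inter-layer distribution shift
    """
    M = len(layer_pool)
    for i, use in enumerate(indicator):
        if use:
            z = scale[i] * layer_pool[i % M](z)
    return z
```

CMA-ES then searches jointly over the indicator bits and the scaling values, with rollout MSE on advection-diffusion data as fitness, exactly as in the PS case.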
PS + DFS (combined)
- PS merge first → intermediate merged Predictor
- DFS applied with intermediate + WM_burgers Predictor
This project uses a two-agent structure under Claude Code:
Main Agent (orchestration, training, merge, evaluation)
Handles all model code, training loops, CMA-ES optimization, and evaluation. Reads simulation data produced by the sub-agent from data/. See CLAUDE.md for full implementation instructions.
Simulation Sub-agent (numerical data generation)
Invoked by the main agent via:

```shell
claude -p "$(cat prompts/subagent_sim.txt)" --output-format json
```

Responsible exclusively for generating .npy trajectory files and writing data/sim_manifest.json with physical validation results. Has no access to model code or checkpoints.
The separation is deliberate: numerical simulation (finite difference schemes, physical validation) is a self-contained, stateless task that benefits from isolated execution and deterministic outputs. The main agent reads only validated, checksummed data from the sub-agent.
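A sketch of how the main agent might consume the manifest, assuming illustrative entry fields ("file", "sha256"); the actual schema is defined by prompts/subagent_sim.txt, not here:

```python
import hashlib
import json
from pathlib import Path

def load_validated(manifest_path="data/sim_manifest.json"):
    """Load trajectory paths from the sub-agent manifest, verifying checksums.

    Raises ValueError if any file does not match its recorded SHA-256 digest,
    so the main agent never trains on partially written or stale data.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    paths = []
    for entry in manifest["trajectories"]:
        path = Path(entry["file"])
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            raise ValueError(f"checksum mismatch: {path}")
        paths.append(path)
    return paths
```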
| Metric | Formula | Significance |
|---|---|---|
| Short-horizon rollout MSE | ‖û_{t+5} − u_{t+5}‖² | Basic predictive accuracy |
| Long-horizon rollout MSE | ‖û_{t+50} − u_{t+50}‖² | Stability and error accumulation |
| Energy spectrum error | ‖Ê(k) − E(k)‖₂ / ‖E(k)‖₂ | Correct representation of diffusion and advection scales |
| Conservation residual | \|∫û dx − ∫u dx\| | Preservation of the conserved quantity |
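The spectral and conservation metrics can be computed directly with NumPy; a sketch assuming a real-valued field on a uniform periodic grid (function names are illustrative, the project's versions live in src/eval/):

```python
import numpy as np

def energy_spectrum(u):
    """1D energy spectrum E(k) = |FFT(u)|² over non-negative wavenumbers."""
    return np.abs(np.fft.rfft(u)) ** 2

def spectrum_error(u_hat, u_true):
    """Relative L2 error between predicted and true energy spectra."""
    e_hat, e_true = energy_spectrum(u_hat), energy_spectrum(u_true)
    return np.linalg.norm(e_hat - e_true) / np.linalg.norm(e_true)

def conservation_residual(u_hat, u_true, dx):
    """|∫û dx − ∫u dx|, with the integrals taken as simple Riemann sums."""
    return abs(np.sum(u_hat) * dx - np.sum(u_true) * dx)
```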
The Péclet sweep (Pe ∈ {0.1, 1, 10, 100}) provides the primary analysis axis. The secondary analysis examines whether the CMA-ES-recovered mixing coefficients λ_heat(Pe) increase monotonically as Pe → 0 — evidence that evolutionary search implicitly estimates the physical regime.
Primary (merge works):
MSE_merged(Pe) < min(MSE_heat(Pe), MSE_burgers(Pe)) for all Pe tested
Secondary (physical interpretation):
corr(λ_heat, 1/Pe) > 0.9
i.e., the heat Predictor's weight grows as the problem becomes more diffusion-dominated.
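The secondary criterion reduces to a plain Pearson correlation across the sweep; a small NumPy sketch (illustrative function name):

```python
import numpy as np

def regime_correlation(lam_heat, peclet_values):
    """Pearson correlation between the recovered heat-mixing weights and 1/Pe."""
    return np.corrcoef(lam_heat, 1.0 / np.asarray(peclet_values))[0, 1]
```

A merge passes the secondary criterion when this value exceeds 0.9 over Pe ∈ {0.1, 1, 10, 100}.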
Tertiary (DFS structural hypothesis): The DFS-discovered layer sequence should begin with WM_heat layers (smooth early diffusion) and transition to WM_burgers layers (shock formation) — a data-driven recovery of the operator-splitting structure of advection-diffusion solvers.
```shell
git clone https://github.com/yourorg/evophyswm
cd evophyswm
pip install -e ".[dev]"
```

Requirements: Python 3.11+, PyTorch 2.x, NumPy, SciPy, Optuna, tqdm.
```shell
# Step 1: Generate simulation data (sub-agent)
bash scripts/run_subagent_sim.sh

# Steps 2–4: Full training + merge + evaluation pipeline
bash scripts/run_full_pipeline.sh

# Results
cat outputs/results.json
```

Expected runtime on a single A100 (or equivalent): ~6 hours for the full pipeline.
```
evophyswm/
├── CLAUDE.md          ← agent instructions and architecture spec
├── README.md          ← this file
├── pyproject.toml
├── configs/
│   ├── base.yaml      ← JEPA pretraining config
│   ├── finetune.yaml  ← phenomenon fine-tuning config
│   ├── merge.yaml     ← CMA-ES merge config
│   └── eval.yaml      ← Péclet sweep config
├── prompts/
│   └── subagent_sim.txt  ← sub-agent dispatch prompt template
├── data/              ← populated by sub-agent (not committed)
├── src/
│   ├── models/        ← encoder, predictor, world_model
│   ├── train/         ← pretrain_base, finetune
│   ├── merge/         ← ps_merge, dfs_merge, fitness
│   └── eval/          ← metrics, peclet_sweep
├── checkpoints/       ← model checkpoints (not committed)
├── outputs/           ← results and figures
└── scripts/
    ├── run_full_pipeline.sh
    ├── run_subagent_sim.sh
    └── run_eval.sh
```
- Akiba T. et al. "Evolutionary optimization of model merging recipes." Nature Machine Intelligence 7, 195–204 (2025). https://doi.org/10.1038/s42256-024-00975-8
- Assran M. et al. "Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture." CVPR 2023. (I-JEPA)
- Hafner D. et al. "Dream to Control: Learning Behaviors by Latent Imagination." ICLR 2020. (RSSM)