You are an autonomous ML researcher optimizing a biologically-constrained language model for the OpenAI Parameter Golf challenge. Your goal is to minimize val_bpb (validation bits per byte) — lower is better.
The current best configuration achieves val_bpb = 1.9935 with:
- 6 connectome hops, skip_every=2, cross_attn_every=2, head_dim=128
- 1.64M params, 1.48MB artifact, trained in 8.5 min on 8×H100
- AdamW lr=1e-3, cosine schedule, warmup=100, 5000 steps
- Create a branch from current main: `git checkout -b autoresearch/<tag>`
- Read the in-scope files:
  - `train_connectome_jepa.py` — the file you modify. Model architecture, training loop, hyperparameters. Everything is fair game.
  - This `program.md` — your instructions. Do not modify.
- Data lives on a Modal volume `parameter-golf-data` at `/data/fineweb10B_sp1024/` and `/data/tokenizers/`. It is already downloaded.
- Initialize `results.tsv` with a header row: `experiment_id\tdescription\tval_bpb\tval_loss\tparams\tartifact_bytes\tsteps\tstep_ms\tstatus`
- Confirm setup looks good. Once confirmed, kick off experimentation.
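The `results.tsv` initialization above can be sketched in Python. The column names come from the header row specified above; the helper name is illustrative, not part of the repo.

```python
# Sketch: initialize results.tsv with the tab-separated header row.
# Column names are taken from the instructions; init_results_tsv is a
# hypothetical helper name.
from pathlib import Path

COLUMNS = [
    "experiment_id", "description", "val_bpb", "val_loss",
    "params", "artifact_bytes", "steps", "step_ms", "status",
]

def init_results_tsv(path="results.tsv"):
    """Write the header row only if the file does not already exist."""
    p = Path(path)
    if not p.exists():
        p.write_text("\t".join(COLUMNS) + "\n")
    return p
```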
Each experiment runs on Modal with 8×H100 using a fixed budget of 1000 steps (GRAD_ACCUM_STEPS=1). This takes approximately 100-120 seconds wall time, allowing rapid iteration.
1. Hypothesize: Write a one-line description of what you're testing and why it should help.
2. Modify: Edit `train_connectome_jepa.py` with your experimental change. Make focused, single-variable changes. Commit with message: `exp: <description>`.
3. Run the experiment on Modal:

   ```
   cd /home/datum/parameter-golf && \
   modal run run_modal_exp.py -- autoresearch_exp_<N> [KEY=VALUE ...]
   ```

   Any KEY=VALUE pairs override the defaults (which match the current best 6-hop config). Examples:

   ```
   # Test 8 hops
   modal run run_modal_exp.py -- exp_01 NUM_HOPS=8

   # Test wider embedding
   modal run run_modal_exp.py -- exp_02 EMBED_DIM=768

   # Test more attention heads
   modal run run_modal_exp.py -- exp_03 NUM_ATTN_HEADS=4
   ```

4. Read results: The Modal runner returns a structured dict with `val_bpb`, `val_loss`, `params`, `artifact_bytes`, `step_ms`, `exit_code`.
   - If `exit_code` is non-zero, the run crashed. Check the `error` field and attempt a fix. Give up after 3 attempts and revert.
5. Record in `results.tsv`: experiment_id, description, val_bpb, val_loss, params, artifact_bytes, steps, step_ms, status (kept/reverted/crashed).
6. Decision:
   - If `val_bpb` improved (lower than current best): keep the commit, update current best.
   - If `val_bpb` is equal or worse: revert with `git reset --hard HEAD~1`.
7. Repeat with a new hypothesis.
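The record-and-decide logic described above can be sketched as follows. The result-dict keys (`val_bpb`, `exit_code`, etc.) are those the Modal runner returns; the function name and the `.get()` defaults are illustrative.

```python
# Sketch of the record-and-decide step: append one TSV row per experiment
# and update the running best only when val_bpb strictly improves.
# record_and_decide is a hypothetical helper; the dict keys match the
# runner's documented return fields.

def record_and_decide(result, current_best_bpb, results_path="results.tsv"):
    """Append one TSV row and return (new_best_bpb, status)."""
    if result.get("exit_code", 1) != 0:
        status = "crashed"                      # keep best unchanged
        new_best = current_best_bpb
    elif result["val_bpb"] < current_best_bpb:  # lower is better
        status = "kept"
        new_best = result["val_bpb"]
    else:
        status = "reverted"                     # git reset --hard HEAD~1
        new_best = current_best_bpb

    row = [
        str(result.get(k, ""))
        for k in ("experiment_id", "description", "val_bpb", "val_loss",
                  "params", "artifact_bytes", "steps", "step_ms")
    ] + [status]
    with open(results_path, "a") as f:
        f.write("\t".join(row) + "\n")
    return new_best, status
```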
- Git repo: `https://github.com/ericdatum/connectome-golf.git` (remote: `origin`)
- Upstream: `https://github.com/openai/parameter-golf.git` (remote: `upstream`)
- Work on branch `autoresearch/<tag>`, push periodically to origin.
- Artifact size must be under 16,000,000 bytes (code + compressed model). Currently at 1.48MB so you have massive headroom.
- Do NOT modify the adjacency matrix data (ADJ_B64). The C. elegans connectome wiring is the core scientific claim. You can change how it's used but not what it is.
- Do NOT remove the JEPA or SigREG losses. You can change their weights (including setting very low) but the losses must remain in the code. They are part of the submission's identity.
- Do NOT replace MaskedLinear with dense layers. The sparse connectome constraint is the point. You can add dense layers elsewhere (projections, attention, etc.) but the hop layers must use the connectome mask.
- The model must produce valid val_bpb via the existing evaluation code. Don't modify the eval/BPB calculation.
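The 16,000,000-byte artifact cap above is easy to verify before submission. A minimal sketch, assuming the artifact is measured as the sum of gzip-compressed file sizes; the file paths and helper name are hypothetical.

```python
# Sketch: check that code + compressed model stay under the artifact cap.
# ARTIFACT_LIMIT comes from the constraints above; everything else here
# is illustrative.
import gzip

ARTIFACT_LIMIT = 16_000_000  # bytes (code + compressed model)

def check_artifact_size(paths):
    """Sum gzip-compressed sizes of the given files; return (total, ok)."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(gzip.compress(f.read()))
    return total, total < ARTIFACT_LIMIT
```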
- Number of hops: Try 4, 8, 10, 12. With skip connections, deeper may now work. Current best is 6.
- Skip connection frequency: Try skip_every=1 (every hop), skip_every=3, skip_every=0 (disabled).
- Cross-attention frequency: Try cross_attn_every=1 (every hop), cross_attn_every=3.
- Multi-head attention: Try num_attn_heads=2 or 4 (currently 1).
- Head dimension: Try head_dim=64, 256 (currently 128).
- Parallel connectome channels: Run multiple independent connectome circuits with shared masks. Try num_channels=2 or 4.
- Wider embedding: Try embed_dim=768 or 1024 (will increase params but you have massive headroom).
- Learning rate: Try 3e-4, 5e-4, 2e-3, 3e-3.
- Weight decay: Try 0.01, 0.05, 0.2.
- JEPA weight: Try 0.1, 0.25, 1.0 (currently 0.5).
- SigREG weight: Try 1e-3, 1e-5 (currently 1e-4).
- Mask ratio: Try 0.15, 0.5 (currently 0.3).
- Sequence length: Try TRAIN_SEQ_LEN=2048.
- Batch size: Try TRAIN_BATCH_TOKENS=1048576 (2× current).
- Activation function: Try SiLU, LeakyReLU, ReLU² instead of GELU in the connectome hops.
- Learnable mask refinement: Add a small learnable perturbation to the binary connectome mask (sigmoid(logits) * base_mask) so the model can slightly adjust connection strengths while preserving topology.
- Bidirectional connectome: Add the transpose of the adjacency matrix as a second mask, so information flows both directions through the connectome.
- Depth-wise scaling: Scale hop residuals differently at different depths (e.g., smaller early, larger late).
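Of the architectural ideas above, the learnable mask refinement is the most delicate to get right, since the topology constraint must survive. A minimal PyTorch sketch, assuming a `MaskedLinear`-style layer that stores a fixed binary mask as a buffer; the class and attribute names are illustrative, not the repo's actual API.

```python
# Sketch of "learnable mask refinement": a per-connection gate
# sigmoid(logits) scales the fixed binary connectome mask, so absent
# edges stay exactly zero (topology preserved) while present edges get
# learnable strengths. Names here are illustrative.
import torch
import torch.nn as nn

class RefinedMaskedLinear(nn.Module):
    def __init__(self, base_mask: torch.Tensor):
        super().__init__()
        out_f, in_f = base_mask.shape
        self.register_buffer("base_mask", base_mask.float())  # fixed topology
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        # Logits start at 0 -> sigmoid = 0.5, a uniform initial gate.
        self.gate_logits = nn.Parameter(torch.zeros(out_f, in_f))

    def forward(self, x):
        # Gate lies in (0, 1); multiplying by base_mask keeps masked-out
        # connections at exactly zero, so only strengths are refined.
        effective_mask = torch.sigmoid(self.gate_logits) * self.base_mask
        return nn.functional.linear(x, self.weight * effective_mask)
```

Because the gate is bounded in (0, 1) and multiplied by the binary mask, no gradient step can create a connection the connectome does not have, which keeps the change compatible with the ADJ_B64 constraint.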
- Make one change at a time. Multi-variable changes make it impossible to attribute improvements.
- Explore broadly first, then exploit. Try one experiment from each high-priority category before going deep on any single direction.
- Respect the parameter budget. At 1.64M params you're using <10% of the 16MB budget. You can go 10× bigger if it helps. But bigger isn't always better — the whole point of this architecture is parameter efficiency.
- Watch for the gradient flow problem. If you increase depth without proper skip connections, the model will stop learning (CE stays at ~6.93). This is the hardest-won lesson from development. If you see CE not dropping in the first 50 steps, the architecture is broken.
- The cross-position attention is where most params live. The connectome hops themselves are tiny (~5k weights each). The attention layers are ~100k each. If you're budget-constrained, add more hops before adding more attention.
- 1000 steps is enough to see trends but not convergence. A change that looks +0.05 BPB better at 1000 steps may be +0.1 better at 5000 steps, or it may plateau. Use 1000-step runs for screening, then do a full 5000-step run on the best config.
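The gradient-flow check above can be automated as an early-abort screen. Note that the ~6.93 plateau is consistent with ln(1024) ≈ 6.931, the cross-entropy of a uniform distribution over a 1024-entry vocabulary (a plausible reading given the `sp1024` tokenizer, though the source does not state the vocab size); the threshold, step count, and function name below are illustrative.

```python
# Sketch: detect the "CE stuck at ~6.93" failure mode described above.
# UNIFORM_CE = ln(1024) is an assumption based on the sp1024 tokenizer;
# architecture_is_broken is a hypothetical helper.
import math

UNIFORM_CE = math.log(1024)  # ~6.93

def architecture_is_broken(ce_history, check_step=50, margin=0.05):
    """True if cross-entropy has not dropped meaningfully below the
    uniform baseline by check_step (the first-50-steps rule of thumb)."""
    if len(ce_history) < check_step:
        return False  # too early to tell
    return ce_history[check_step - 1] > UNIFORM_CE - margin
```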
After each session, summarize:
- Total experiments run
- Experiments kept vs reverted
- Best val_bpb achieved and the configuration that produced it
- Top 3 most impactful changes (positive or negative)
- Recommended next directions
Leave results.tsv and run.log files untracked by git.