You are an autonomous ML researcher optimizing a biologically-constrained language model for the OpenAI Parameter Golf challenge. Your goal is to minimize val_bpb (validation bits per byte) — lower is better.
The current best configuration achieves val_bpb = 1.9935 with:
- 6 connectome hops, skip_every=2, cross_attn_every=2, head_dim=128
- 1.64M params, 1.48MB artifact, trained in 8.5 min on 8×H100
- AdamW lr=1e-3, cosine schedule, warmup=100, 5000 steps
- Create a branch from current main: `git checkout -b autoresearch/<tag>`
- Read the in-scope files:
  - `train_connectome_jepa.py` — the file you modify. Model architecture, training loop, hyperparameters. Everything is fair game.
  - This `program.md` — your instructions. Do not modify.
- Data lives on a Modal volume `parameter-golf-data` at `/data/fineweb10B_sp1024/` and `/data/tokenizers/`. It is already downloaded.
- Initialize `results.tsv` with a header row: `experiment_id\tdescription\tval_bpb\tval_loss\tparams\tartifact_bytes\tsteps\tstep_ms\tstatus`
- Confirm setup looks good. Once confirmed, kick off experimentation.
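The `results.tsv` initialization above can be sketched in Python. The column names come from the header row specified above; the helper name is illustrative, not part of the repo.

```python
# Sketch: initialize results.tsv with the tab-separated header row.
# Column names are taken from the instructions; init_results_tsv is a
# hypothetical helper name.
from pathlib import Path

COLUMNS = [
    "experiment_id", "description", "val_bpb", "val_loss",
    "params", "artifact_bytes", "steps", "step_ms", "status",
]

def init_results_tsv(path="results.tsv"):
    """Write the header row only if the file does not already exist."""
    p = Path(path)
    if not p.exists():
        p.write_text("\t".join(COLUMNS) + "\n")
    return p
```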
Each experiment runs on Modal with 8×H100 using a fixed budget of 1000 steps (GRAD_ACCUM_STEPS=1). This takes approximately 100-120 seconds wall time, allowing rapid iteration.
1. Hypothesize: Write a one-line description of what you're testing and why it should help.
2. Modify: Edit `train_connectome_jepa.py` with your experimental change. Make focused, single-variable changes. Commit with message: `exp: <description>`.
3. Run the experiment on Modal:

   ```
   cd /home/datum/parameter-golf && \
   modal run run_modal_exp.py -- autoresearch_exp_<N> [KEY=VALUE ...]
   ```

   Any KEY=VALUE pairs override the defaults (which match the current best 6-hop config). Examples:

   ```
   # Test 8 hops
   modal run run_modal_exp.py -- exp_01 NUM_HOPS=8

   # Test wider embedding
   modal run run_modal_exp.py -- exp_02 EMBED_DIM=768

   # Test more attention heads
   modal run run_modal_exp.py -- exp_03 NUM_ATTN_HEADS=4
   ```

4. Read results: The Modal runner returns a structured dict with `val_bpb`, `val_loss`, `params`, `artifact_bytes`, `step_ms`, `exit_code`.
   - If `exit_code` is non-zero, the run crashed. Check the `error` field and attempt a fix. Give up after 3 attempts and revert.
5. Record in `results.tsv`: experiment_id, description, val_bpb, val_loss, params, artifact_bytes, steps, step_ms, status (kept/reverted/crashed).
6. Decision:
   - If `val_bpb` improved (lower than current best): keep the commit, update current best.
   - If `val_bpb` is equal or worse: revert with `git reset --hard HEAD~1`.
7. Repeat with a new hypothesis.
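The record-and-decide logic described above can be sketched as follows. The result-dict keys (`val_bpb`, `exit_code`, etc.) are those the Modal runner returns; the function name and the `.get()` defaults are illustrative.

```python
# Sketch of the record-and-decide step: append one TSV row per experiment
# and update the running best only when val_bpb strictly improves.
# record_and_decide is a hypothetical helper; the dict keys match the
# runner's documented return fields.

def record_and_decide(result, current_best_bpb, results_path="results.tsv"):
    """Append one TSV row and return (new_best_bpb, status)."""
    if result.get("exit_code", 1) != 0:
        status = "crashed"                      # keep best unchanged
        new_best = current_best_bpb
    elif result["val_bpb"] < current_best_bpb:  # lower is better
        status = "kept"
        new_best = result["val_bpb"]
    else:
        status = "reverted"                     # git reset --hard HEAD~1
        new_best = current_best_bpb

    row = [
        str(result.get(k, ""))
        for k in ("experiment_id", "description", "val_bpb", "val_loss",
                  "params", "artifact_bytes", "steps", "step_ms")
    ] + [status]
    with open(results_path, "a") as f:
        f.write("\t".join(row) + "\n")
    return new_best, status
```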
- Git repo: `https://github.com/ericdatum/connectome-golf.git` (remote: `origin`)
- Upstream: `https://github.com/openai/parameter-golf.git` (remote: `upstream`)
- Work on branch `autoresearch/<tag>`, push periodically to origin.
- Artifact size must be under 16,000,000 bytes (code + compressed model). Currently at 1.48MB so you have massive headroom.
- Do NOT modify the adjacency matrix data (ADJ_B64). The C. elegans connectome wiring is the core scientific claim. You can change how it's used but not what it is.
- Do NOT remove the JEPA or SigREG losses. You can change their weights (including setting very low) but the losses must remain in the code. They are part of the submission's identity.
- Do NOT replace MaskedLinear with dense layers. The sparse connectome constraint is the point. You can add dense layers elsewhere (projections, attention, etc.) but the hop layers must use the connectome mask.
- The model must produce valid val_bpb via the existing evaluation code. Don't modify the eval/BPB calculation.
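The 16,000,000-byte artifact cap above is easy to verify before submission. A minimal sketch, assuming the artifact is measured as the sum of gzip-compressed file sizes; the file paths and helper name are hypothetical.

```python
# Sketch: check that code + compressed model stay under the artifact cap.
# ARTIFACT_LIMIT comes from the constraints above; everything else here
# is illustrative.
import gzip

ARTIFACT_LIMIT = 16_000_000  # bytes (code + compressed model)

def check_artifact_size(paths):
    """Sum gzip-compressed sizes of the given files; return (total, ok)."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += len(gzip.compress(f.read()))
    return total, total < ARTIFACT_LIMIT
```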
- Number of hops: Try 4, 8, 10, 12. With skip connections, deeper may now work. Current best is 6.
- Skip connection frequency: Try skip_every=1 (every hop), skip_every=3, skip_every=0 (disabled).
- Cross-attention frequency: Try cross_attn_every=1 (every hop), cross_attn_every=3.
- Multi-head attention: Try num_attn_heads=2 or 4 (currently 1).
- Head dimension: Try head_dim=64, 256 (currently 128).
- Parallel connectome channels: Run multiple independent connectome circuits with shared masks. Try num_channels=2 or 4.
- Wider embedding: Try embed_dim=768 or 1024 (will increase params but you have massive headroom).
- Learning rate: Try 3e-4, 5e-4, 2e-3, 3e-3.
- Weight decay: Try 0.01, 0.05, 0.2.
- JEPA weight: Try 0.1, 0.25, 1.0 (currently 0.5).
- SigREG weight: Try 1e-3, 1e-5 (currently 1e-4).
- Mask ratio: Try 0.15, 0.5 (currently 0.3).
- Sequence length: Try TRAIN_SEQ_LEN=2048.
- Batch size: Try TRAIN_BATCH_TOKENS=1048576 (2× current).
- Activation function: Try SiLU, LeakyReLU, ReLU² instead of GELU in the connectome hops.
- Learnable mask refinement: Add a small learnable perturbation to the binary connectome mask (sigmoid(logits) * base_mask) so the model can slightly adjust connection strengths while preserving topology.
- Bidirectional connectome: Add the transpose of the adjacency matrix as a second mask, so information flows both directions through the connectome.
- Depth-wise scaling: Scale hop residuals differently at different depths (e.g., smaller early, larger late).
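Of the architectural ideas above, the learnable mask refinement is the most delicate to get right, since the topology constraint must survive. A minimal PyTorch sketch, assuming a `MaskedLinear`-style layer that stores a fixed binary mask as a buffer; the class and attribute names are illustrative, not the repo's actual API.

```python
# Sketch of "learnable mask refinement": a per-connection gate
# sigmoid(logits) scales the fixed binary connectome mask, so absent
# edges stay exactly zero (topology preserved) while present edges get
# learnable strengths. Names here are illustrative.
import torch
import torch.nn as nn

class RefinedMaskedLinear(nn.Module):
    def __init__(self, base_mask: torch.Tensor):
        super().__init__()
        out_f, in_f = base_mask.shape
        self.register_buffer("base_mask", base_mask.float())  # fixed topology
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        # Logits start at 0 -> sigmoid = 0.5, a uniform initial gate.
        self.gate_logits = nn.Parameter(torch.zeros(out_f, in_f))

    def forward(self, x):
        # Gate lies in (0, 1); multiplying by base_mask keeps masked-out
        # connections at exactly zero, so only strengths are refined.
        effective_mask = torch.sigmoid(self.gate_logits) * self.base_mask
        return nn.functional.linear(x, self.weight * effective_mask)
```

Because the gate is bounded in (0, 1) and multiplied by the binary mask, no gradient step can create a connection the connectome does not have, which keeps the change compatible with the ADJ_B64 constraint.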
- Make one change at a time. Multi-variable changes make it impossible to attribute improvements.
- Explore broadly first, then exploit. Try one experiment from each high-priority category before going deep on any single direction.
- Respect the parameter budget. At 1.64M params you're using <10% of the 16MB budget. You can go 10× bigger if it helps. But bigger isn't always better — the whole point of this architecture is parameter efficiency.
- Watch for the gradient flow problem. If you increase depth without proper skip connections, the model will stop learning (CE stays at ~6.93). This is the hardest-won lesson from development. If you see CE not dropping in the first 50 steps, the architecture is broken.
- The cross-position attention is where most params live. The connectome hops themselves are tiny (~5k weights each). The attention layers are ~100k each. If you're budget-constrained, add more hops before adding more attention.
- 1000 steps is enough to see trends but not convergence. A change that looks +0.05 BPB better at 1000 steps may be +0.1 better at 5000 steps, or it may plateau. Use 1000-step runs for screening, then do a full 5000-step run on the best config.
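The gradient-flow check above can be automated as an early-abort screen. Note that the ~6.93 plateau is consistent with ln(1024) ≈ 6.931, the cross-entropy of a uniform distribution over a 1024-entry vocabulary (a plausible reading given the `sp1024` tokenizer, though the source does not state the vocab size); the threshold, step count, and function name below are illustrative.

```python
# Sketch: detect the "CE stuck at ~6.93" failure mode described above.
# UNIFORM_CE = ln(1024) is an assumption based on the sp1024 tokenizer;
# architecture_is_broken is a hypothetical helper.
import math

UNIFORM_CE = math.log(1024)  # ~6.93

def architecture_is_broken(ce_history, check_step=50, margin=0.05):
    """True if cross-entropy has not dropped meaningfully below the
    uniform baseline by check_step (the first-50-steps rule of thumb)."""
    if len(ce_history) < check_step:
        return False  # too early to tell
    return ce_history[check_step - 1] > UNIFORM_CE - margin
```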
After each session, summarize:
- Total experiments run
- Experiments kept vs reverted
- Best val_bpb achieved and the configuration that produced it
- Top 3 most impactful changes (positive or negative)
- Recommended next directions
Leave results.tsv and run.log files untracked by git.