Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
71 changes: 71 additions & 0 deletions records/track_non_record_16mb/2026-03-23_RIOM_v1_recur/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
This package is the first RIOM-flavored non-record variant. It replaces the baseline's 9 distinct blocks with 3 shared transformer blocks repeated 3 times, with a lightweight learned recurrence gate after each shared block application, while keeping the tokenizer, dataset, optimizer family, and export path unchanged.

Why this is RIOM-style:
- Effective depth is increased through parameter sharing instead of distinct weights.
- The 3x3 recurrence keeps the nominal 9-block depth budget while dropping parameter count and compressed bytes sharply.
- The change is isolated to the record-local script. Training flow, quantized export, and tokenizer-aware bpb accounting remain the same as the baseline package.

Architecture:
- shared transformer depth: `3`
- recurrence loops: `3`
- effective depth: `9`
- recurrence gate: per-loop, per-shared-block learned vector mixed through `sigmoid`
- `VOCAB_SIZE=1024`, `MODEL_DIM=512`, `NUM_HEADS=8`, `NUM_KV_HEADS=4`, `MLP_MULT=2`
- tied embeddings preserved

Tokenizer and dataset:
- Tokenizer unchanged: official `fineweb_1024_bpe.model`
- Dataset unchanged: official `fineweb10B_sp1024`
- Training shards present locally: `1/195`
- Validation accounting for this smoke run: first `1,048,576` official validation tokens via `VAL_MAX_TOKENS`

Exact command used:
```bash
source ../../../.venv/bin/activate
RUN_ID=riom_v1_dev_mlx_20260323 \
DATA_PATH=../../../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../../../data/tokenizers/fineweb_1024_bpe.model \
OUT_DIR=. \
ITERATIONS=50 \
TRAIN_BATCH_TOKENS=8192 \
VAL_LOSS_EVERY=0 \
VAL_BATCH_SIZE=524288 \
VAL_MAX_TOKENS=1048576 \
TRAIN_LOG_EVERY=10 \
python3 -u train_gpt.py
```

Hardware:
- Apple Silicon arm64 MacBook Air
- macOS 26.3.1
- Python 3.13.12
- MLX 0.31.1

Measured results from `train.log`:
- model params: `6,040,088`
- pre-quant: `val_loss=5.4143`, `val_bpb=3.2439`
- post-quant roundtrip: `val_loss=5.42207763`, `val_bpb=3.24862008`
- training time: `234.488s`
- final eval time: `105.973s`

Artifact size accounting:
- code bytes: `51,404`
- compressed model bytes: `2,273,437`
- total counted bytes: `2,324,841`
- raw MLX snapshot bytes: `23,119,132`

Comparison to v0 on the same dev prefix:
- `val_bpb`: `3.2749 -> 3.2486`
- total bytes: `6,258,232 -> 2,324,841`
- params: `17,059,912 -> 6,040,088`

What is unfinished:
- This is still a local development smoke, not an upstream-ready leaderboard submission.
- Validation was capped with `VAL_MAX_TOKENS`; rerun with `VAL_MAX_TOKENS=0` before any upstream submission.
- This MLX-only path demonstrates the idea locally, but a CUDA/PyTorch port is still required for a serious record attempt.
- `author` and `github_id` are set conservatively to the GitHub handle because no separately verified real-name metadata was available in this workspace.

Next planned ablations:
- port this shared-depth recurrence into the official CUDA `train_gpt.py`
- tune recurrence gate initialization and loop-specific learning dynamics
- add sliding-window evaluation on top of this recurrent package
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
mlx==0.31.1
numpy
sentencepiece
huggingface-hub
datasets
tqdm
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
{
"author": "hesong0222-dev",
"github_id": "hesong0222-dev",
"name": "RIOM v1 Shared-Depth Recurrence (MLX dev smoke)",
"blurb": "Non-record RIOM development package: 3 shared transformer blocks repeated 3 times with learned recurrence gates, using the official SP1024 tokenizer/dataset and the first 1,048,576 validation tokens on Apple Silicon MLX.",
"date": "2026-03-23T00:00:00Z",
"track": "non-record-16mb",
"val_loss": 5.42207763,
"val_bpb": 3.24862008,
"pre_quant_val_loss": 5.4143,
"pre_quant_val_bpb": 3.2439,
"step_stop": 50,
"wallclock_seconds": 234.488,
"bytes_total": 2324841,
"bytes_model_int8_zlib": 2273437,
"bytes_code": 51404,
"hardware": "Apple Silicon arm64 (MLX)",
"validation_tokens": 1048576,
"effective_layers": 9,
"shared_depth": 3,
"recur_loops": 3,
"evaluation_mode": "standard",
"note": "Local development run on a validation prefix; rerun with VAL_MAX_TOKENS=0 before any upstream submission."
}
Loading