
Non-record: S4D-Lin SSM Hybrid — Fixing Why Mamba Failed in Parameter…#1013

Open
himanshudongre wants to merge 2 commits into openai:main from himanshudongre:nonrecord/ssm-s4dlin-hybrid

Conversation

@himanshudongre commented Mar 28, 2026

Summary

  • First functional SSM in Parameter Golf with zero throughput penalty (116ms/step, matching baseline)
  • Previous SSM attempts (Hymba, PR #599) used Mamba's selective scan, which requires custom CUDA kernels and incurred a 3.4x throughput penalty. S4D-Lin uses standard F.conv1d — pure PyTorch, fully torch.compile compatible
  • Result: 1.1682 bpb post-GPTQ-int5 (vs 1.1194 SOTA). Throughput problem solved, but attention quality > SSM quality in lower layers at this scale
  • Artifact: 13.0 MB (well under 16MB cap)
  • Checks off "State-space models" from Requests for PRs

Research Journey

This work started after my PR #846 (two-pass n-gram rescoring) was closed in the enforcement sweep. I committed to pure architectural innovation — no eval tricks. The journey:

  1. JEPA-LM (see companion PR) — promising on synthetic data (-19.5% CE), failed on real text (-0.24%)
  2. Monarch matrices — inconclusive local results
  3. S4D-Lin SSM (this PR) — solved the throughput problem but quality gap remains

Key Findings

| Finding | Detail |
| --- | --- |
| SSM throughput is solvable | S4D-Lin matches transformer speed via standard F.conv1d |
| Attention > SSM in lower layers | At dim=512, attention provides more value than SSM's O(n) advantage |
| Local tests mislead | -18% CE at dim=192 → +2.7% BPB at dim=512 |
| Quantization sensitivity | GPTQ int5 degrades SSM weights more than attention weights (0.018 bpb loss) |
| Export config matters | merged_leader defaults to rowclip_int6 but submissions need full_gptq_int5 |

Architecture

Hybrid Transformer: 2 lower layers = S4D-Lin SSM blocks (causal depthwise conv1d with learned exponentially-decaying kernels), 9 upper layers = standard Transformer with XSA attention.

```
kernel[d, t] = C[d] * exp(-rate[d] * t)    # Multi-scale temporal receptive fields
```
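For concreteness, here is a minimal runnable sketch of such a block, assuming the hyperparameters above. The class and parameter names (`S4DLinBlock`, `log_rate`, `C`, `gate`, `proj`) are illustrative, not the PR's actual code; the PR additionally uses CastedLinear for GPTQ compatibility, which a plain `nn.Linear` stands in for here.

```python
# Illustrative S4D-Lin-style block: causal depthwise conv1d with learned
# exponentially-decaying kernels, gated output, zero-init projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class S4DLinBlock(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 64):
        super().__init__()
        self.kernel_size = kernel_size
        # Log-spaced decay rates give each channel a different temporal scale.
        self.log_rate = nn.Parameter(torch.linspace(-2.0, 1.0, dim))
        self.C = nn.Parameter(torch.ones(dim))   # per-channel output scale
        self.gate = nn.Linear(dim, dim)          # gated output path
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)         # zero-init on the projection
        nn.init.zeros_(self.proj.bias)

    def materialize_kernel(self) -> torch.Tensor:
        # kernel[d, t] = C[d] * exp(-rate[d] * t), normalized per channel
        t = torch.arange(self.kernel_size, dtype=torch.float32,
                         device=self.log_rate.device)
        rate = self.log_rate.exp()               # positive decay rates
        k = self.C[:, None] * torch.exp(-rate[:, None] * t[None, :])
        return k / k.sum(dim=-1, keepdim=True)   # normalized kernels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        B, T, D = x.shape
        k = self.materialize_kernel()                 # (D, K)
        xt = x.transpose(1, 2)                        # (B, D, T)
        xt = F.pad(xt, (self.kernel_size - 1, 0))     # causal left-padding
        y = F.conv1d(xt, k[:, None, :], groups=D)     # depthwise conv per channel
        y = y.transpose(1, 2)                         # back to (B, T, D)
        return x + self.proj(torch.sigmoid(self.gate(x)) * y)
```

Because the mixing is a single grouped `F.conv1d`, the block stays pure PyTorch and compiles cleanly, unlike a custom selective-scan kernel.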

Reproduction

```
SSM_LAYERS=2 SSM_KERNEL_SIZE=64 TTT_ENABLED=0 NGRAM_EVAL_ENABLED=0 \
INT6_TENSOR_CLASSES=attention,mlp,local_mixer,ssm_proj \
EXPORT_QUANTIZER=full_gptq_int5 EXPORT_COMPRESSOR=lzma \
torchrun --nproc_per_node=8 train_gpt.py
```

See full README

… Golf

First functional SSM architecture in Parameter Golf with zero throughput
penalty. Previous SSM attempts (Hymba, PR openai#599) used Mamba's selective scan
which requires custom CUDA kernels, resulting in 3.4x throughput penalty.
S4D-Lin replaces this with standard F.conv1d — pure PyTorch, fully
torch.compile compatible.

Results: 1.1682 bpb post-GPTQ-int5 (vs 1.1194 SOTA). The throughput problem
is solved (116ms/step, matching baseline) but attention quality > SSM quality
in lower layers at this scale. Detailed analysis of why, plus lessons for
future SSM work.

Checks off "State-space models" from Requests for PRs.
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request Mar 28, 2026
@MatoTeziTanka commented Mar 28, 2026

This is one of the most thorough research contributions in the competition. The journey from #846 (banned) through JEPA (failed on real text) to S4D-Lin (solved throughput, revealed quality gap) is exactly the kind of honest iteration that produces real insights.

A few things stood out:

The scale deception finding is the most important takeaway here. -18% CE at dim=192 → +2.7% BPB at dim=512 is a warning that should be pinned to the top of every future architecture PR. Local small-scale experiments can be actively misleading, not just noisy — they can point in the opposite direction.

The S4D-Lin implementation is clean. Causal left-padding, normalized kernels, gated output with zero-init on the projection, CastedLinear for GPTQ compatibility — this is a proper drop-in block that others can reuse. The F.conv1d approach proving zero throughput penalty (116ms vs 117ms baseline) definitively answers the "can SSMs work in Parameter Golf" question on the throughput side.

The 0.049 BPB quality gap is interesting but maybe not final. You tested SSM in the bottom 2 layers only. Have you considered flipping it — SSM in upper layers where local pattern completion matters more, attention in lower layers for global context? The MoE-Mamba literature suggests interleaved placement outperforms bottom-stacking. Also curious if the kernel stays exponential after training or drifts to a different shape.

Two minor notes for completeness:

  1. No training logs in the submission (just the README table). Adding the raw log would help anyone trying to reproduce.
  2. The 0.018 BPB quantization gap (1.1499 → 1.1682) is larger than typical attention-only models. Might be worth trying int6 with the ssm_proj class you added, since the extra bit could recover some of that.

Really solid work. The fact that this is self-funded at $47 makes it even more impressive.


Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@himanshudongre (Author)

@MatoTeziTanka Thank you for the incredibly thorough review — these are exactly the kinds of questions that push this research forward.

Training logs

You're right that the raw logs were missing. Unfortunately, the SSM-specific GPU run was done on a self-funded RunPod instance and I ran out of credits mid-session — the pod got terminated before I could extract the SSM variant's training logs. The 3-seed baseline logs (from the SOTA PR #809 stack) are included, and the SSM submission was a modification of that same codebase with SSM_LAYERS=2. I'll add the local experiment result files that do exist:

  • ssm_hybrid_results.json — throughput + quality comparison across 5 configurations (dim=192, local)
  • ssm_quality_512_results.json — quality at dim=512 (local)

These contain the full metrics tables referenced in the README. Painful lesson learned — always pull logs before anything else when credits are running low.

Layer placement — upper layers vs bottom-stacking

Great suggestion, and the MoE-Mamba interleaved placement reference is spot-on. My choice of bottom layers (0,1) was based on the hypothesis that SSM's temporal convolution would capture local n-gram patterns that feed into upper attention layers for longer-range dependencies.

I haven't tested upper-layer or interleaved placement on GPU yet (budget constraints — this was self-funded at ~$47/run, pod now terminated). But looking at the local data, there's a suggestive signal:

| Config (dim=192, local) | Final CE |
| --- | --- |
| SSM-Light (layers 0,1) | 1.3245 |
| SSM-First (layers 0,1,2) | 1.3406 |
| Pure Transformer | 1.3312 |

SSM in layers 0,1 beat the baseline, but adding layer 2 made it worse. This hints that SSM may indeed work better in specific positions rather than "more SSM = better." Upper-layer or interleaved placement (e.g., layers 0, 5, 10) is on my list when I get compute access again.
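To make the placement variants concrete, a hypothetical layer-plan helper can express both the bottom-stacked configuration used here and the interleaved one proposed above (the helper and its names are illustrative, not the PR's actual config parsing):

```python
# Hypothetical helper: map a set of SSM layer positions to a per-layer plan.
def layer_plan(n_layers: int, ssm_positions: set) -> list:
    """Return 'ssm' or 'attn' for each layer index."""
    return ["ssm" if i in ssm_positions else "attn" for i in range(n_layers)]

bottom = layer_plan(11, {0, 1})           # this PR: SSM in the 2 lowest layers
interleaved = layer_plan(11, {0, 5, 10})  # proposed: spread across the stack
```

Swapping placements then only changes the position set, which keeps an interleaved sweep cheap to script once GPU access is available.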

Kernel shape drift

This is a question I hadn't analyzed until you asked, and the answer turns out to be genuinely interesting. I ran 500 training steps on a local model (dim=192, layers 0,1 as SSM) and compared kernel shapes before and after:

Slow-decay channels (long memory): Roughly maintain exponential shape but with perturbations. Successive-timestep ratios vary between 0.5–1.3 instead of the constant ratio that a pure exponential would give.

Medium-decay channels: Still mostly exponential (ratios ~0.65–0.75) but noisier.

Fast-decay channels (short memory): These are the surprise — the kernel tails go negative after training. Values like -0.006 appear where initialization was strictly positive. The kernels learn to subtract recent information — an anti-correlation pattern that's impossible with pure exponential decay.

Example (Layer 0, Channel 144):

```
Init:  0.554  0.247  0.110  0.049  0.022  0.010  0.004  0.002
Post:  0.520  0.214  0.094  0.025  0.014  0.011 -0.006 -0.002
```

The kernel starts as a smooth exponential decay but learns a "Mexican hat"-like shape — strong positive at t=0, decaying, then slightly negative. This is essentially a learned difference-of-Gaussians filter, which in signal processing terms detects change points rather than accumulating history. Makes sense for language — "the previous token was X but the one before wasn't" is useful information.
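The ratio diagnostic used above is easy to reproduce: a pure exponential has a constant successive-timestep ratio, so drift shows up as widely varying (or negative) ratios. A small sketch on the example kernel from this comment:

```python
# Successive-timestep ratios: constant for a pure exponential, so variation
# (and a negative tail) signals drift away from exponential decay.
def successive_ratios(kernel):
    return [b / a for a, b in zip(kernel, kernel[1:]) if a != 0]

# Example values from Layer 0, Channel 144 in the comment above.
init = [0.554, 0.247, 0.110, 0.049, 0.022, 0.010, 0.004, 0.002]
post = [0.520, 0.214, 0.094, 0.025, 0.014, 0.011, -0.006, -0.002]

r_init = successive_ratios(init)  # near-constant (~0.45 up to rounding)
r_post = successive_ratios(post)  # varies widely; tail ratios go negative
```

Running the same check per channel over a checkpoint would quantify how many channels stay exponential versus drift to non-monotonic shapes.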

Caveat: this analysis is from a local run (dim=192, 500 steps on MPS) so the exact values may not hold at competition scale (dim=512, 6000+ steps on H100). But the qualitative finding — kernels drifting away from pure exponential toward non-monotonic shapes — is likely directionally correct.

This suggests the exponential initialization is a good starting point but the model benefits from the freedom to learn non-monotonic kernels. A future experiment: initialize with a richer family (e.g., damped oscillatory kernels) and see if it converges faster or to a better minimum.

Int6 for SSM layers

Good call. The 0.018 BPB quantization gap (1.1499 pre-quant → 1.1682 post-GPTQ) is indeed larger than typical. The SSM kernels have a very specific structure (exponential decay ± learned perturbations) that doesn't quantize well with per-row uniform grids. Int6 for the ssm_proj class is a clean fix — the submission code already supports INT6_TENSOR_CLASSES=attention,mlp,local_mixer,ssm_proj. I'd estimate recovering ~0.005–0.008 BPB from the extra bit, though that needs a GPU run to confirm.
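A toy illustration of why the extra bit should help (this is plain uniform quantization, not the PR's GPTQ pipeline): quantizing an exponential-decay kernel on a per-row uniform grid loses the most in the small-magnitude tail, and the int6 grid is half as coarse as int5.

```python
# Toy per-row symmetric uniform quantization of an exponential-decay kernel,
# comparing worst-case error at 5 vs 6 bits. Illustrative only.
import math

def quantize_uniform(values, bits):
    levels = 2 ** (bits - 1) - 1                 # symmetric signed grid
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]

kernel = [math.exp(-0.5 * t) for t in range(64)]  # decay rate 0.5, assumed
err5 = max(abs(q - v) for q, v in zip(quantize_uniform(kernel, 5), kernel))
err6 = max(abs(q - v) for q, v in zip(quantize_uniform(kernel, 6), kernel))
# Here err6 comes out at roughly half err5: one extra bit halves the grid step.
```

Whether this translates into the estimated ~0.005–0.008 BPB recovery still needs the GPU run, but the direction is consistent with the grid-spacing argument.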


Thanks again for engaging with this at depth. The layer placement and kernel drift questions have genuinely opened new thoughts. I still have several ideas I want to test — including interleaved SSM placement, alternative kernel families, and some novel architectural approaches I've been prototyping locally on my Mac Mini M4. Unfortunately I'm completely out of RunPod credits at this point, so GPU-scale validation is on hold. I've applied for the development grant and am awaiting the result. Once (and if) I get credits, these will be the first things I run. In the meantime, I'll keep iterating locally.

If anyone wants to fork this and test interleaved SSM placement or alternative kernel families, the code is designed to be drop-in modular.

