S4D-Lin SSM Hybrid — Fixing Why Mamba Failed in Parameter… #1013
himanshudongre wants to merge 2 commits into openai:main from
Conversation
… Golf First functional SSM architecture in Parameter Golf with zero throughput penalty. Previous SSM attempts (Hymba, PR openai#599) used Mamba's selective scan, which requires custom CUDA kernels and incurs a 3.4x throughput penalty. S4D-Lin replaces this with standard F.conv1d — pure PyTorch, fully torch.compile compatible. Results: 1.1682 BPB post-GPTQ-int5 (vs 1.1194 SOTA). The throughput problem is solved (116ms/step, matching baseline), but attention quality > SSM quality in lower layers at this scale. Detailed analysis of why, plus lessons for future SSM work. Checks off "State-space models" from Requests for PRs.
This is one of the most thorough research contributions in the competition. The journey from #846 (banned) through JEPA (failed on real text) to S4D-Lin (solved throughput, revealed quality gap) is exactly the kind of honest iteration that produces real insights. A few things stood out:

The scale deception finding is the most important takeaway here. -18% CE at dim=192 → +2.7% BPB at dim=512 is a warning that should be pinned to the top of every future architecture PR. Local small-scale experiments can be actively misleading, not just noisy — they can point in the opposite direction.

The S4D-Lin implementation is clean. Causal left-padding, normalized kernels, gated output with zero-init on the projection, CastedLinear for GPTQ compatibility — this is a proper drop-in block that others can reuse.

The 0.049 BPB quality gap is interesting but maybe not final. You tested SSM in the bottom 2 layers only. Have you considered flipping it — SSM in upper layers where local pattern completion matters more, attention in lower layers for global context? The MoE-Mamba literature suggests interleaved placement outperforms bottom-stacking. Also curious whether the kernel stays exponential after training or drifts to a different shape.

Two minor notes for completeness:
Really solid work. The fact that this is self-funded at $47 makes it even more impressive. Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
@MatoTeziTanka Thank you for the incredibly thorough review — these are exactly the kinds of questions that push this research forward.

Training logs

You're right that the raw logs were missing. Unfortunately, the SSM-specific GPU run was done on a self-funded RunPod instance and I ran out of credits mid-session — the pod got terminated before I could extract the SSM variant's training logs. The 3-seed baseline logs (from the SOTA PR #809 stack) are included, and the SSM submission was a modification of that same codebase with
These contain the full metrics tables referenced in the README. Painful lesson learned — always pull logs before anything else when credits are running low.

Layer placement — upper layers vs bottom-stacking

Great suggestion, and the MoE-Mamba interleaved placement reference is spot-on. My choice of bottom layers (0,1) was based on the hypothesis that SSM's temporal convolution would capture local n-gram patterns that feed into upper attention layers for longer-range dependencies. I haven't tested upper-layer or interleaved placement on GPU yet (budget constraints — this was self-funded at ~$47/run, pod now terminated). But looking at the local data, there's a suggestive signal:
SSM in layers 0,1 beat the baseline, but adding layer 2 made it worse. This hints that SSM may indeed work better in specific positions rather than "more SSM = better." Upper-layer or interleaved placement (e.g., layers 0, 5, 10) is on my list when I get compute access again.

Kernel shape drift

This is a question I hadn't analyzed until you asked, and the answer turns out to be genuinely interesting. I ran 500 training steps on a local model (dim=192, layers 0,1 as SSM) and compared kernel shapes before and after:

Slow-decay channels (long memory): Roughly maintain exponential shape but with perturbations. Successive-timestep ratios vary between 0.5–1.3 instead of the constant ratio that a pure exponential would give.

Medium-decay channels: Still mostly exponential (ratios ~0.65–0.75) but noisier.

Fast-decay channels (short memory): These are the surprise — the kernel tails go negative after training. Values like -0.006 appear where initialization was strictly positive. The kernels learn to subtract recent information — an anti-correlation pattern that's impossible with pure exponential decay.

Example (Layer 0, Channel 144): The kernel starts as a smooth exponential decay but learns a "Mexican hat"-like shape — strong positive at t=0, decaying, then slightly negative. This is essentially a learned difference-of-Gaussians filter, which in signal-processing terms detects change points rather than accumulating history. Makes sense for language — "the previous token was X but the one before wasn't" is useful information.

Caveat: this analysis is from a local run (dim=192, 500 steps on MPS), so the exact values may not hold at competition scale (dim=512, 6000+ steps on H100). But the qualitative finding — kernels drifting away from pure exponential toward non-monotonic shapes — is likely directionally correct. This suggests the exponential initialization is a good starting point, but the model benefits from the freedom to learn non-monotonic kernels.
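The successive-timestep ratio test used above is easy to reproduce. Here is an illustrative sketch (not the PR's analysis script — the decay factor, kernel length, and values are made up) showing that a pure exponential kernel has a constant ratio, while a kernel with a negative tail, like the trained fast-decay channels, produces a sign flip:

```python
import torch

def successive_ratios(kernel: torch.Tensor) -> torch.Tensor:
    """Ratio k[t+1] / k[t]; constant for a pure exponential kernel."""
    return kernel[1:] / kernel[:-1]

# Pure exponential decay: every ratio equals the decay factor.
decay = 0.8
k = decay ** torch.arange(8, dtype=torch.float32)
r = successive_ratios(k)
assert torch.allclose(r, torch.full_like(r, decay))

# A "Mexican hat"-like kernel (positive head, slightly negative tail),
# as observed after training, breaks the constant-ratio property:
k2 = k.clone()
k2[-2:] = -0.006          # negative tail values
assert (successive_ratios(k2) < 0).any()  # sign flip => non-exponential
```

A drifted-but-positive kernel shows up instead as ratios scattered around the initial decay factor (the 0.5–1.3 range mentioned above), so the same one-liner distinguishes all three regimes.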
A future experiment: initialize with a richer family (e.g., damped oscillatory kernels) and see if it converges faster or to a better minimum.

Int6 for SSM layers

Good call. The 0.018 BPB quantization gap (1.1499 pre-quant → 1.1682 post-GPTQ) is indeed larger than typical. The SSM kernels have a very specific structure (exponential decay ± learned perturbations) that doesn't quantize well with per-row uniform grids. Int6 for the

Thanks again for engaging with this at depth. The layer placement and kernel drift questions have genuinely opened new thoughts. I still have several ideas I want to test — including interleaved SSM placement, alternative kernel families, and some novel architectural approaches I've been prototyping locally on my Mac Mini M4. Unfortunately I'm completely out of RunPod credits at this point, so GPU-scale validation is on hold. I've applied for the development grant and am awaiting the result. Once (and if) I get credits, these will be the first things I run. In the meantime, I'll keep iterating locally. If anyone wants to fork this and test interleaved SSM placement or alternative kernel families, the code is designed to be drop-in modular.
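To make the quantization point concrete, here is a toy per-row uniform (min-max) quantizer applied to exponential-decay rows. This is a simplified stand-in for illustration only — it is not the competition's GPTQ pipeline, and the kernel shapes and bit widths are illustrative — but it shows how one extra bit roughly halves the uniform-grid error on this kind of structured weight:

```python
import torch

def rowwise_uniform_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Quantize each row to a uniform min-max grid with 2**bits levels."""
    lo = w.min(dim=1, keepdim=True).values
    hi = w.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-12) / (2 ** bits - 1)
    return torch.round((w - lo) / scale) * scale + lo

# Exponential-decay rows: most mass near t=0, long near-zero tail,
# so a uniform grid wastes levels on the empty mid-range.
t = torch.arange(64, dtype=torch.float32)
kernels = torch.exp(-torch.linspace(0.05, 1.0, 16).unsqueeze(1) * t)

err5 = (rowwise_uniform_quant(kernels, 5) - kernels).abs().mean()
err6 = (rowwise_uniform_quant(kernels, 6) - kernels).abs().mean()
assert err6 < err5  # finer grid => smaller mean rounding error
```

The negative perturbations learned during training widen each row's min-max range further, which stretches the grid and makes the int5 loss worse still — consistent with the larger-than-typical gap reported above.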
Summary
F.conv1d — pure PyTorch, fully torch.compile compatible

Research Journey
This work started after my PR #846 (two-pass n-gram rescoring) was closed in the enforcement sweep. I committed to pure architectural innovation — no eval tricks. The journey:
Key Findings
F.conv1d
rowclip_int6 but submissions need full_gptq_int5

Architecture
Hybrid Transformer: 2 lower layers = S4D-Lin SSM blocks (causal depthwise conv1d with learned exponentially-decaying kernels), 9 upper layers = standard Transformer with XSA attention.
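A minimal sketch of an SSM block matching the description above (causal left-padding, normalized exponential-decay kernels, gated output, zero-initialized output projection). Class and parameter names here are illustrative assumptions, not the PR's exact code — the actual implementation uses CastedLinear for GPTQ compatibility and competition-specific shapes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class S4DLinBlock(nn.Module):
    """Illustrative S4D-Lin-style block: causal depthwise conv1d with
    kernels initialized to per-channel exponential decay."""

    def __init__(self, dim: int, kernel_len: int = 64):
        super().__init__()
        self.kernel_len = kernel_len
        # Per-channel decay rates spread so channels cover short through
        # long memory horizons (illustrative range).
        log_decay = torch.linspace(-3.0, -0.03, dim).unsqueeze(1)  # (dim, 1)
        t = torch.arange(kernel_len, dtype=torch.float32)
        kernel = torch.exp(log_decay * t)                  # exponential decay
        kernel = kernel / kernel.sum(dim=1, keepdim=True)  # normalized kernels
        self.kernel = nn.Parameter(kernel)                 # (dim, kernel_len)
        self.gate = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.out_proj.weight)  # zero-init: block starts as identity
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        h = x.transpose(1, 2)                        # (batch, dim, seq)
        # Causal left-padding: position t only sees positions <= t.
        h = F.pad(h, (self.kernel_len - 1, 0))
        h = F.conv1d(h, self.kernel.unsqueeze(1), groups=h.shape[1])
        h = h.transpose(1, 2)                        # (batch, seq, dim)
        # Gated output path, residual connection.
        return x + self.out_proj(h * torch.sigmoid(self.gate(x)))
```

Because the whole forward pass is standard `F.pad`/`F.conv1d`/matmul ops, a block like this composes with torch.compile without custom kernels, which is the core of the throughput fix described in the summary.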
Reproduction
See full README