[tx] General implementation of trainable Hyper Connections #1008
base: main
Conversation
Code Review
This pull request introduces a general implementation of Hyper Connections as an extension to the transformer layers. The changes are mainly in tx/layers/connectors.py where the Connector module is defined, and in tx/models/deepseekv3.py to integrate it into the decoder layers.
My review found a couple of issues:
- An unused trainable parameter in the Connector class, which should be removed for clarity.
- A bug in DeepseekV3Model when handling intermediate hidden states for expansion_rate > 1, where squeeze() is used incorrectly.
Overall, the implementation of the Hyper Connections logic seems to follow the intended pattern of pre/post processing around existing attention and MLP blocks. The changes are well-contained. Addressing the mentioned points will improve the robustness and clarity of the implementation.
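To make the pre/post pattern concrete, here is a minimal sketch of a static hyper-connection wrapper in Flax NNX. The class name, the pre/post split, and the identity-style initialization are illustrative assumptions, not the PR's actual API:

```python
import jax.numpy as jnp
from flax import nnx


class HyperConnection(nnx.Module):
    """Sketch: mixes n parallel residual streams into a block input and
    scatters the block output back. With expansion_rate=1 and the
    initialization below, this reduces to a plain residual connection."""

    def __init__(self, expansion_rate: int):
        n = expansion_rate
        # Width connection: mixes the n streams into the block input.
        # Initialized to select stream 0, mimicking the standard residual path.
        self.width = nnx.Param(jnp.zeros((n,)).at[0].set(1.0))
        # Depth connection: scales how the block output is added back to each stream.
        self.depth = nnx.Param(jnp.zeros((n,)).at[0].set(1.0))

    def pre(self, h):  # h: (..., n, hidden)
        # Weighted mix over the stream axis -> block input of shape (..., hidden).
        return jnp.einsum("...nc,n->...c", h, self.width.value)

    def post(self, h, block_out):  # block_out: (..., hidden)
        # Add the block output into each stream, weighted per stream.
        return h + self.depth.value[:, None] * block_out[..., None, :]
```

A decoder layer would then wrap its existing blocks, e.g. h = conn.post(h, mlp(conn.pre(h))), leaving the attention and MLP code untouched.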
In tx/models/deepseekv3.py:

```diff
  for layer_idx, layer in enumerate(self.layers):
      if output_hidden_states:
-         all_hidden_states.append(hidden_states)
+         all_hidden_states.append(hidden_states.squeeze())
```
hidden_states.squeeze() is used here to process intermediate hidden states. This will only work correctly if expansion_rate is 1. For expansion_rate > 1, squeeze() will have no effect because the expansion dimension has size n > 1. This will result in appending a tensor with an incorrect shape (..., n, C) to all_hidden_states, which is inconsistent with other states and likely to cause issues downstream.
A more robust approach is to aggregate across the expansion dimension, for example by taking the mean.
Suggested change:

```diff
- all_hidden_states.append(hidden_states.squeeze())
+ all_hidden_states.append(hidden_states.mean(axis=-2))
```
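A quick shape check (hypothetical shapes) showing why squeeze() is a no-op here while mean(axis=-2) aggregates the streams:

```python
import jax.numpy as jnp

h = jnp.ones((2, 8, 4, 64))    # (batch, seq, n=4 streams, hidden)
print(h.squeeze().shape)       # (2, 8, 4, 64): no size-1 axes, so unchanged
print(h.mean(axis=-2).shape)   # (2, 8, 64): aggregated over the stream axis
```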
In tx/layers/connectors.py, the Connector signature containing the unused trainable parameter flagged in the summary above:

```python
hidden_dim: int,
expansion_rate: int,
*,
trainable: bool = False,
```
skyrl-tx/tx/layers/layernorm.py (Outdated)
```diff
  self.eps = eps
  self.weight = Param(
-     size, dtype=dtype, kernel_init=nnx.with_partitioning(nnx.initializers.normal(), jax.P(None)), rngs=rngs
+     size, dtype=dtype, kernel_init=nnx.with_partitioning(nnx.initializers.ones_init(), jax.P(None)), rngs=rngs
```
Temporary, for testing.
https://docs.pytorch.org/docs/stable/generated/torch.nn.modules.normalization.RMSNorm.html
PyTorch also initializes the weight to one by default.
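For reference, a minimal RMSNorm sketch with the ones initialization under discussion. This is simplified from the repo's actual layernorm.py; the plain nnx.Param here stands in for the repo's Param helper and its partitioning setup:

```python
import jax.numpy as jnp
from flax import nnx


class RMSNorm(nnx.Module):
    def __init__(self, size: int, eps: float = 1e-6):
        self.eps = eps
        # Scale initialized to ones, matching PyTorch's RMSNorm default.
        self.weight = nnx.Param(jnp.ones((size,)))

    def __call__(self, x):
        # Root-mean-square normalization over the feature axis, then scale.
        rms = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + self.eps)
        return self.weight.value * (x / rms)
```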
Addresses #952
This PR is a general implementation of Hyper Connections.
This is intended to be an extension like LoRA, where the default case mimics a standard residual connection with identity mappings.
Default case: trainable=False, expansion_rate=1.
For expansion_rate > 1, the hidden state is expanded into multiple parallel streams (a hedged sketch of the implied data flow follows).
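A sketch of that data flow, inferred from the review discussion; the stream count n and the final mean aggregation are assumptions, not confirmed details of the PR:

```python
import jax.numpy as jnp

n = 4                                               # expansion rate (assumed)
h = jnp.ones((2, 8, 64))                            # (batch, seq, hidden)
streams = jnp.repeat(h[..., None, :], n, axis=-2)   # (batch, seq, n, hidden)
# ... decoder layers read from and write to the n streams via hyper connections ...
out = streams.mean(axis=-2)                         # back to (batch, seq, hidden)
```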
Todos
Future work