[WIP] [tx] Implement stacked weights #1018
Conversation
Code Review
This pull request introduces StackedDecoderLayers to optimize the transformer forward pass using nnx.vmap and jax.lax.scan, which is a significant performance improvement for training and prefill. The changes are well-encapsulated, making the model code cleaner and more efficient. However, I've identified a critical bug in the load_safetensors utility related to how parameter paths are handled, which would prevent it from loading weights correctly for certain layer types.
skyrl-tx/tx/utils/models.py
```python
updates = []
for path, param in nnx.to_flat_state(nnx.state(module)):
    if filter_fn is not None and not filter_fn(path):
        continue
    key = key_prefix + get_param_key(path)
    if skip_lora and ("lora_A" in path or "lora_B" in path or "lora_scaling" in path or "lora_ranks" in path):
        continue
    if "experts" in path:
        tensor = np.stack([tensors[key_prefix + get_expert_key(path, i)].T for i in range(config.get_num_experts())], axis=0)
    else:
        tensor = tensors[key] if "embed_tokens" in key else tensors[key].T
    if len(path) >= 2 and path[-2] in {"q_proj", "k_proj", "v_proj", "o_proj"}:
        tensor = tensor.reshape(param.shape)
    assert param.shape == tensor.shape, f"shape mismatch for {key}"
    updates.append((path, jax.device_put(tensor.astype(param.dtype), param.sharding)))
nnx.update(module, nnx.from_flat_state(updates))
```
The checks for parameter types like "lora_A" in path or "experts" in path are incorrect. The path variable is a tuple of nnx.path.PathEntry objects, not strings, so these checks will always evaluate to False. This will prevent weights for LoRA, experts, and projections from being loaded correctly.
To fix this, you should convert the path to a tuple of strings before performing these checks. This will ensure that the logic correctly identifies the parameter types and applies the appropriate loading logic.
Suggested change:
```python
updates = []
for path, param in nnx.to_flat_state(nnx.state(module)):
    path_str_tuple = tuple(map(str, path))
    if filter_fn is not None and not filter_fn(path):
        continue
    key = key_prefix + get_param_key(path)
    if skip_lora and any(p in path_str_tuple for p in ("lora_A", "lora_B", "lora_scaling", "lora_ranks")):
        continue
    if "experts" in path_str_tuple:
        tensor = np.stack([tensors[key_prefix + get_expert_key(path, i)].T for i in range(config.get_num_experts())], axis=0)
    else:
        tensor = tensors[key] if "embed_tokens" in key else tensors[key].T
    if len(path_str_tuple) >= 2 and path_str_tuple[-2] in {"q_proj", "k_proj", "v_proj", "o_proj"}:
        tensor = tensor.reshape(param.shape)
    assert param.shape == tensor.shape, f"shape mismatch for {key}"
    updates.append((path, jax.device_put(tensor.astype(param.dtype), param.sharding)))
nnx.update(module, nnx.from_flat_state(updates))
```
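To illustrate the failure mode, here is a minimal, self-contained sketch. The `PathEntry` class below is a hypothetical stand-in for a non-string path entry (not the actual nnx type); it only shows why `in` membership tests against a tuple of such objects never match a plain string:

```python
class PathEntry:
    """Hypothetical stand-in for a non-string path entry wrapping a key."""

    def __init__(self, key):
        self.key = key

    def __str__(self):
        return str(self.key)


path = (PathEntry("layers"), PathEntry("experts"), PathEntry("kernel"))

# Membership compares the string against PathEntry objects, so it never matches:
print("experts" in path)  # False

# Converting the entries to strings first makes the check behave as intended:
path_str_tuple = tuple(map(str, path))
print("experts" in path_str_tuple)  # True
```

This is why the suggested fix computes `path_str_tuple` once at the top of the loop and runs all string membership and indexing checks against it.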
```python
        return self.num_layers

    def __getitem__(self, index: int) -> nnx.Module:
        """Get view into layer at index (stays synced with stacked state)."""
```
Add to the docstring that this should only be used for tests and weight loading
```python
        return self.get_metadata("_parent")[self.get_metadata("_idx")].shape


class StackedDecoderLayers(nnx.Module):
```
Probably the easiest way to implement DeepSeekV3 is to add a DualStackedDecoderLayers that has two StackedDecoderLayers as members and exposes the same interface as StackedDecoderLayers (modulo the constructor, which can take two create_layer_fn functions and their respective layer counts as arguments). This could be a separate PR.
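A rough sketch of that delegation pattern in plain Python, with `FakeStackedLayers` as a hypothetical stand-in (the real implementation would subclass `nnx.Module` and wrap two actual `StackedDecoderLayers`; names and signatures here are illustrative assumptions, not the repo's API):

```python
class FakeStackedLayers:
    """Hypothetical stand-in for StackedDecoderLayers."""

    def __init__(self, create_layer_fn, num_layers):
        self.layers = [create_layer_fn(i) for i in range(num_layers)]

    def __len__(self):
        return len(self.layers)

    def __getitem__(self, index):
        return self.layers[index]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


class DualStackedLayers:
    """Two stacks behind the same interface as a single stack.

    Indices in [0, len(first)) resolve into the first stack, the rest
    into the second, so callers can treat the pair as one contiguous
    run of layers.
    """

    def __init__(self, create_first_fn, num_first, create_second_fn, num_second):
        self.first = FakeStackedLayers(create_first_fn, num_first)
        self.second = FakeStackedLayers(create_second_fn, num_second)

    def __len__(self):
        return len(self.first) + len(self.second)

    def __getitem__(self, index):
        if index < len(self.first):
            return self.first[index]
        return self.second[index - len(self.first)]

    def __call__(self, x):
        # Run the first stack, then feed its output into the second.
        return self.second(self.first(x))
```

For example, `DualStackedLayers(f, 3, g, 2)` behaves like a single five-layer stack: `len` returns 5, indices 3 and 4 resolve into the second stack, and calling it chains both stacks.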
This is based on all the great work that @raulchen did in #996 and #906; it also fixes the performance regression in decoding relative to the main branch.