
Commit ba2c9b0

Merge pull request #19 from heneil/master

HELM Update

2 parents 782d252 + b6731f0

File tree

9 files changed: +130 -0 lines changed

app/projects/helm/assets/HELM.png (86.6 KB)
app/projects/helm/assets/HMLA.png (239 KB)
app/projects/helm/assets/MiCE.png (225 KB)

app/projects/helm/page.mdx

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
import { Authors, Badges } from '@/components/utils'

# HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

<Authors
  authors="Neil He⋆, Yale University; Rishabh Anand⋆, Yale University; Hiren Madhu, Yale University; Ali Maatouk, Yale University; Smita Krishnaswamy, Yale University; Leandros Tassiulas, Yale University; Menglin Yang, Yale University; Rex Ying, Yale University"
/>

<Badges
  venue="NeurIPS 2025"
  github="https://github.com/Graph-and-Geometric-Learning/helm"
  arxiv="https://arxiv.org/abs/2505.24722"
  pdf="https://arxiv.org/pdf/2505.24722"
/>

## 1. Introduction
---

Large language models (LLMs) have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that failing to respect the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We thus propose to operate fully in hyperbolic space, known for its expansive, scale-free, and low-distortion properties. We introduce HELM, a family of HypErbolic Large Language Models, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MICE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MICE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and we evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures -- up to 4% -- over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.

### 1.1. Motivation: Non-Euclidean Structures in Token Distributions

Recent works have spurred discussion of non-Euclidean structure in text data and in the token inputs to large language models. In particular, certain general tokens can appear in an abundance of varying contexts. However, there are only a few such tokens, while there are exponentially many more specific tokens that appear only in very limited scenarios. This semantic hierarchy results in a scale-free distribution of tokens that cannot be faithfully captured by Euclidean geometry. Furthermore, each token has highly localized structure, as shown below, that is difficult to model using a single geometric space.

![Token Curvature Distribution|scale=0.35](./assets/token_curv.png)

In the figure above, the distribution of Ricci curvature of token embeddings from decoder-only LLMs shows substantial variation in negative curvature, implying highly localized hyperbolicity.

## 2. Method

We propose several modules that work together to build the first hyperbolic large language model:

1. We introduce Mixture-of-Curvature Experts (MiCE), a hyperbolic MoE module in which each expert operates in a distinct curvature space to model fine-grained geometric structures in the token distributions.
2. We propose Hyperbolic Rotary Positional Encoding (HoPE) and prove numerous associated theoretical guarantees.
3. We propose Hyperbolic Multi-Head Latent Attention (HMLA), which reduces the memory complexity of the model at inference time.

---

### 2.1 Mixture-of-Curvature Experts (MiCE)

![MiCE Visualization|scale=0.35](./assets/MiCE.png)

Let $\mathbf{x}_t\in\mathbb{L}^{K,n}$ be the $t$-th token input. Then $\mathrm{MiCE}^{N_s}_{N_r}:\mathbb{L}^{K,n}\to \mathbb{L}^{K,n}$, where $N_r$ is the number of routed experts and $N_s$ is the number of shared experts. First, we pass $\mathbf{x}_t$ through a gating module to obtain the gating score of each routed expert, denoted $g_{t,i}$ for $1\leq i\leq N_r$, given as,
$$
\begin{equation}
\begin{split}
g_{t,i} = \frac{g'_{t,i}}{\sum_{j=1}^{N_r}g'_{t,j}};\quad s_{t,j}=\mathrm{act}\left((\mathbf{x}_t)_s^\top \mathbf{y}_j\right);\quad g'_{t,j} = \begin{cases}
s_{t,j}, & s_{t,j}\in\mathrm{Topk}(\{s_{t,k}\}_{k\leq N_r}, K_r)\\
0, & \text{otherwise}
\end{cases}
\end{split}
\end{equation}
$$

Here, $s_{t,j}$ is the token-expert affinity with activation function $\mathrm{act}$, $\mathbf{y}_j$ is the centroid vector of the $j$-th routed expert, $\mathrm{Topk}(S, A)$ picks the top $A$ values from the set $S$, and $K_r$ is the number of activated experts. Then, the token is passed through each shared and routed expert. Let $\mathrm{HFFN}_{r,i}:\mathbb{L}^{K_{r,i}, m}\to \mathbb{L}^{K_{r,i}, m}$ be the routed experts and $\mathrm{HFFN}_{s,i}:\mathbb{L}^{K_{s,i}, m}\to \mathbb{L}^{K_{s,i}, m}$ be the shared experts, defined through hyperbolic feedforward networks. The values of $K_{r,i}$ and $K_{s,i}$ can vary across experts, i.e., $\textbf{each expert lives on a distinct manifold}$. To align the input's manifold with the experts' manifolds, we first project the tokens onto the expert manifolds via $\mathbf{s}_{t,i} = \sqrt{K/K_{s,i}}\,\mathbf{x}_t$ and $\mathbf{r}_{t,i} = \sqrt{K/K_{r,i}}\,\mathbf{x}_t$. The projected token is passed through each expert and projected back to the input manifold, giving $\mathbf{y}_{t,i} = \sqrt{K_{s,i}/K}\,\mathrm{HFFN}_{s,i}\left(\mathbf{s}_{t,i}\right)$ and $\mathbf{z}_{t,i} = \sqrt{K_{r,i}/K}\,\mathrm{HFFN}_{r,i}\left(\mathbf{r}_{t,i}\right)$. The output of $\mathrm{MiCE}^{N_s}_{N_r}$ is given by,
$$
\begin{equation}
\begin{split}
\mathrm{MiCE}^{N_s}_{N_r}(\mathbf{x}_t)= \mathbf{x}_t\oplus_\mathcal{L}\left(\frac{\sum_{i=1}^{N_s}\mathbf{y}_{t,i} +\sum_{i=1}^{N_r} \mathbf{z}_{t,i}}{\sqrt{-K}\left\|\sum_{i=1}^{N_s}\mathbf{y}_{t,i} +\sum_{i=1}^{N_r} \mathbf{z}_{t,i}\right\|_\mathcal{L}} \right)
\end{split}
\end{equation}
$$

The constants $\sqrt{K_{s,i}/K}, \sqrt{K_{r,i}/K}$ project from the experts' manifolds to the input manifold, ensuring that the output of each shared and routed expert lives on the same manifold.
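
To make the routing and curvature alignment concrete, here is a minimal PyTorch-style sketch of a MiCE layer. It is a simplified illustration, not the HELM implementation: the hyperbolic feedforward experts are stood in by plain MLPs, `lorentz_add` and `project_to_hyperboloid` are simplified placeholders for HELM's Lorentzian operations, the sigmoid activation and the gate-weighting of routed outputs are assumptions in line with standard MoE practice, and all names (`MiCELayer`, etc.) are illustrative.

```python
import torch
import torch.nn as nn


def project_to_hyperboloid(x_space, curv):
    # Lift spatial coordinates onto the Lorentz hyperboloid of curvature curv < 0
    # by recomputing the time coordinate: x_time = sqrt(||x_space||^2 - 1/curv).
    time = torch.sqrt(x_space.pow(2).sum(-1, keepdim=True) - 1.0 / curv)
    return torch.cat([time, x_space], dim=-1)


def lorentz_add(x, y, curv):
    # Simplified stand-in for the Lorentzian residual x (+)_L y: add spatial parts
    # and re-project onto the hyperboloid. HELM uses a properly weighted combination.
    return project_to_hyperboloid(x[..., 1:] + y[..., 1:], curv)


class MiCELayer(nn.Module):
    def __init__(self, ambient_dim, n_routed, n_shared, k_active, curv_in=-1.0):
        super().__init__()
        self.k_active, self.curv_in = k_active, curv_in
        # One (negative) curvature per expert: each expert lives on its own manifold.
        self.curv_routed = nn.Parameter(-torch.ones(n_routed))
        self.curv_shared = nn.Parameter(-torch.ones(n_shared))
        self.centroids = nn.Parameter(torch.randn(n_routed, ambient_dim - 1))  # y_j
        mlp = lambda: nn.Sequential(nn.Linear(ambient_dim, 4 * ambient_dim), nn.SiLU(),
                                    nn.Linear(4 * ambient_dim, ambient_dim))
        self.routed = nn.ModuleList(mlp() for _ in range(n_routed))  # stand-ins for HFFN_{r,i}
        self.shared = nn.ModuleList(mlp() for _ in range(n_shared))  # stand-ins for HFFN_{s,i}

    def forward(self, x):  # x: (T, ambient_dim) Lorentz points [time, space]
        s = torch.sigmoid(x[:, 1:] @ self.centroids.t())        # affinities s_{t,j} = act((x_t)_s^T y_j)
        topk_val, topk_idx = torch.topk(s, self.k_active, dim=-1)
        gates = torch.zeros_like(s).scatter_(-1, topk_idx, topk_val)
        gates = gates / gates.sum(dim=-1, keepdim=True)          # g_{t,i}, nonzero only on the top-k

        out = torch.zeros_like(x)
        for i, expert in enumerate(self.shared):                 # shared experts: always active
            to_exp = (self.curv_in / self.curv_shared[i]).sqrt() # sqrt(K / K_{s,i})
            to_in = (self.curv_shared[i] / self.curv_in).sqrt()  # sqrt(K_{s,i} / K)
            out = out + to_in * expert(to_exp * x)               # y_{t,i}
        for i, expert in enumerate(self.routed):                 # routed experts, gate-weighted (assumed)
            to_exp = (self.curv_in / self.curv_routed[i]).sqrt()
            to_in = (self.curv_routed[i] / self.curv_in).sqrt()
            out = out + gates[:, i:i + 1] * to_in * expert(to_exp * x)  # z_{t,i}

        # Normalize the aggregate (Euclidean norm as a stand-in for ||.||_L) and combine
        # with the residual stream, mirroring the MiCE output equation above.
        direction = out / ((-self.curv_in) ** 0.5 * out.norm(dim=-1, keepdim=True))
        return lorentz_add(x, direction, self.curv_in)
```

The per-expert curvatures are what distinguish MiCE from a standard MoE layer: the $\sqrt{K/K_i}$ rescalings move a token onto each expert's manifold and back, so the aggregated output remains on the input manifold.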

### 2.2 Hyperbolic Rotary Positional Encoding

Given $T$ tokens $\mathbf{X}$, where $\mathbf{x}_i\in \mathbb{L}^{K,d}$ ($d$ even), let $\mathbf{Q}, \mathbf{K}$ be the hyperbolic queries and keys. The hyperbolic rotary positional encoding applied to the $i$-th token is,
$$
\begin{equation}
\mathrm{HoPE}({\mathbf{z}_i}) = \begin{bmatrix}
\sqrt{\|\mathbf{R}_{i,\Theta}(\mathbf{z}_i)_s\|^2-1/K}\\ \mathbf{R}_{i,\Theta}(\mathbf{z}_i)_s
\end{bmatrix};\quad \mathbf{R}_{i, \Theta}=\begin{pmatrix}
\mathbf{R}_{i,\theta_1} & 0 & 0 & \ldots & 0\\
0 & \mathbf{R}_{i,\theta_2} & 0 & \ldots & 0\\
\vdots & \vdots & \ddots & \ddots & \vdots\\
0 & 0 & \ldots & \ldots & \mathbf{R}_{i, \theta_{d/2}}
\end{pmatrix},
\end{equation}
$$
where $\mathbf{R}_{i,\theta_l}$ is the 2D rotation matrix parameterized by angle $i\theta_l$ and $\mathbf{z}$ can be a query $\mathbf{q}_i$ or key $\mathbf{k}_j$.
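
As a concrete reading of the HoPE equation, the sketch below applies standard RoPE-style block rotations to the spatial components and then recomputes the Lorentz time coordinate. The frequency schedule (`base`), shapes, and function names are assumptions for illustration, not HELM's implementation.

```python
import torch


def hope(z, positions, curv=-1.0, base=10000.0):
    """z: (T, d + 1) Lorentz points [time, space] with d even; positions: (T,) token indices."""
    z_s = z[..., 1:]                                      # spatial part (z_i)_s, dimension d
    d = z_s.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=z_s.dtype) / d)   # theta_1 .. theta_{d/2}
    angles = positions[:, None].to(z_s.dtype) * inv_freq  # (T, d/2), angle i * theta_l
    cos, sin = torch.cos(angles), torch.sin(angles)

    x1, x2 = z_s[..., 0::2], z_s[..., 1::2]               # pair up coordinates for 2D rotations
    rot = torch.empty_like(z_s)
    rot[..., 0::2] = x1 * cos - x2 * sin                  # apply R_{i, theta_l} blockwise
    rot[..., 1::2] = x1 * sin + x2 * cos

    time = torch.sqrt(rot.pow(2).sum(-1, keepdim=True) - 1.0 / curv)  # sqrt(||R z_s||^2 - 1/K)
    return torch.cat([time, rot], dim=-1)                 # back on the hyperboloid


# Example: rotate hyperbolic queries for a short sequence (assumed shapes, K = -1).
T, d = 4, 8
spatial = torch.randn(T, d)
q = torch.cat([torch.sqrt(spatial.pow(2).sum(-1, keepdim=True) + 1.0), spatial], dim=-1)
q_rot = hope(q, torch.arange(T))
```

Because each 2D rotation preserves the Euclidean norm of the spatial part, the recomputed time coordinate equals the original one, so HoPE keeps the point on the same hyperboloid.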

#### 2.2.1 Theoretical Guarantees

HoPE comes with a slew of theoretical guarantees that match those of its Euclidean counterpart, including long-range decay, robustness to arbitrary token distances, and the ability to learn both diagonal and off-diagonal attention patterns.

### 2.3 Hyperbolic Multi-Head Latent Attention

![HMLA Visualization|scale=0.35](./assets/HMLA.png)

Roughly, HMLA first maps the tokens and keys to latent vectors before up-projecting back to the original dimension for hyperbolic attention score computation. During inference, HMLA only requires caching the latent key-value pairs. As a result, the memory footprint of HMLA is $O((n_{kv} + n_r)L)$, where $L$ is the number of layers and $n_r, n_{kv}$ are latent dimensions. In contrast, the hyperbolic self-attention used in previous hyperbolic Transformers requires storing the full-sized keys and values, resulting in a memory complexity of $O(2nn_hL)$, where $n$ is the number of heads and $n_h$ is the dimension per head. By choosing $n_{kv}, n_r\ll nn_h$, we have $(n_{kv}+n_r)\ll 2nn_h$, resulting in a $\textbf{significantly smaller memory footprint}$ while maintaining the same time complexity of $O((nn_h)^2)$. Additionally, the latent query projection results in a smaller activation footprint during training. Together, these mechanisms enable far greater scalability.
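
A back-of-the-envelope comparison under assumed, illustrative dimensions (not HELM's actual configuration) shows how caching latent vectors rather than full keys and values shrinks the per-token, per-layer cache:

```python
# Illustrative cache-size comparison per token, per layer; dimensions are assumptions.
n_heads, head_dim = 16, 64          # n and n_h
n_kv_latent, n_rope = 128, 32       # n_kv and n_r: compressed latent dimensions

full_kv_cache = 2 * n_heads * head_dim   # standard attention: full keys + values
latent_cache = n_kv_latent + n_rope      # HMLA: cached latent key-value pairs

print(f"full KV cache per token/layer: {full_kv_cache} values")                 # 2048
print(f"latent cache per token/layer:  {latent_cache} values")                  # 160
print(f"reduction factor:              {full_kv_cache / latent_cache:.1f}x")    # 12.8x
```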

---

## 3. HELM Model Architecture
---

![HELM Architecture|scale=0.5](./assets/HELM.png)

We introduce two variants of the HELM family of hyperbolic large language models: HELM-D, a dense model, and HELM-MiCE, a hyperbolic MoE LLM using our MiCE module. The overall architecture is as follows: tokenized text is first mapped to learned hyperbolic word embeddings, which are then passed through a series of hyperbolic decoder blocks, each consisting of two components: 1) the attention component, where the embeddings are normalized by a hyperbolic RMSNorm layer, then processed by an attention block such as HMLA or self-attention, and finally added to the embeddings through a Lorentzian residual connection; and 2) the HFFN component, where the processed embeddings are again normalized by hyperbolic RMSNorm before being passed through an HFFN block and residually added to the output of the attention block. For HELM-MiCE, the HFFN block can be either a dense block or a MiCE block. The output of the final decoder block is normalized once more before being projected to logits for next-token prediction. A minimal sketch of this block structure is shown below.
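
The following sketch shows one decoder block's control flow only. All sub-modules (the hyperbolic RMSNorm layers, the attention block, the HFFN or MiCE block, and the Lorentzian residual `lorentz_add`) are placeholders supplied by the caller; only the wiring follows the description above.

```python
# Minimal sketch of a HELM-style decoder block's wiring; sub-modules are placeholders.
import torch.nn as nn


class HELMDecoderBlock(nn.Module):
    def __init__(self, attn: nn.Module, ffn: nn.Module,
                 attn_norm: nn.Module, ffn_norm: nn.Module, lorentz_add):
        super().__init__()
        self.attn = attn                 # HMLA or hyperbolic self-attention
        self.ffn = ffn                   # dense HFFN, or a MiCE block for HELM-MiCE
        self.attn_norm = attn_norm       # hyperbolic RMSNorm
        self.ffn_norm = ffn_norm         # hyperbolic RMSNorm
        self.lorentz_add = lorentz_add   # Lorentzian residual connection

    def forward(self, x):
        # 1) Attention component: normalize, attend, residual-add in Lorentz space.
        h = self.lorentz_add(x, self.attn(self.attn_norm(x)))
        # 2) HFFN component: normalize again, transform, residual-add.
        return self.lorentz_add(h, self.ffn(self.ffn_norm(h)))
```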

---

## 4. Experiments

### 4.1 Multi-QA Benchmarks
---

![Multi-QA Benchmark Performance|scale=0.8](./assets/qa_exp.png)

We test the performance of HELM on Multi-QA benchmarks against popular Euclidean LLM architectures, namely LLaMA and DeepSeekV3, at the 100M and 1B scales, where all models are trained on the same Wikipedia dataset. HELM models consistently outperform their Euclidean counterparts by up to 4%.

### 4.2 Ablation: MiCE vs. Constant-Curvature Spaces
---

![MiCE Ablation Benchmark Performance|scale=0.8](./assets/mice_ab.png)

We test the effectiveness of MiCE against a variant of HELM-MiCE in which all experts are kept at the same curvature value, namely -1, denoted MiCE-Const. HELM-MiCE consistently outperforms MiCE-Const on Multi-QA benchmarks.

### 4.3 Qualitative Study: Semantic Hierarchy Modeling
---

![Embedding Norm vs. Word Specificity|scale=0.8](./assets/qualitative.png)

We provide case studies for HELM-MiCE 1B and DeepSeekV3 1B, showing the embedding norms in the final layers for words of varying levels of specificity (top table) and for a sample question taken from the MMLU benchmark (bottom table). For HELM-MiCE, more generic words (e.g., subject) are clustered closer to the origin than more specific words (e.g., biology), which in turn have smaller norms than even more specific words (e.g., photosynthesis). However, this does not necessarily hold for the DeepSeekV3 1B model, demonstrating how HELM-MiCE better captures semantic hierarchies.

config/publications.ts

Lines changed: 11 additions & 0 deletions
@@ -20,6 +20,17 @@ export interface Publication {
}

export const publications: Publication[] = [
  {
    title: "HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts",
    authors: "Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying",
    venue: "NeurIPS 2025",
    page: "helm",
    code: "https://github.com/Graph-and-Geometric-Learning/helm",
    paper: "https://arxiv.org/pdf/2505.24722",
    abstract: "Natural language exhibits hierarchical and non-Euclidean structure that conventional Euclidean LLMs fail to model adequately, resulting in representational limitations and training instabilities. HELM introduces a family of fully hyperbolic large language models designed to align model geometry with the inherent structure of text. The framework includes both a dense variant (HELM-D) and a Mixture-of-Curvature Experts model (HELM-MICE), supported by newly developed hyperbolic counterparts of RoPE, RMSNorm, and an efficient hyperbolic latent attention mechanism. HELM constitutes the first successful scaling of fully hyperbolic LLMs to the billion-parameter regime. Empirical evaluations on benchmarks such as MMLU and ARC demonstrate consistent improvements of up to 4% over comparable Euclidean architectures, indicating the advantages of hyperbolic geometry for large-scale language modeling.",
    impact: "HELM introduces fully hyperbolic large language models and proposes several new modules: hyperbolic RoPE, hyperbolic RMSNorm, and Mixture-of-Curvature Experts. At the scale of 1B parameters, HELM achieves geometry-aligned token representations and fine-grained curvature-aware modeling, consistently improving over Euclidean LLMs on reasoning benchmarks such as MMLU and ARC.",
    tags: [Tag.MultiModalFoundationModel],
  },
  {
    title: "TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval",
    authors: "Jialin Chen, Ziyu Zhao, Gaukhar Nurbek, Aosong Feng, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying",
authors: "Jialin Chen, Ziyu Zhao, Gaukhar Nurbek, Aosong Feng, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying",
