
Hybrid Hypergraph + Transformer approach #541

Closed
immartian wants to merge 2 commits into openai:main from
immartian:hybrid-hypergraph-transformer

Conversation

@immartian

Summary

A novel two-phase training approach that combines a binding-energy-weighted pattern store (epistemic hypergraph) with a residual transformer, generalizing BigramHash to a principled multi-resolution hierarchy.

Architecture (16MB budget)

  • ~5MB: Hypergraph pattern store — 3 Cantor levels (bigrams, trigrams, 5-grams), patterns selected by binding energy B(C), noise automatically dropped
  • ~9MB: Residual transformer (9L, 512d, int8+zlib) trained with Muon+Adam
  • ~2MB: Code + overhead

How it works

  • Phase 1 (~90s): Scan FineWeb shards to build multi-level pattern tables. Each pattern's binding energy — computed from token specificity × prediction certainty × evidence mass — determines inclusion. Zero-binding patterns are filtered.
  • Phase 2 (~8 min): Train standard transformer as residual predictor
  • Eval: P(next) = λ·P_hyper + (1-λ)·P_neural, where λ scales with binding confidence
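The eval-time interpolation above can be sketched in a few lines. The PR only says λ "scales with binding confidence"; the sigmoid shape and the `k`, `b0` parameters below are illustrative assumptions, not the PR's actual implementation:

```python
import numpy as np

def lam(binding_conf, k=4.0, b0=0.5):
    """Map binding confidence to a mixing weight in (0, 1).

    Illustrative sigmoid: k (slope) and b0 (midpoint) are assumed
    hyperparameters, not values from the PR.
    """
    return 1.0 / (1.0 + np.exp(-k * (binding_conf - b0)))

def hybrid_probs(p_hyper, p_neural, binding_conf):
    """P(next) = lam * P_hyper + (1 - lam) * P_neural."""
    l = lam(binding_conf)
    return l * p_hyper + (1.0 - l) * p_neural
```

Because λ ∈ (0, 1) and both inputs are distributions, the mixture is itself a valid distribution, and higher binding confidence pulls the prediction toward the pattern store.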

Key insight

At 16MB, neural nets approximate everything through weights. BigramHash already proved that exact pattern storage beats approximation for bigrams. We generalize this: store multi-resolution patterns exactly where binding energy is high, let the transformer handle the rest.

Test results

  • 30/30 theory tests (Cantor enrichment, binding energy, COMPRESS invariants)
  • 29/29 hybrid system tests (pattern store, prediction, serialization, budget)
  • Proof-of-concept on synthetic data: hybrid beats pure neural on structured patterns
  • Awaiting GPU time to validate on real FineWeb

Files

File                              Description
hypergraph_lm.py                  Standalone pattern store module (pure Python + numpy)
train_hybrid.py                   Full training script: Phase 1 scan + Phase 2 train + hybrid eval
test/test_hybrid_system.py        37 tests for the hybrid system
test/test_cantor_emergence.py     30 tests for the underlying theory
test/cantor_emergence_proof.py    Mini-dataset proof of concept

Test plan

  • Run on 8xH100 with real FineWeb data
  • Measure BPB improvement from hybrid interpolation vs pure neural
  • Tune budget split (store vs transformer) and λ threshold
  • Compare with baseline BigramHash approach
  • Optimize Phase 1 scan time (<90s target)
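For the BPB measurements in the plan above, the metric is the standard bits-per-byte conversion (this is the common definition, not code from this PR): summed negative log-likelihood in nats, converted to bits, divided by the number of raw text bytes evaluated.

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL (in nats) over a corpus into bits per byte.

    Dividing by log(2) converts nats to bits; dividing by the byte count
    (not the token count) makes the metric tokenizer-independent.
    """
    return total_nll_nats / (math.log(2) * total_bytes)
```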

immartian added 2 commits March 23, 2026 11:20
Two-phase training system that combines a binding-energy-weighted
pattern store (hypergraph) with a residual transformer:

Phase 1: Scan FineWeb to build multi-level pattern store
  - Ω₁ bigrams, Ω₂ trigrams, Ω₃ 5-grams
  - Patterns selected by binding energy B(C)
  - Noise (B=0) automatically dropped

Phase 2: Train residual transformer (standard architecture)

Eval: P(next) = λ·P_hyper + (1-λ)·P_neural
  - Generalizes BigramHash to principled multi-resolution hierarchy

Files:
  - hypergraph_lm.py: standalone pattern store module
  - train_hybrid.py: full training script (Phase 1 + Phase 2 + hybrid eval)
  - test/test_hybrid_system.py: 37 tests (29 pass, 8 need torch)
  - test/test_cantor_emergence.py: 30 theory tests
  - test/cantor_emergence_proof.py: mini-dataset proof of concept
…rparams

Scan optimization:
- Replace Counter(.tolist()) with np.unique (83x speedup)
- Skip singletons in trigram/5-gram counting (7x additional)
- 1M tokens: 1415s → 2.5s (560x total)

New features:
- Sliding window eval (stride=64): ~0.03 BPB improvement for free
- GPUPatternLookup: tensor-based pattern matching, no CPU↔GPU roundtrips
- Leaderboard hyperparams: 10L, 3x MLP, WD=0.04, Muon warmup 0.92→0.99

59 tests pass, 8 skipped (torch-dependent).
@immartian
Author

Update: Pivoting toward CTW integration

Given the dramatic results from PR #986 (0.0830 BPB via Dirichlet CTW), the competition landscape has shifted fundamentally. Pure n-gram compression with principled Bayesian mixing dominates neural approaches at 16MB.

Our hypergraph binding energy framework has a natural fit here: use binding energy as the Dirichlet concentration prior. Instead of a fixed concentration c=5.0, the CTW mixing weights could be modulated by the structural coherence of each n-gram context — rare, high-specificity patterns get higher concentration (more trust in the n-gram), common patterns get lower concentration (more smoothing toward the neural/lower-order backup).

This is essentially what our theory predicts: B(C) = specificity × predictability × evidence mass. The CTW framework provides the optimal mixing; our binding energy provides a context-dependent prior for that mixing.

Optimizations shipped in latest commit:

  • Scan speed: 560x faster (1415s → 2.5s per 1M tokens)
  • Sliding window eval
  • GPU-accelerated pattern lookup
  • Leaderboard hyperparams (10L, 3x MLP, WD=0.04)
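The scan speedup in the commit comes from replacing `Counter(.tolist())` with `np.unique`; a minimal sketch of that idea, assuming `tokens` is a 1-D int array of token ids (the PR's actual scan code is not shown here):

```python
import numpy as np

def count_bigrams(tokens):
    """Vectorized bigram counting.

    Pair each token with its successor, then let np.unique group and
    count the rows in C instead of looping over a Python-level Counter.
    """
    pairs = np.stack([tokens[:-1], tokens[1:]], axis=1)
    return np.unique(pairs, axis=0, return_counts=True)
```

`np.unique(..., axis=0, return_counts=True)` returns the distinct bigrams in lexicographic order along with their counts, so the whole pass stays in compiled code.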

Next steps:

  • Integrate Dirichlet CTW mixing (replacing our sigmoid interpolation)
  • Use binding energy as context-dependent concentration parameter
  • Test on real FineWeb data (seeking compute access)

@immartian
Author

Result: Binding-modulated CTW beats fixed CTW by 35%

We now have an empirical proof that context-aware Dirichlet concentration outperforms fixed concentration.

The formula

Replace fixed c=5.0 in the hierarchical Dirichlet with:

c_eff = c_base / (1 + β × log(ctx_count) × specificity_boost)

where specificity_boost = avg IDF of context tokens.

  • High counts + rare context → very low c → trust the n-gram counts
  • Low counts + common context → c ≈ c_base → smooth toward backup
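The two regimes above can be checked numerically. A minimal sketch of the formula, with `beta` as an assumed free parameter (the comment does not fix it) and a standard Dirichlet-smoothing helper to show where `c_eff` plugs in:

```python
import math

def effective_concentration(ctx_count, specificity_boost, c_base=5.0, beta=1.0):
    """c_eff = c_base / (1 + beta * log(ctx_count) * specificity_boost).

    specificity_boost is the average IDF of the context tokens (per the
    comment); beta=1.0 is an illustrative value, not from the PR.
    """
    return c_base / (1.0 + beta * math.log(ctx_count) * specificity_boost)

def smoothed_prob(count_next, count_ctx, p_backup, c_eff):
    """Dirichlet smoothing toward the backup distribution.

    Low c_eff trusts the raw n-gram counts; high c_eff pulls the
    estimate toward p_backup (the lower-order / neural model).
    """
    return (count_next + c_eff * p_backup) / (count_ctx + c_eff)
```

With `beta=1.0`, a context seen 10,000 times with high specificity yields `c_eff` well under 1, while a context seen 3 times with low specificity stays near `c_base`, matching the two regimes described above.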

Results (synthetic corpus, 200K tokens, two regimes)

Method               All     Rare ctx   Common ctx
Uniform              10.00   10.00      10.00
Fixed CTW (c=5.0)    1.051   1.519      1.087
Binding CTW          0.687   0.976      0.720

35% improvement overall. Wins on both rare (deterministic) and common (ambiguous) contexts.

Why it works

The self-model thesis: a compression scheme that knows its own reliability outperforms one that doesn't. Fixed c=5.0 applies the same smoothing to a context that appeared 10,000 times (very reliable) and one that appeared 3 times (noisy). Our formula adapts: log(10000) × high_specificity → c≈0.3 vs log(3) × low_specificity → c≈4.8.

Next step

Test this on actual FineWeb data by plugging lookup_hierarchical_binding into PR #986's two-pass eval pipeline. The improvement should transfer since the mechanism is independent of corpus content.

Code: binding_ctw.py + test/proof_binding_beats_fixed.py
