Hybrid Hypergraph + Transformer approach #541
Two-phase training system that combines a binding-energy-weighted pattern store (hypergraph) with a residual transformer.

Phase 1: Scan FineWeb to build a multi-level pattern store
- Ω₁ bigrams, Ω₂ trigrams, Ω₃ 5-grams
- Patterns selected by binding energy B(C)
- Noise (B=0) automatically dropped

Phase 2: Train a residual transformer (standard architecture)

Eval: P(next) = λ·P_hyper + (1-λ)·P_neural
- Generalizes BigramHash to a principled multi-resolution hierarchy

Files:
- hypergraph_lm.py: standalone pattern store module
- train_hybrid.py: full training script (Phase 1 + Phase 2 + hybrid eval)
- test/test_hybrid_system.py: 37 tests (29 pass, 8 need torch)
- test/test_cantor_emergence.py: 30 theory tests
- test/cantor_emergence_proof.py: mini-dataset proof of concept
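A minimal sketch of the Phase 1 scan, assuming a toy `binding_energy()` that uses evidence mass only (the actual B(C) in hypergraph_lm.py also weighs specificity and predictability):

```python
from collections import Counter

def binding_energy(pattern, count, counts):
    # Illustrative stand-in: evidence mass only. The real B(C) also
    # factors in specificity and predictability per the PR description.
    return count - 1  # singletons get B = 0 and are dropped as noise

def build_pattern_store(tokens, orders=(2, 3, 5)):
    """Phase 1 sketch: one level per n-gram order
    (Ω₁ = bigrams, Ω₂ = trigrams, Ω₃ = 5-grams)."""
    store = {}
    for n in orders:
        counts = Counter(tuple(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        # Keep only patterns with positive binding energy; B = 0 is noise.
        store[n] = {p: c for p, c in counts.items()
                    if binding_energy(p, c, counts) > 0}
    return store
```

The transformer of Phase 2 is standard and not sketched here.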
…rparams

Scan optimization:
- Replace Counter(.tolist()) with np.unique (83x speedup)
- Skip singletons in trigram/5-gram counting (7x additional)
- 1M tokens: 1415s → 2.5s (560x total)

New features:
- Sliding-window eval (stride=64): ~0.03 BPB free improvement
- GPUPatternLookup: tensor-based pattern matching, no CPU↔GPU roundtrips
- Leaderboard hyperparams: 10L, 3x MLP, WD=0.04, Muon warmup 0.92→0.99

59 tests pass, 8 skipped (torch-dependent).
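The Counter→np.unique swap can be sketched as follows: pack each bigram into a single int64 so `np.unique` counts rows in vectorized C code instead of a Python-level loop. This assumes token ids fit in 32 bits, which holds for typical LM vocabularies.

```python
import numpy as np

def count_bigrams_np(tokens: np.ndarray) -> dict:
    """Vectorized bigram counting via np.unique, replacing the
    Counter(arr.tolist()) hot path described above."""
    # Pack (left, right) token pairs into one int64 per bigram.
    packed = (tokens[:-1].astype(np.int64) << 32) | tokens[1:].astype(np.int64)
    vals, counts = np.unique(packed, return_counts=True)
    # Unpack back to (left, right) tuples for a Counter-compatible dict.
    return {(int(v >> 32), int(v & 0xFFFFFFFF)): int(c)
            for v, c in zip(vals, counts)}
```

The same packing trick extends to trigrams and 5-grams with wider integers or structured views, which is where the singleton-skipping optimization applies.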
Update: Pivoting toward CTW integration

Given the dramatic results from PR #986 (0.0830 BPB via Dirichlet CTW), the competition landscape has shifted fundamentally: pure n-gram compression with principled Bayesian mixing dominates neural approaches at 16MB.

Our hypergraph binding energy framework has a natural fit here: use binding energy as the Dirichlet concentration prior. Instead of a fixed concentration c=5.0, the CTW mixing weights could be modulated by the structural coherence of each n-gram context — rare, high-specificity patterns get higher concentration (more trust in the n-gram), while common patterns get lower concentration (more smoothing toward the neural/lower-order backup). This is essentially what our theory predicts: B(C) = specificity × predictability × evidence mass. The CTW framework provides the optimal mixing; our binding energy provides a context-dependent prior for that mixing.

Optimizations shipped in latest commit:
Next steps:
Result: Binding-modulated CTW beats fixed CTW by 35%

We now have empirical proof that context-aware Dirichlet concentration outperforms fixed concentration: the fixed c = 5.0 is replaced by a concentration c(C) that is modulated by the binding energy B(C) of each context.
Results (synthetic corpus, 200K tokens, two regimes)
35% improvement overall, with wins on both rare (deterministic) and common (ambiguous) contexts.

Why it works

The self-model thesis: a compression scheme that knows its own reliability outperforms one that doesn't. Fixed c=5.0 applies the same smoothing to a context that appeared 10,000 times (very reliable) and one that appeared 3 times (noisy). Our formula adapts the concentration to the evidence behind each context.

Next step

Test this on actual FineWeb data by plugging the binding-modulated concentration into the CTW mixer from PR #986.

Code:
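A minimal sketch of the idea, assuming standard Dirichlet smoothing toward a backup distribution. The `adaptive_concentration` helper and its c₀/(1+B) form are illustrative, not the PR's exact formula:

```python
def dirichlet_predict(counts, total, backup, c):
    """Dirichlet-smoothed next-token distribution for one context:
    P(x | C) = (n_x + c * P_backup(x)) / (n + c)."""
    return {x: (counts.get(x, 0) + c * p) / (total + c)
            for x, p in backup.items()}

def adaptive_concentration(binding_energy, c0=5.0):
    # Hypothetical modulation: contexts with high binding energy B(C)
    # get a smaller concentration, i.e. less smoothing toward the backup;
    # low-B (noisy) contexts fall back toward the fixed c0 = 5.0 baseline.
    return c0 / (1.0 + binding_energy)
```

With this shape, a context seen 10,000 times with high B(C) trusts its own counts, while a context seen 3 times leans on the lower-order/neural backup, matching the reliability argument above.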
Summary
A novel two-phase training approach that combines a binding-energy-weighted pattern store (epistemic hypergraph) with a residual transformer, generalizing BigramHash to a principled multi-resolution hierarchy.
Architecture (16MB budget)
How it works
P(next) = λ·P_hyper + (1-λ)·P_neural, where λ scales with binding confidence

Key insight
At 16MB, neural nets approximate everything through weights. BigramHash already proved that exact pattern storage beats approximation for bigrams. We generalize this: store multi-resolution patterns exactly where binding energy is high, let the transformer handle the rest.
Test results
Files
- hypergraph_lm.py
- train_hybrid.py
- test/test_hybrid_system.py
- test/test_cantor_emergence.py
- test/cantor_emergence_proof.py

Test plan