Goal: Learn cross-modal associations between vision and audio
- Hebbian association matrix (512x512 bilinear map)
- DINOv2 visual encoder (384-dim)
- Whisper audio encoder (512-dim)
- Sparse random projection (384/512 → 512 sparse)
- VGGSound dataset (24,604 clips)
- Gradient InfoNCE training (replaced broken Hebbian approximation)
- Symmetric bidirectional loss (V→A + A→V)
- Autonomous self-improvement (Cortex daemon + LLM mutations)
- Web dashboard with goals tracking
- YouTube video interaction
Results: v2a_MRR=0.55, a2v_MRR=0.54, 1,240+ experiments
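The symmetric bidirectional InfoNCE objective above can be sketched as follows. This is a minimal NumPy illustration, not the project's training code: the score is a bilinear map `v^T W a` over the 512-dim projected embeddings, and the temperature and batch size are assumed values.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_infonce(V, A, W, temperature=0.07):
    """V: (B, d) visual embeddings, A: (B, d) audio embeddings,
    W: (d, d) bilinear association matrix. Matched pairs share an index."""
    logits = (V @ W @ A.T) / temperature           # (B, B) pairwise scores
    targets = np.arange(len(V))                    # diagonal = positive pairs
    v2a = -log_softmax(logits, axis=1)[targets, targets].mean()  # V -> A
    a2v = -log_softmax(logits, axis=0)[targets, targets].mean()  # A -> V
    return 0.5 * (v2a + a2v)                       # symmetric bidirectional loss

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 512))
A = rng.normal(size=(4, 512))
W = np.eye(512)                                    # identity as a stand-in
loss = symmetric_infonce(V, A, W)
```

Averaging the two directions is what makes retrieval work both ways, which is why v2a_MRR and a2v_MRR end up nearly equal.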
Goal: The brain can describe what it perceives and answer questions
- /api/brain/describe — describe associations for an input
- /api/brain/ask — answer questions about learned patterns with cross-modal associations
- /api/brain/chat — conversational interface with personality
- Chat UI on the web dashboard (/chat page)
- Brain personality (speaks in first person about its associations)
- Conversation context (session-based history)
- Template-based voice engine (instant responses, no LLM latency)
- Stop word filtering + fuzzy stem matching for keyword search
- Cross-modal association retrieval (visual→audio bridge)
- YouTube video processing → brain perception description
Key insight: A template-based voice engine composes natural language from raw association scores instantly. The LLM approach was abandoned because CPU-only inference was too slow (~5 s/token even for a 0.5B model); the template engine still gives the brain a personality while responding immediately.
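A hypothetical sketch of the template idea: compose a first-person sentence directly from association scores, with no LLM in the loop. The templates, threshold, and field names here are illustrative assumptions, not the project's actual voice engine.

```python
# Templates keyed by association strength; the 0.5 threshold is an assumption.
TEMPLATES = {
    "strong": "When I see {concept}, I strongly expect to hear {sound} (score {score:.2f}).",
    "weak": "I have only a faint association between {concept} and {sound} (score {score:.2f}).",
}

def describe(concept, associations):
    """associations: list of (sound, score) pairs, highest score first."""
    if not associations:
        return f"I haven't formed any associations with {concept} yet."
    sound, score = associations[0]
    key = "strong" if score >= 0.5 else "weak"
    return TEMPLATES[key].format(concept=concept, sound=sound, score=score)

reply = describe("rain", [("pattering water", 0.83), ("thunder", 0.41)])
```

Because this is pure string formatting over scores already in memory, the response is effectively instant, which is the whole point of dropping the LLM.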
Goal: Process live audio/video and react instantly
- WebSocket endpoint for streaming audio
- Real-time encoding + association (< 2s latency)
- Live waveform visualization in browser
- Streaming association updates via SSE
- Browser microphone capture (Web Audio API)
- Live camera feed processing
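One way to keep streaming latency under the 2 s budget is to buffer incoming WebSocket chunks into fixed encoder windows. A minimal sketch, assuming 16 kHz mono PCM and 1-second analysis windows (both assumptions; Whisper-style encoders typically expect 16 kHz input):

```python
SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE  # 1 s of samples per encoder call (assumed window size)

class StreamBuffer:
    """Accumulates incoming audio chunks and yields fixed-size windows."""

    def __init__(self):
        self.samples: list[float] = []

    def push(self, chunk):
        """Append a chunk; return every complete window now available."""
        self.samples.extend(chunk)
        windows = []
        while len(self.samples) >= WINDOW:
            windows.append(self.samples[:WINDOW])
            self.samples = self.samples[WINDOW:]
        return windows  # each window would go to the audio encoder

buf = StreamBuffer()
out = buf.push([0.0] * 24_000)  # 1.5 s arrives -> one full window, 0.5 s kept
```

Emitting windows as soon as they fill keeps per-window latency bounded by the window length plus encoding time, independent of how ragged the incoming chunks are.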
Goal: Learn from new experiences, remember specific interactions
- Online gradient InfoNCE updates from new inputs
- Episodic memory store (specific interactions, not just aggregates)
- "What did you learn today?" summarization
- Learnable projection (replace random sparse projection)
- Expand beyond VGGSound — learn from user-provided content
- Memory consolidation (sleep-like offline replay)
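The episodic memory and sleep-like replay items could combine as follows. This is a hypothetical sketch: specific interactions are stored verbatim, then re-sampled later for offline gradient updates. The `Episode` fields and uniform replay policy are assumptions.

```python
import random
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Episode:
    """One specific interaction, kept verbatim rather than aggregated."""
    visual: Any
    audio: Any
    note: str  # e.g. "user uploaded a clip of a dog barking"

@dataclass
class EpisodicMemory:
    episodes: list = field(default_factory=list)

    def store(self, episode):
        self.episodes.append(episode)

    def replay(self, k, seed=None):
        """Sample k episodes (with replacement) for offline consolidation."""
        rng = random.Random(seed)
        return [rng.choice(self.episodes) for _ in range(k)]

mem = EpisodicMemory()
mem.store(Episode(visual=[0.1], audio=[0.2], note="dog barking"))
mem.store(Episode(visual=[0.3], audio=[0.4], note="rain on window"))
batch = mem.replay(4, seed=0)  # replay batch for offline InfoNCE updates
```

Replaying stored episodes interleaved with new data is a standard guard against catastrophic forgetting during online updates.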
Goal: Reason about relationships, not just similarities
- Graph of associations (beyond single bilinear matrix)
- Multi-hop reasoning: A sounds like B, B looks like C → A relates to C
- Concept formation — cluster similar associations into abstract categories
- Text embedding bridge (CLIP) for language-grounded queries
- Causal associations — "rain causes puddles" not just "rain co-occurs with puddles"
- Attention over association graph for complex queries
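The multi-hop idea ("A sounds like B, B looks like C → A relates to C") can be sketched over a weighted edge dictionary. The graph, concepts, and multiplicative hop-combination rule below are all illustrative assumptions:

```python
# Directed association edges with strengths; contents are illustrative.
graph = {
    ("rain", "pattering"): 0.9,      # rain sounds like pattering
    ("pattering", "puddles"): 0.8,   # pattering co-occurs with puddles
}

def two_hop(graph, src, dst):
    """Best src -> mid -> dst path score, multiplying edge weights per hop."""
    best = 0.0
    for (a, b), w1 in graph.items():
        if a != src:
            continue
        w2 = graph.get((b, dst), 0.0)
        best = max(best, w1 * w2)
    return best

score = two_hop(graph, "rain", "puddles")   # 0.9 * 0.8 via "pattering"
```

Multiplying hop weights makes long chains decay naturally, so weak intermediate links cannot manufacture strong conclusions.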
Goal: Self-directed exploration and communication
- Initiate conversation based on interesting patterns discovered
- Self-reflection on matrix changes ("I learned something new about...")
- Goal-directed perception — seek out inputs that fill knowledge gaps
- Multi-agent interaction — share associations with other brain instances
- Report mutation results and discoveries in natural language
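Goal-directed perception could start from something as simple as ranking concepts by how weak their strongest association is, then seeking input for the least-understood one first. A minimal sketch with invented scores:

```python
def knowledge_gaps(best_scores):
    """best_scores: {concept: strongest association score for that concept}.
    Returns concepts ordered weakest-first, i.e. biggest knowledge gaps first."""
    return sorted(best_scores, key=best_scores.get)

# Illustrative scores: the brain knows rain well, fog barely at all.
gaps = knowledge_gaps({"rain": 0.83, "fog": 0.12, "traffic": 0.55})
```

The head of this list is what the brain would ask about, or seek video of, next.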