**New file: `DIGEST_SYSTEM.md`** (+145 lines)
# Autoresearch Daily Digest System

This document describes the daily research digest and leaderboard system for autoresearch experiments.

## Overview

The digest system provides:
- **📊 Run leaderboard** by complexity-adjusted score (val_bpb + parameter penalty)
- **📈 Trend charts** showing validation BPB over time
- **📋 Ready-for-review summary** templates for quick status updates

## Components

### 1. `log_run.py` - Results Logger
Captures experiment results and appends to `results.tsv`.

```bash
# After running training
uv run train.py 2>&1 | tee last_run.log
python3 log_run.py last_run.log

# Or log manually
python3 log_run.py --val_bpb 2.345 --num_params_M 12.5 --status success
```

### 2. `digest.py` - Report Generator
Generates HTML digest reports from results data.

```bash
# Generate today's digest
python3 digest.py

# Generate for specific date
python3 digest.py --date 2026-03-15 --output digest_2026-03-15.html
```

### 3. `experiment_runner.py` - Automation
High-level interface for running experiments and generating digests.

```bash
# Run single experiment
python3 experiment_runner.py run

# Generate digest
python3 experiment_runner.py digest

# Run autonomous research loop
python3 experiment_runner.py loop --max-runs 10 --pause 10
```

## Results Schema

The `results.tsv` file contains columns:
- `timestamp`: ISO format timestamp
- `commit_hash`: Git commit (short hash)
- `val_bpb`: Validation bits per byte (main metric)
- `num_params_M`: Number of parameters in millions
- `training_seconds`: Training time
- `total_seconds`: Total runtime
- `peak_vram_mb`: Peak VRAM usage
- `mfu_percent`: Model FLOP utilization percentage
- `total_tokens_M`: Total tokens processed
- `num_steps`: Training steps completed
- `status`: success/crash/timeout
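
A row in this schema can be appended with a few lines of stdlib Python. The sketch below mirrors what `log_run.py` plausibly does; the `append_result` helper name is an assumption, while the column names come from the schema above:

```python
import csv
import datetime
import os

# Column order matches the results.tsv schema documented above.
COLUMNS = [
    "timestamp", "commit_hash", "val_bpb", "num_params_M",
    "training_seconds", "total_seconds", "peak_vram_mb",
    "mfu_percent", "total_tokens_M", "num_steps", "status",
]

def append_result(path, **fields):
    """Append one run to the TSV, writing the header row on first use."""
    row = {col: fields.get(col, "") for col in COLUMNS}
    if not row["timestamp"]:
        row["timestamp"] = datetime.datetime.now().isoformat(timespec="seconds")
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Missing fields are left blank rather than rejected, so crashed runs can still be logged with just `val_bpb`-free metadata and `status="crash"`.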

## Leaderboard Scoring

### Complexity-Adjusted Score
```
score = val_bpb + (num_params_M / 1000) * 0.001
```
Lower scores are better. This penalizes larger models slightly to encourage efficient architectures.

### Ranking Strategy
1. **Primary**: Complexity-adjusted score (lower is better)
2. **Filter**: Only successful runs (`status == "success"`)
3. **Tie-breaker**: Raw val_bpb
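
Put together, the ranking rule is a filter plus a two-part sort key. A minimal sketch (field names follow the `results.tsv` schema; the run values are made up for illustration):

```python
# Illustrative runs; fields follow the results.tsv schema, values are made up.
runs = [
    {"commit_hash": "a1b2c3d", "val_bpb": 2.345, "num_params_M": 12.5, "status": "success"},
    {"commit_hash": "e4f5a6b", "val_bpb": 2.340, "num_params_M": 25.0, "status": "success"},
    {"commit_hash": "c7d8e9f", "val_bpb": 2.100, "num_params_M": 10.0, "status": "crash"},
]

def score(run):
    # Complexity-adjusted score: val_bpb plus a small parameter penalty.
    return run["val_bpb"] + (run["num_params_M"] / 1000) * 0.001

leaderboard = sorted(
    (r for r in runs if r["status"] == "success"),  # successful runs only
    key=lambda r: (score(r), r["val_bpb"]),         # tie-break on raw val_bpb
)
```

Note that the crashed run is excluded even though its raw `val_bpb` is the lowest: the filter runs before the sort.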

## Daily Workflow

### For Human Researchers
```bash
# Run experiment and log results
python3 experiment_runner.py run

# Generate morning digest
python3 experiment_runner.py digest
```

### For Autonomous Agents
```bash
# Run overnight research loop
python3 experiment_runner.py loop --max-runs 20 --pause 15
```

## HTML Report Features

- **Summary metrics**: Total/successful/failed runs, best val_bpb
- **Ranked leaderboard**: Top 10 runs by complexity-adjusted score
- **Visual indicators**: Best run highlighting, status colors
- **Trend visualization**: val_bpb over time (requires matplotlib)
- **Ready-for-review template**: Structured summary for quick reporting
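
The exact markup `digest.py` emits isn't reproduced here, but the summary-metrics block needs nothing beyond stdlib string formatting. A hedged sketch (the `render_summary` name and the HTML structure are assumptions):

```python
def render_summary(runs):
    """Render the summary-metrics block of the digest as an HTML list."""
    ok = [r for r in runs if r["status"] == "success"]
    best = min((r["val_bpb"] for r in ok), default=None)
    return (
        "<ul>"
        f"<li>Total runs: {len(runs)}</li>"
        f"<li>Successful: {len(ok)}</li>"
        f"<li>Failed: {len(runs) - len(ok)}</li>"
        f"<li>Best val_bpb: {best}</li>"
        "</ul>"
    )
```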

## Integration with Autonomous Research

The digest system fits into the autoresearch autonomous loop:

1. **Agent modifies** `train.py`
2. **System runs** training (`uv run train.py`)
3. **Logger captures** results (`log_run.py`)
4. **Policy engine decides** keep/discard (`policy_engine.py`)
5. **Digest generates** progress report (`digest.py`)

This creates an audit trail of all experiments with rich analytics for tracking research progress.

## Dependencies

- **Core**: Python 3.10+, no additional deps
- **Charts**: matplotlib (optional, for trend plots)
- **Training**: Same as autoresearch (PyTorch, etc.)

```bash
# For chart generation
pip install matplotlib
```

## File Organization

```
autoresearch/
├── digest.py # Report generator
├── log_run.py # Results logger
├── experiment_runner.py # High-level automation
├── results.tsv # Results database (gitignored)
└── digest_*.html # Generated reports
```

## Example Usage in Cron Jobs

```bash
# Run autonomous research and generate morning digest
0 6 * * * cd /path/to/autoresearch && python3 experiment_runner.py loop --max-runs 5 && python3 experiment_runner.py digest
```

The digest system enables both human oversight and autonomous research tracking at scale.
---

**New file: `SMALL_COMPUTE_ADAPTATION.md`** (+172 lines)
# Small-Compute Adaptation Plan (Mac-First)

**Target:** Non-H100 environments, prioritizing MacBooks with Apple Silicon

This document provides a systematic adaptation strategy for running autoresearch on small compute environments, based on the guidance from the main README and analysis of notable platform forks.

## Parameter Downsizing Matrix

### Core Architecture Parameters

| Environment | DEPTH | vocab_size | MAX_SEQ_LEN | TOTAL_BATCH_SIZE | DEVICE_BATCH_SIZE | Window Pattern |
|-------------|-------|------------|-------------|------------------|-------------------|----------------|
| **H100 (Baseline)** | 8 | 8192 | 2048 | 2^19 (~524K) | 128 | SSSL |
| **MacBook Pro M1/M2** | 4 | 4096 | 512 | 2^16 (~65K) | 32 | L |
| **MacBook Pro M3/M4** | 6 | 4096 | 768 | 2^17 (~131K) | 64 | L |
| **MacBook Air** | 3 | 2048 | 256 | 2^15 (~32K) | 16 | L |
| **CPU Only** | 2 | 1024 | 256 | 2^14 (~16K) | 8 | L |
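
For scripting, the matrix rows can be kept as a lookup table so an agent or launcher can select a config by environment name. A sketch with values transcribed from the table above (the `CONFIGS` dict and its keys are assumptions, not part of the codebase):

```python
# Downsizing matrix as a lookup table; values transcribed from the table above.
CONFIGS = {
    "h100":           dict(depth=8, vocab_size=8192, max_seq_len=2048,
                           total_batch=2**19, device_batch=128, window="SSSL"),
    "macbook_pro_m1": dict(depth=4, vocab_size=4096, max_seq_len=512,
                           total_batch=2**16, device_batch=32, window="L"),
    "macbook_pro_m3": dict(depth=6, vocab_size=4096, max_seq_len=768,
                           total_batch=2**17, device_batch=64, window="L"),
    "macbook_air":    dict(depth=3, vocab_size=2048, max_seq_len=256,
                           total_batch=2**15, device_batch=16, window="L"),
    "cpu":            dict(depth=2, vocab_size=1024, max_seq_len=256,
                           total_batch=2**14, device_batch=8, window="L"),
}

cfg = CONFIGS["macbook_pro_m1"]
```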

### Rationale for Downsizing

**DEPTH Reduction**: Primary complexity knob. Reducing from 8→4 layers cuts ~50% of compute while maintaining reasonable learning capacity.

**vocab_size**: Smaller vocabulary (8192→4096) reduces embedding table memory and final layer computation. Consider byte-level tokenization (256) for ultra-low resource.

**MAX_SEQ_LEN**: The dramatic reduction (2048→512) cuts the quadratic attention cost, since attention memory scales as O(seq_len²).

**TOTAL_BATCH_SIZE**: Maintain powers of 2. Smaller batches mean noisier gradients but faster iteration.

**Window Pattern**: The "SSSL" alternating pattern is inefficient on small compute; "L" (local attention only) is simpler and faster.
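
The quadratic sequence-length claim is easy to sanity-check with arithmetic; a tiny sketch:

```python
def relative_attention_memory(seq_len, baseline=2048):
    """Attention score memory is O(seq_len^2), so cost relative to the
    H100 baseline context falls with the square of the reduction."""
    return (seq_len / baseline) ** 2

# 2048 -> 512 shrinks the attention score matrix to 1/16 of baseline;
# 2048 -> 256 shrinks it to 1/64.
savings = {s: relative_attention_memory(s) for s in (512, 256)}
```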

## Dataset Recommendations

### Primary: TinyStories (Low Entropy)
```python
# Replace in prepare.py for small compute
BASE_URL = "https://huggingface.co/datasets/karpathy/tinystories-gpt4-clean"
```

**Why TinyStories**: GPT-4 generated short stories with narrow scope. Much lower entropy than web text means reasonable results with smaller models.

**Expected improvement**: 2-3x better perplexity on small models vs. general web text.

### Alternative Datasets by Compute Level

| Compute Level | Dataset | Characteristics |
|---------------|---------|-----------------|
| **MacBook Pro** | TinyStories | Low entropy, coherent sampling |
| **MacBook Air** | Simple Wikipedia | Medium complexity, factual |
| **CPU Only** | Children's books corpus | Very simple language patterns |

### Evaluation Token Adjustment

```python
# In prepare.py, reduce evaluation size for small compute (pick one)
EVAL_TOKENS = 10 * 524288   # 25% of the original, for MacBook
# EVAL_TOKENS = 5 * 524288  # 12.5%, for ultra-low resource
```

## Expected Throughput & Quality Envelope

### MacBook Pro M1 Max (32GB) Baseline Config
```
DEPTH = 4, vocab_size = 4096, MAX_SEQ_LEN = 512
```

**Expected Performance**:
- **Training speed**: ~2-3 minutes per experiment (vs 5min on H100)
- **Experiments per night**: ~160-240 (vs 96 on H100)
- **Memory usage**: ~8-12GB
- **Quality**: 0.5-1.0 bpb higher than H100 equivalent
- **Sample coherence**: Good on TinyStories, poor on web text

### MacBook Air M2 (16GB) Conservative Config
```
DEPTH = 3, vocab_size = 2048, MAX_SEQ_LEN = 256
```

**Expected Performance**:
- **Training speed**: ~90 seconds per experiment
- **Experiments per night**: ~320
- **Memory usage**: ~4-6GB
- **Quality**: 1.5-2.0 bpb higher than H100
- **Sample coherence**: Requires simple datasets

### Performance Scaling Estimates

| Metric | MacBook Pro M3 | MacBook Air M2 | CPU (16-core) |
|--------|----------------|----------------|---------------|
| **Tokens/sec** | ~20K | ~8K | ~1K |
| **Model params** | ~25M | ~8M | ~2M |
| **Peak memory** | ~12GB | ~6GB | ~4GB |
| **Experiments/hour** | ~20 | ~40 | ~4 |

## Implementation Strategy

### Phase 1: MacBook Pro Adaptation (Priority)

1. **Fork Selection**: Start with [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) as base
2. **Parameter Update**: Apply MacBook Pro config from matrix above
3. **Dataset Switch**: Implement TinyStories dataset download
4. **Memory Optimization**: Add memory monitoring and automatic batch size reduction
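
Step 4's automatic batch-size reduction can be sketched without any dependencies. The function name and the per-sample memory estimate are assumptions; the 80% budget matches the success metric of staying under 80% of available RAM:

```python
def fit_device_batch_size(requested, available_gb, gb_per_sample, floor=1):
    """Halve the device batch size until a rough activation-memory estimate
    fits within 80% of available RAM (the deployment success target)."""
    budget_gb = 0.8 * available_gb
    batch = requested
    while batch > floor and batch * gb_per_sample > budget_gb:
        batch //= 2
    return batch

# A 16 GB Mac at ~0.5 GB/sample halves 32 -> 16; a 4 GB budget drops it to 4.
```

Halving keeps the batch a power of 2, consistent with the TOTAL_BATCH_SIZE guidance above.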

### Phase 2: MacBook Air Support

1. **Memory Constraints**: Implement dynamic memory detection
2. **Ultra-low Config**: Test 3-layer models with byte-level tokenization
3. **Checkpoint Strategy**: More frequent saves due to thermal throttling risk

### Phase 3: Quality Validation

1. **Baseline Comparison**: Run identical configs on H100 vs MacBook
2. **Convergence Analysis**: Document quality vs speed tradeoffs
3. **Sample Quality**: Human evaluation of generated text across configs

## Notable Fork Analysis

### [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos)
- **Focus**: Metal Performance Shaders (MPS) backend
- **Status**: Active, good Mac compatibility
- **Recommendation**: Primary base fork

### [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx)
- **Focus**: Apple MLX framework
- **Advantage**: Native Apple Silicon optimization
- **Risk**: Newer/less tested framework
- **Recommendation**: Experimental track

### [jsegov/autoresearch-win-rtx](https://github.com/jsegov/autoresearch-win-rtx)
- **Focus**: Windows + RTX GPUs
- **Relevance**: Good for cross-platform parameter validation
- **Recommendation**: Reference for GPU memory strategies

## Configuration Templates

### MacBook Pro Template (train.py changes)
```python
# Optimized for M1/M2 MacBook Pro
TOTAL_BATCH_SIZE = 2**16 # ~65K tokens
DEPTH = 4 # half the layers
DEVICE_BATCH_SIZE = 32 # reduced for MPS
WINDOW_PATTERN = "L" # local attention only
```

### MacBook Pro Template (prepare.py changes)
```python
MAX_SEQ_LEN = 512 # quarter the context
EVAL_TOKENS = 10 * 524288 # faster evaluation
VOCAB_SIZE = 4096 # half the vocabulary
```

## Success Metrics

**Deployment Success**:
- [ ] Successful 5-minute training run without OOM
- [ ] Agent can iterate autonomously overnight
- [ ] Memory usage under 80% of available RAM

**Quality Success**:
- [ ] val_bpb within 1.0 of equivalent H100 config
- [ ] Generated samples show coherent language patterns
- [ ] Research progress measurable over 50+ experiments

## Next Steps

1. **Immediate**: Create working MacBook Pro config using parameter matrix
2. **Week 1**: Validate quality envelope with TinyStories dataset
3. **Week 2**: Optimize for overnight autonomous research sessions
4. **Month 1**: Scale to MacBook Air and document lessons learned

---

This adaptation plan prioritizes getting functional autonomous research on MacBooks first, with clear parameters for expected performance tradeoffs. The goal is speed of iteration over absolute quality, letting the agent discover optimal configs within the smaller search space.