**New file: `DIGEST_SYSTEM.md`** (+145 lines)
# Autoresearch Daily Digest System

This document describes the daily research digest and leaderboard system for autoresearch experiments.

## Overview

The digest system provides:
- **📊 Run leaderboard** by complexity-adjusted score (val_bpb + parameter penalty)
- **📈 Trend charts** showing validation BPB over time
- **📋 Ready-for-review summary** templates for quick status updates

## Components

### 1. `log_run.py` - Results Logger
Captures experiment results and appends to `results.tsv`.

```bash
# After running training
uv run train.py 2>&1 | tee last_run.log
python3 log_run.py last_run.log

# Or log manually
python3 log_run.py --val_bpb 2.345 --num_params_M 12.5 --status success
```

### 2. `digest.py` - Report Generator
Generates HTML digest reports from results data.

```bash
# Generate today's digest
python3 digest.py

# Generate for specific date
python3 digest.py --date 2026-03-15 --output digest_2026-03-15.html
```

### 3. `experiment_runner.py` - Automation
High-level interface for running experiments and generating digests.

```bash
# Run single experiment
python3 experiment_runner.py run

# Generate digest
python3 experiment_runner.py digest

# Run autonomous research loop
python3 experiment_runner.py loop --max-runs 10 --pause 10
```

## Results Schema

The `results.tsv` file contains columns:
- `timestamp`: ISO format timestamp
- `commit_hash`: Git commit (short hash)
- `val_bpb`: Validation bits per byte (main metric)
- `num_params_M`: Number of parameters in millions
- `training_seconds`: Training time
- `total_seconds`: Total runtime
- `peak_vram_mb`: Peak VRAM usage
- `mfu_percent`: Model FLOP utilization percentage
- `total_tokens_M`: Total tokens processed
- `num_steps`: Training steps completed
- `status`: success/crash/timeout
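
A row in this schema can be appended with a few lines of stdlib Python. The sketch below mirrors what `log_run.py` plausibly does; the `append_result` helper name is an assumption, while the column names come from the schema above:

```python
import csv
import datetime
import os

# Column order matches the results.tsv schema documented above.
COLUMNS = [
    "timestamp", "commit_hash", "val_bpb", "num_params_M",
    "training_seconds", "total_seconds", "peak_vram_mb",
    "mfu_percent", "total_tokens_M", "num_steps", "status",
]

def append_result(path, **fields):
    """Append one run to the TSV, writing the header row on first use."""
    row = {col: fields.get(col, "") for col in COLUMNS}
    if not row["timestamp"]:
        row["timestamp"] = datetime.datetime.now().isoformat(timespec="seconds")
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Missing fields are left blank rather than rejected, so crashed runs can still be logged with just `val_bpb`-free metadata and `status="crash"`.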

## Leaderboard Scoring

### Complexity-Adjusted Score
```
score = val_bpb + (num_params_M / 1000) * 0.001
```
Lower scores are better. This penalizes larger models slightly to encourage efficient architectures.

### Ranking Strategy
1. **Primary**: Complexity-adjusted score (lower is better)
2. **Filter**: Only successful runs (`status == "success"`)
3. **Tie-breaker**: Raw val_bpb
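
Put together, the ranking rule is a filter plus a two-part sort key. A minimal sketch (field names follow the `results.tsv` schema; the run values are made up for illustration):

```python
# Illustrative runs; fields follow the results.tsv schema, values are made up.
runs = [
    {"commit_hash": "a1b2c3d", "val_bpb": 2.345, "num_params_M": 12.5, "status": "success"},
    {"commit_hash": "e4f5a6b", "val_bpb": 2.340, "num_params_M": 25.0, "status": "success"},
    {"commit_hash": "c7d8e9f", "val_bpb": 2.100, "num_params_M": 10.0, "status": "crash"},
]

def score(run):
    # Complexity-adjusted score: val_bpb plus a small parameter penalty.
    return run["val_bpb"] + (run["num_params_M"] / 1000) * 0.001

leaderboard = sorted(
    (r for r in runs if r["status"] == "success"),  # successful runs only
    key=lambda r: (score(r), r["val_bpb"]),         # tie-break on raw val_bpb
)
```

Note that the crashed run is excluded even though its raw `val_bpb` is the lowest: the filter runs before the sort.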

## Daily Workflow

### For Human Researchers
```bash
# Run experiment and log results
python3 experiment_runner.py run

# Generate morning digest
python3 experiment_runner.py digest
```

### For Autonomous Agents
```bash
# Run overnight research loop
python3 experiment_runner.py loop --max-runs 20 --pause 15
```

## HTML Report Features

- **Summary metrics**: Total/successful/failed runs, best val_bpb
- **Ranked leaderboard**: Top 10 runs by complexity-adjusted score
- **Visual indicators**: Best run highlighting, status colors
- **Trend visualization**: val_bpb over time (requires matplotlib)
- **Ready-for-review template**: Structured summary for quick reporting
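
The exact markup `digest.py` emits isn't reproduced here, but the summary-metrics block needs nothing beyond stdlib string formatting. A hedged sketch (the `render_summary` name and the HTML structure are assumptions):

```python
def render_summary(runs):
    """Render the summary-metrics block of the digest as an HTML list."""
    ok = [r for r in runs if r["status"] == "success"]
    best = min((r["val_bpb"] for r in ok), default=None)
    return (
        "<ul>"
        f"<li>Total runs: {len(runs)}</li>"
        f"<li>Successful: {len(ok)}</li>"
        f"<li>Failed: {len(runs) - len(ok)}</li>"
        f"<li>Best val_bpb: {best}</li>"
        "</ul>"
    )
```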

## Integration with Autonomous Research

The digest system fits into the autoresearch autonomous loop:

1. **Agent modifies** `train.py`
2. **System runs** training (`uv run train.py`)
3. **Logger captures** results (`log_run.py`)
4. **Policy engine decides** keep/discard (`policy_engine.py`)
5. **Digest generates** progress report (`digest.py`)

This creates an audit trail of all experiments with rich analytics for tracking research progress.

## Dependencies

- **Core**: Python 3.10+, no additional deps
- **Charts**: matplotlib (optional, for trend plots)
- **Training**: Same as autoresearch (PyTorch, etc.)

```bash
# For chart generation
pip install matplotlib
```

## File Organization

```
autoresearch/
├── digest.py # Report generator
├── log_run.py # Results logger
├── experiment_runner.py # High-level automation
├── results.tsv # Results database (gitignored)
└── digest_*.html # Generated reports
```

## Example Usage in Cron Jobs

```bash
# Run autonomous research and generate morning digest
0 6 * * * cd /path/to/autoresearch && python3 experiment_runner.py loop --max-runs 5 && python3 experiment_runner.py digest
```

The digest system enables both human oversight and autonomous research tracking at scale.
---

**New file: `SMALL_COMPUTE_ADAPTATION.md`** (+172 lines)
# Small-Compute Adaptation Plan (Mac-First)

**Target:** Non-H100 environments, prioritizing MacBooks with Apple Silicon

This document provides a systematic adaptation strategy for running autoresearch on small compute environments, based on the guidance from the main README and analysis of notable platform forks.

## Parameter Downsizing Matrix

### Core Architecture Parameters

| Environment | DEPTH | vocab_size | MAX_SEQ_LEN | TOTAL_BATCH_SIZE | DEVICE_BATCH_SIZE | Window Pattern |
|-------------|-------|------------|-------------|------------------|-------------------|----------------|
| **H100 (Baseline)** | 8 | 8192 | 2048 | 2^19 (~524K) | 128 | SSSL |
| **MacBook Pro M1/M2** | 4 | 4096 | 512 | 2^16 (~65K) | 32 | L |
| **MacBook Pro M3/M4** | 6 | 4096 | 768 | 2^17 (~131K) | 64 | L |
| **MacBook Air** | 3 | 2048 | 256 | 2^15 (~32K) | 16 | L |
| **CPU Only** | 2 | 1024 | 256 | 2^14 (~16K) | 8 | L |
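
For scripting, the matrix rows can be kept as a lookup table so an agent or launcher can select a config by environment name. A sketch with values transcribed from the table above (the `CONFIGS` dict and its keys are assumptions, not part of the codebase):

```python
# Downsizing matrix as a lookup table; values transcribed from the table above.
CONFIGS = {
    "h100":           dict(depth=8, vocab_size=8192, max_seq_len=2048,
                           total_batch=2**19, device_batch=128, window="SSSL"),
    "macbook_pro_m1": dict(depth=4, vocab_size=4096, max_seq_len=512,
                           total_batch=2**16, device_batch=32, window="L"),
    "macbook_pro_m3": dict(depth=6, vocab_size=4096, max_seq_len=768,
                           total_batch=2**17, device_batch=64, window="L"),
    "macbook_air":    dict(depth=3, vocab_size=2048, max_seq_len=256,
                           total_batch=2**15, device_batch=16, window="L"),
    "cpu":            dict(depth=2, vocab_size=1024, max_seq_len=256,
                           total_batch=2**14, device_batch=8, window="L"),
}

cfg = CONFIGS["macbook_pro_m1"]
```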

### Rationale for Downsizing

**DEPTH Reduction**: Primary complexity knob. Reducing from 8→4 layers cuts ~50% of compute while maintaining reasonable learning capacity.

**vocab_size**: Smaller vocabulary (8192→4096) reduces embedding table memory and final layer computation. Consider byte-level tokenization (256) for ultra-low resource.

**MAX_SEQ_LEN**: The dramatic reduction (2048→512) cuts the quadratic attention cost, since attention memory scales as O(seq_len²).

**TOTAL_BATCH_SIZE**: Maintain powers of 2. Smaller batches mean noisier gradients but faster iteration.

**Window Pattern**: The "SSSL" alternating pattern is inefficient on small compute; "L" (local attention only) is simpler and faster.
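
The quadratic sequence-length claim is easy to sanity-check with arithmetic; a tiny sketch:

```python
def relative_attention_memory(seq_len, baseline=2048):
    """Attention score memory is O(seq_len^2), so cost relative to the
    H100 baseline context falls with the square of the reduction."""
    return (seq_len / baseline) ** 2

# 2048 -> 512 shrinks the attention score matrix to 1/16 of baseline;
# 2048 -> 256 shrinks it to 1/64.
savings = {s: relative_attention_memory(s) for s in (512, 256)}
```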

## Dataset Recommendations

### Primary: TinyStories (Low Entropy)
```python
# Replace in prepare.py for small compute
BASE_URL = "https://huggingface.co/datasets/karpathy/tinystories-gpt4-clean"
```

**Why TinyStories**: GPT-4 generated short stories with narrow scope. Much lower entropy than web text means reasonable results with smaller models.

**Expected improvement**: 2-3x better perplexity on small models vs. general web text.

### Alternative Datasets by Compute Level

| Compute Level | Dataset | Characteristics |
|---------------|---------|-----------------|
| **MacBook Pro** | TinyStories | Low entropy, coherent sampling |
| **MacBook Air** | Simple Wikipedia | Medium complexity, factual |
| **CPU Only** | Children's books corpus | Very simple language patterns |

### Evaluation Token Adjustment

```python
# In prepare.py, reduce evaluation size for small compute (pick one)
EVAL_TOKENS = 10 * 524288   # 25% of the original, for MacBook
# EVAL_TOKENS = 5 * 524288  # 12.5%, for ultra-low resource
```

## Expected Throughput & Quality Envelope

### MacBook Pro M1 Max (32GB) Baseline Config
```
DEPTH = 4, vocab_size = 4096, MAX_SEQ_LEN = 512
```

**Expected Performance**:
- **Training speed**: ~2-3 minutes per experiment (vs 5min on H100)
- **Experiments per night**: ~160-240 (vs 96 on H100)
- **Memory usage**: ~8-12GB
- **Quality**: 0.5-1.0 bpb higher than H100 equivalent
- **Sample coherence**: Good on TinyStories, poor on web text

### MacBook Air M2 (16GB) Conservative Config
```
DEPTH = 3, vocab_size = 2048, MAX_SEQ_LEN = 256
```

**Expected Performance**:
- **Training speed**: ~90 seconds per experiment
- **Experiments per night**: ~320
- **Memory usage**: ~4-6GB
- **Quality**: 1.5-2.0 bpb higher than H100
- **Sample coherence**: Requires simple datasets

### Performance Scaling Estimates

| Metric | MacBook Pro M3 | MacBook Air M2 | CPU (16-core) |
|--------|----------------|----------------|---------------|
| **Tokens/sec** | ~20K | ~8K | ~1K |
| **Model params** | ~25M | ~8M | ~2M |
| **Peak memory** | ~12GB | ~6GB | ~4GB |
| **Experiments/hour** | ~20 | ~40 | ~4 |

## Implementation Strategy

### Phase 1: MacBook Pro Adaptation (Priority)

1. **Fork Selection**: Start with [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) as base
2. **Parameter Update**: Apply MacBook Pro config from matrix above
3. **Dataset Switch**: Implement TinyStories dataset download
4. **Memory Optimization**: Add memory monitoring and automatic batch size reduction
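
Step 4's automatic batch-size reduction can be sketched without any dependencies. The function name and the per-sample memory estimate are assumptions; the 80% budget matches the success metric of staying under 80% of available RAM:

```python
def fit_device_batch_size(requested, available_gb, gb_per_sample, floor=1):
    """Halve the device batch size until a rough activation-memory estimate
    fits within 80% of available RAM (the deployment success target)."""
    budget_gb = 0.8 * available_gb
    batch = requested
    while batch > floor and batch * gb_per_sample > budget_gb:
        batch //= 2
    return batch

# A 16 GB Mac at ~0.5 GB/sample halves 32 -> 16; a 4 GB budget drops it to 4.
```

Halving keeps the batch a power of 2, consistent with the TOTAL_BATCH_SIZE guidance above.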

### Phase 2: MacBook Air Support

1. **Memory Constraints**: Implement dynamic memory detection
2. **Ultra-low Config**: Test 3-layer models with byte-level tokenization
3. **Checkpoint Strategy**: More frequent saves due to thermal throttling risk

### Phase 3: Quality Validation

1. **Baseline Comparison**: Run identical configs on H100 vs MacBook
2. **Convergence Analysis**: Document quality vs speed tradeoffs
3. **Sample Quality**: Human evaluation of generated text across configs

## Notable Fork Analysis

### [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos)
- **Focus**: Metal Performance Shaders (MPS) backend
- **Status**: Active, good Mac compatibility
- **Recommendation**: Primary base fork

### [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx)
- **Focus**: Apple MLX framework
- **Advantage**: Native Apple Silicon optimization
- **Risk**: Newer/less tested framework
- **Recommendation**: Experimental track

### [jsegov/autoresearch-win-rtx](https://github.com/jsegov/autoresearch-win-rtx)
- **Focus**: Windows + RTX GPUs
- **Relevance**: Good for cross-platform parameter validation
- **Recommendation**: Reference for GPU memory strategies

## Configuration Templates

### MacBook Pro Template (train.py changes)
```python
# Optimized for M1/M2 MacBook Pro
TOTAL_BATCH_SIZE = 2**16 # ~65K tokens
DEPTH = 4 # half the layers
DEVICE_BATCH_SIZE = 32 # reduced for MPS
WINDOW_PATTERN = "L" # local attention only
```

### MacBook Pro Template (prepare.py changes)
```python
MAX_SEQ_LEN = 512 # quarter the context
EVAL_TOKENS = 10 * 524288 # faster evaluation
VOCAB_SIZE = 4096 # half the vocabulary
```

## Success Metrics

**Deployment Success**:
- [ ] Successful 5-minute training run without OOM
- [ ] Agent can iterate autonomously overnight
- [ ] Memory usage under 80% of available RAM

**Quality Success**:
- [ ] val_bpb within 1.0 of equivalent H100 config
- [ ] Generated samples show coherent language patterns
- [ ] Research progress measurable over 50+ experiments

## Next Steps

1. **Immediate**: Create working MacBook Pro config using parameter matrix
2. **Week 1**: Validate quality envelope with TinyStories dataset
3. **Week 2**: Optimize for overnight autonomous research sessions
4. **Month 1**: Scale to MacBook Air and document lessons learned

---

This adaptation plan prioritizes getting functional autonomous research on MacBooks first, with clear parameters for expected performance tradeoffs. The goal is speed of iteration over absolute quality, letting the agent discover optimal configs within the smaller search space.