Experiments in learning interpretable text prompts via optimization in SONAR embedding space.
Two-stage optimization system:
- Stage 1: Optimize a z embedding vector that generates prompt tokens via SONAR decoder
- Stage 2: Use the generated prompt with z=0 (unconditioned) to solve tasks
Key technique: Straight-through gradient estimation via embedding geometry.
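The straight-through trick can be sketched as follows. This is an illustrative sketch, not the repo's implementation: `straight_through_embed` is a hypothetical helper, and `E` stands in for the decoder's token-embedding matrix. Forward, the continuous vector is snapped to its nearest token embedding; backward, gradients flow to the vector as if the snap were the identity.

```python
import torch

def straight_through_embed(h, E):
    """Snap h to its nearest row of E on the forward pass, while the
    backward pass treats the mapping as identity (gradients flow to h).
    h: (d,) continuous vector; E: (vocab, d) token embedding matrix."""
    with torch.no_grad():
        idx = torch.cdist(h.unsqueeze(0), E).argmin()  # nearest token id
        e_hard = E[idx]
    # forward value is e_hard; the detach() blocks gradients through the snap
    return h + (e_hard - h).detach(), idx

# toy check: the gradient w.r.t. h is identity despite the hard snap
E = torch.randn(10, 4)
h = torch.randn(4, requires_grad=True)
e, idx = straight_through_embed(h, E)
e.sum().backward()
```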
PPL (perplexity) regularization with weight 0.1 stabilizes optimization without dominating the task loss. The optimizer reaches 67% accuracy on an antonym completion task; certain pairs (hot -> cold, happy -> sad) consistently fail.
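A minimal sketch of how such a combined objective could look. The function name, argument shapes, and the frozen-LM scoring setup are assumptions for illustration; only the 0.1 weight comes from the notes above.

```python
import torch
import torch.nn.functional as F

PPL_WEIGHT = 0.1  # fluency penalty stays small relative to the task loss

def combined_loss(task_logits, task_targets, lm_logits, prompt_ids):
    """Hypothetical objective: task loss plus a perplexity penalty.
    lm_logits: (T, vocab) next-token logits from a frozen LM over the prompt;
    prompt_ids: (T,) the generated prompt token ids."""
    task_loss = F.cross_entropy(task_logits, task_targets)
    # mean negative log-likelihood of the prompt == log(perplexity)
    ppl_loss = F.cross_entropy(lm_logits[:-1], prompt_ids[1:])
    return task_loss + PPL_WEIGHT * ppl_loss

# toy example with random tensors, illustrative only
torch.manual_seed(0)
loss = combined_loss(torch.randn(8, 5), torch.randint(0, 5, (8,)),
                     torch.randn(6, 50), torch.randint(0, 50, (6,)))
```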
Run the optimization:

```
uv run python scripts/optimize_prompt.py
```

Repository layout:

```
scripts/
  optimize_prompt.py   # Main optimization script
src/prompt_interp/     # Package stub
papers/                # Reference papers (SONAR, EPO, ContextBench)
```
- Python 3.12+
- CUDA GPU
- SONAR (`sonar-space`)
- PyTorch
Install dependencies:

```
uv sync
```

Ideas to try:
- [done] Add perplexity term.
- [done] Each iteration, update z to be the re-encoding.
- PCA with different learning rates.
- Test if jailbreaks transfer to normal TinyStories models.
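The re-encoding idea above can be sketched as a projection step: decode z to text, re-encode that text, and overwrite z, keeping it on the manifold of embeddings of real sentences. `decode` and `encode` below are stand-ins for the SONAR decoder/encoder; their signatures are assumptions.

```python
import torch

def reencode_step(z, decode, encode):
    """Project z onto the manifold of real sentence embeddings by
    round-tripping through text. In-place copy keeps the optimizer's
    reference to z valid. `decode`/`encode` are hypothetical stand-ins
    for the SONAR decoder and encoder."""
    with torch.no_grad():
        text = decode(z)       # embedding -> prompt text
        z.copy_(encode(text))  # text -> embedding, written back into z
    return text

# toy check with stand-in encode/decode functions
z = torch.zeros(4)
text = reencode_step(z, lambda v: "hello world", lambda s: torch.ones(4))
```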