Extend a small GPT-style tokenizer with curated Finnegans Wake morphemes, initialise new vectors by morpheme composition, and fine-tune with a two-phase protocol (embedding warm-up + full fine-tune, or an optional three-phase variant with LoRA) that protects stability. Report geometry shifts (top-k neighbour overlap, embedding norm deltas, isotropy and PIP loss if requested), language behaviour (validation loss and perplexity on held-out Wake slices), and qualitative intrusion via short generation probes. The full pipeline reproduces on a Colab T4.
Style control is often attempted through prompts or full fine-tuning. Wake2Vec explores a third path: an embedding-first intervention that inserts Joyce-specific forms and trains the input layer in a controlled way. The goal is local, interpretable changes to semantic neighbourhoods under tight compute, with results that can be verified and challenged.
A hand-curated CSV lists type ∈ {prefix, suffix}, morpheme, and up to 10 examples per row. Parsed into:
morph["prefixes"]: {prefix → [examples...]}morph["suffixes"]: {suffix → [examples...]}
Morpheme CSV format:
type,morpheme,example1,example2,...,example10
prefix,pre-,prepare,preview,prelude
suffix,-ment,government,ailment,fragment
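A minimal parsing sketch for this format; the file path and the decision to keep morphemes verbatim (hyphens and all) are assumptions, not the project's exact loader:

```python
# Sketch: parse the morpheme CSV into the morph dict described above.
import csv
from collections import defaultdict

def load_morphemes(path="data/morphemes.csv"):
    morph = {"prefixes": defaultdict(list), "suffixes": defaultdict(list)}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # keep up to 10 non-empty example columns per row
            examples = [row[f"example{i}"] for i in range(1, 11)
                        if row.get(f"example{i}")]
            key = "prefixes" if row["type"] == "prefix" else "suffixes"
            morph[key][row["morpheme"]] = examples
    return morph
```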
Sample (prefix, root, suffix) with frequency weighting to generate Joyce-style words (e.g., pre+river+ation → priveration), then wrap in a few hundred short sentences to guarantee coverage.
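A hedged sketch of that sampling step; weighting the roots by wordfreq frequency and the wrapper sentence template are illustrative assumptions:

```python
# Sketch: coin Joyce-style words from (prefix, root, suffix) triples and wrap
# each one in a short sentence so every new form is seen in context.
import random
from wordfreq import word_frequency

def coin_synthetic_lines(morph, roots, n=600, seed=0):
    rng = random.Random(seed)
    prefixes = [p.strip("-") for p in morph["prefixes"]]
    suffixes = [s.strip("-") for s in morph["suffixes"]]
    weights = [word_frequency(r, "en") + 1e-9 for r in roots]  # frequency weighting
    lines = []
    for _ in range(n):
        p, s = rng.choice(prefixes), rng.choice(suffixes)
        r = rng.choices(roots, weights=weights, k=1)[0]
        word = f"{p}{r}{s}"                       # e.g. pre + river + ation
        lines.append(f"The {word} flows past the bend again.")  # template (assumption)
    return lines
```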
New forms are added to the tokenizer as plain tokens (bare forms + SentencePiece start-of-word variants ▁token). I disable mean-resizing when expanding the embedding matrix (resize_token_embeddings(..., mean_resizing=False)) so that custom initialisation is preserved, and I tie the output head to the input embeddings so the new vectors participate in prediction.
For new token w with greedy longest prefix/suffix match (p, s) and core r, set:
E(w) = α·E(p̄) + (1 − 2α)·E(r̄) + α·E(s̄) + ε
Here E(p̄), E(r̄), and E(s̄) denote averaged embeddings: the average over high-quality example words when a morpheme is not a single token, and the token's own embedding otherwise. ε is small Gaussian noise for diversity. If r is unseen, fall back to a small random vector scaled to the embedding standard deviation.
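A sketch of the injection plus compositional initialisation, assuming α = 0.25, a small noise scale, and (for brevity) averaging each morpheme's own subword embeddings rather than its curated example words:

```python
# Sketch of token injection + compositional init. alpha, the noise scale, and
# averaging a morpheme's subword pieces are simplifying assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_embed(tok, model, text):
    ids = tok(text, add_special_tokens=False)["input_ids"]
    return model.get_input_embeddings().weight[ids].mean(dim=0)

def inject_wake_tokens(new_words, decomp, base, alpha=0.25):
    """decomp[w] = (prefix, root, suffix) from the greedy longest match."""
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)
    # bare forms + SentencePiece start-of-word variants
    tok.add_tokens(sorted(set(new_words) | {"▁" + w for w in new_words}))
    model.resize_token_embeddings(len(tok), mean_resizing=False)  # keep custom init
    emb = model.get_input_embeddings().weight
    std = emb.std().item()
    with torch.no_grad():
        for w in new_words:
            p, r, s = decomp[w]
            vec = (alpha * mean_embed(tok, model, p)
                   + (1 - 2 * alpha) * mean_embed(tok, model, r)
                   + alpha * mean_embed(tok, model, s)
                   + 0.01 * std * torch.randn(emb.shape[1], device=emb.device))
            for form in (w, "▁" + w):
                emb[tok.convert_tokens_to_ids(form)] = vec
    model.tie_weights()  # new vectors participate in prediction via the tied head
    return tok, model
```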
Freeze everything except input_embeddings (+ tied head). Train on synthetic sentences + Wake blocks with Adafactor.
Typical hyperparameters:
- max_steps: 2000
- learning_rate: 5e-4 (AdamW)
- batch_size: 1
- gradient_accumulation_steps: 16
- warmup_ratio: 0.05
- save_steps: 100
- eval_steps: 200 (on held-out Wake blocks)
- use_cache = False
- fp16 = False (bf16 if available)
- Optional: gradient_checkpointing, only if memory is tight
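A minimal sketch of the Phase 1 freeze; a gradient hook additionally masks the base-vocabulary rows so only the injected tokens move. `num_base_tokens` (the vocabulary size before injection) is an assumed argument.

```python
# Sketch: freeze everything except the input embeddings (shared with the tied
# output head) and zero gradients for base-vocabulary rows.
import torch

def freeze_for_embedding_warmup(model, num_base_tokens):
    for p in model.parameters():
        p.requires_grad = False
    emb = model.get_input_embeddings().weight
    emb.requires_grad_(True)          # the tied head shares this tensor

    def mask_base_rows(grad):
        grad = grad.clone()
        grad[:num_base_tokens] = 0.0  # only injected rows receive updates
        return grad

    emb.register_hook(mask_base_rows)
    return model
```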
Unfreeze all parameters. Fine-tune on Finnegans Wake with conservative schedules, early stopping on validation loss, and pinned software versions.
Typical hyperparameters (TinyLlama / LLaMA on T4):
- num_train_epochs: 2
- learning_rate: 2e-5 (main model / LoRA)
- warmup_ratio: 0.10
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 2
- weight_decay: 0.01
- save_steps: 200
- early stopping on validation loss, patience = 2
- gradient_checkpointing = True (to manage memory)
- fp16 = False on T4 (bf16 preferred if available)
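A hedged Trainer setup reflecting these settings; the dataset objects and output directory are placeholders, not the repo's exact script:

```python
# Sketch: Phase 2 full fine-tune with early stopping on validation loss.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

def build_phase2_trainer(model, train_blocks, valid_blocks):
    args = TrainingArguments(
        output_dir="runs/p2_full",
        num_train_epochs=2,
        learning_rate=2e-5,
        warmup_ratio=0.10,
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        weight_decay=0.01,
        save_steps=200,
        eval_steps=200,
        eval_strategy="steps",          # must match save_strategy for best-model loading
        save_strategy="steps",
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        gradient_checkpointing=True,
        fp16=False,                     # keep fp16 off on T4
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_blocks,
        eval_dataset=valid_blocks,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )
```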
Same as Phase 1 of the two-phase protocol, establishing baseline Wake token embeddings.
Hyperparameters:
- max_steps: 1300
- lr: 5e-4
- Warmup ratio: 0.05
- grad_accum: 16
- batch_size: 1
- Sequence length: 256 tokens
Attach LoRA adapters to attention/MLP layers while keeping the embeddings frozen. Train on the Wake corpus to adapt model behaviour without disturbing the embedding space.
Typical hyperparameters:
- Epochs: 1-2
- lr: 2e-5
- Warmup: 0.10
- grad_accum: 16
- batch_size: 1
- LoRA rank: 8-16
- Target modules: q_proj, v_proj, mlp layers
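An illustrative peft setup for this phase; the target-module list follows the LLaMA-style names used here, and the `lora_alpha`/dropout values are assumptions:

```python
# Sketch: attach LoRA adapters while keeping the warmed-up embeddings frozen.
from peft import LoraConfig, TaskType, get_peft_model

def attach_lora(model, rank=8):
    cfg = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=rank,
        lora_alpha=2 * rank,      # assumption: the common 2x-rank convention
        lora_dropout=0.1,
        target_modules=["q_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    )
    model = get_peft_model(model, cfg)
    model.get_input_embeddings().weight.requires_grad_(False)  # embeddings stay put
    return model
```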
Unfreeze the embeddings and train with morpheme-aware regularisation, using the decomposition data (prefixes/suffixes) to enforce compositional semantics in the new token embeddings.
Loss components:
- L_lm: Standard language modeling loss
- L_morpheme: Compositional constraint forcing Wake tokens toward component averages, e.g. E["allbust"] ≈ mean(E["all"], E["bust"])
- L_repulsion: Adversarial term preventing Wake token collapse
- L_norm: Norm hygiene keeping Wake embeddings in distribution
Typical hyperparameters:
- max_steps: 400-800
- lr: 5e-5
- Warmup: 0.10
- Optimizer: Adafactor
- Gradient masking: New tokens only
- Loss weights: λ_morpheme=0.1, λ_repulsion=0.05, λ_norm=0.01
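A sketch of how these terms might combine, using the weights above. The exact repulsion and norm formulations are assumptions, and `decomp_ids` (mapping each new token id to its component token ids from the morpheme JSON) is an assumed input:

```python
# Sketch of the Phase 3 composite objective: LM loss plus morpheme, repulsion,
# and norm regularisers. Illustrative, not the repo's exact implementation.
import torch
import torch.nn.functional as F

def phase3_loss(model, batch, decomp_ids, base_norm,
                lam_morph=0.1, lam_rep=0.05, lam_norm=0.01):
    out = model(**batch)                     # batch carries input_ids + labels -> L_lm
    emb = model.get_input_embeddings().weight
    new_ids = torch.tensor(list(decomp_ids.keys()), device=emb.device)
    new_vecs = emb[new_ids]

    # L_morpheme: pull each Wake token toward the mean of its component embeddings
    targets = torch.stack([emb[torch.tensor(c, device=emb.device)].mean(0).detach()
                           for c in decomp_ids.values()])
    l_morph = F.mse_loss(new_vecs, targets)

    # L_repulsion: penalise Wake tokens that collapse onto one another
    sims = F.cosine_similarity(new_vecs.unsqueeze(1), new_vecs.unsqueeze(0), dim=-1)
    l_rep = (sims - torch.eye(len(new_ids), device=emb.device)).clamp(min=0).mean()

    # L_norm: keep Wake embedding norms near the base-vocabulary norm
    l_norm = ((new_vecs.norm(dim=-1) - base_norm) ** 2).mean()

    return out.loss + lam_morph * l_morph + lam_rep * l_rep + lam_norm * l_norm
```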
Data requirements:
- Morpheme decomposition mapping (JSON format)
- Prefix/suffix inventory with examples
- Component token validation in base vocabulary
Expected outcomes:
- Morphologically related tokens cluster in embedding space
- K-nearest neighbors reflect compositional structure
- Embedding norms remain stable relative to base vocabulary
- Isotropy preserved in extended vocabulary subspace
- Base text: Finnegans Wake plain text (blockified; small held-out slice)
- Synthetic sentences: ~600, each containing ≥1 injected token
- Token additions: Recent runs added 447–534 new tokens after filtering duplicates (varies by CSV)
- Tokenizer vocabulary size after expansion: 33,098 (base ≈ 32k → 32k+Δ)
- Maximum sequence length: 2,048 (standard); 384–512 on T4 for memory-constrained runs
- Datasets: Blockified Wake text with a held-out set (blockification sketched after this list)
- Train blocks: 1,566
- Valid blocks: 174
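A minimal blockification sketch consistent with these counts; the block length, file path, and shuffle-then-split are assumptions:

```python
# Sketch: tokenize the Wake text, cut fixed-length blocks, hold out ~10%.
import random

def blockify(tokenizer, path="data/finnegans_wake.txt",
             block_len=256, val_frac=0.10, seed=0):
    with open(path, encoding="utf-8") as f:
        ids = tokenizer(f.read(), add_special_tokens=False)["input_ids"]
    blocks = [ids[i:i + block_len]
              for i in range(0, len(ids) - block_len + 1, block_len)]
    random.Random(seed).shuffle(blocks)
    n_val = max(1, int(len(blocks) * val_frac))
    return blocks[n_val:], blocks[:n_val]     # train blocks, valid blocks
```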
Dependencies:
- Python 3.12
- transformers==4.57.1
- accelerate==1.2.1
- datasets>=2.21.0
- pyarrow==22.0.0
- peft>=0.11 (for LoRA experiments)
- bitsandbytes (optional, for 8-bit optimisers)
- umap-learn
- faiss-cpu
- wordfreq
- unidecode
- matplotlib
Colab quirk: If Trainer errors with unwrap_model(..., keep_torch_compile=...), pin accelerate>=1.2 or apply a tiny compatibility shim.
Performance notes:
- Keep `use_cache=False` during training
- Prefer Adafactor or 8-bit Adam on T4
- Avoid fp16 on T4 for this pipeline to maintain stability
- Enable gradient checkpointing in Phase 2 to reduce memory
The validation gap measures how much Wake is bending the model. An early embeddings-only pilot made this clear:
- Embeddings-only training on the full corpus
- No held-out set; metrics reported only on the training blocks
- Apparent near-zero loss was actually memorisation of Wake slices
- Train ↓, Val ↔ is a classic overfit signature, but here it also confirms that:
- P1 embeddings were correctly loaded (P2 starts around 4.5, not 7+),
- Wake is small/weird enough that the model can memorise it quickly.
- A validation loss of ~4.8 is a more honest measure of generalisation to unseen Wake text.
- The train/val gap that already existed in P1 simply wasn’t visible without a held-out set.
Rather than “fixing” this gap, later P1/P2 runs explicitly use it:
- P1 is now structured into regimes (sweet / plateau / fried) and uses the gap as a meltdown indicator.
- P2 branches (e.g. P2(sweet), P2(plateau), P2(fried)) treat different levels of overfitting as starting points, to see how much damage a full fine-tune can repair.
This pilot P2 run is kept as the moment the project stopped pretending val loss should be flat and started using it as a control knob.
- Geometry: Top-k neighbour overlap before and after, embedding norm deltas, optional isotropy and PIP loss (overlap computation sketched after this list)
- Language: Validation loss and optional perplexity on held-out Wake slice
- Behaviour: Short generation probes that seed with high-drift and low-drift tokens, nearest-neighbour maps saved to JSON for audit
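A minimal sketch of the top-k neighbour-overlap diagnostic, comparing cosine neighbours in pre- and post-training embedding snapshots for tokens present in both:

```python
# Sketch: fraction of shared top-k cosine neighbours before vs. after training.
import torch
import torch.nn.functional as F

def topk_neighbour_overlap(emb_before, emb_after, token_ids, k=5):
    """emb_*: (vocab, dim) float tensors; returns {token_id: overlap in [0, 1]}."""
    nb = F.normalize(emb_before, dim=-1)
    na = F.normalize(emb_after, dim=-1)
    overlap = {}
    for t in token_ids:
        before = (nb @ nb[t]).topk(k + 1).indices.tolist()
        after = (na @ na[t]).topk(k + 1).indices.tolist()
        before = [i for i in before if i != t][:k]   # drop the token itself
        after = [i for i in after if i != t][:k]
        overlap[t] = len(set(before) & set(after)) / k
    return overlap
```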
- Install dependencies: `pip install -r requirements.txt`
- Lexicon: Parse or regenerate the morpheme maps and write `wake_lexicon.txt`
- Token injection: Expand the tokenizer, compose embeddings, tie the head
- Training: Run Phase 1 embedding warm-up, then Phase 2 full fine-tune
- Metrics and report: Write snapshots, compute overlaps, and build `results/wake2vec_report.html`
# Place your morphemes at data/morphemes.csv
python wake2vec_morpheme_expansion.py --base_model TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T
# See runs/<id>/ for metrics, plots, and the HTML report

- On GPU: `BASE_MODEL="TinyLlama/TinyLlama-1.1B-Chat-v1.0"` is a good default
- On CPU: `BASE_MODEL="distilgpt2"` to smoke-test the pipeline
- `results/summary_stats.json`, `results/morpheme_comparison.json`
- `results/pre_morpheme_snapshot.json`, `results/post_morpheme_snapshot.json`
- `results/wake2vec_report.html` with t-SNE, histograms, and tables
- `checkpoints/*` and a `run_meta.json` that records hyperparameters and paths
runs/<id>/
├── morpheme_data.json
├── synthetic_lines.txt
├── tokenizer adapters (saved early to avoid ID drift)
├── summary_stats_p1.json, morpheme_comparison_p1.json
├── summary_stats_p3.json, morpheme_comparison_p3.json
├── phase2_loss_log.json, phase3_live_log.json
├── plots/
│ ├── hist_overlap_top5.png
│ ├── hist_norm_change.png
│ ├── scatter_norm_vs_overlap.png
│ ├── tsne_newtokens_vs_precentroids.png
│ └── phase loss curves
└── wake2vec_report.html
Optional tarball in archives/.
- If `load_best_model_at_end=True`, match `eval_strategy` and `save_strategy` to `"steps"`
- For OOM on T4: reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps`, shorten `MAX_LEN`, or switch Phase 2 to LoRA (recommended)
- Keep random seeds fixed for comparability across phases
- Prefer Adafactor or 8-bit Adam on T4
- Keep fp16 off on T4 for this pipeline
- Set `use_cache=False` during training to reduce memory
The repository contains multiple notebooks for different aspects of the pipeline:
- Lexicon generation: Build morpheme maps from Finnegans Wake
- Token injection: Expand tokenizer with compositional initialisation
- Two-phase training: Standard embedding warm-up + fine-tune
- Three-phase training: Experimental LoRA + embed-alignment
- Evaluation: Geometry diagnostics, neighbour analysis, visualisation
- Report generation: HTML artifacts with plots and tables
Each notebook is designed to be run independently or as part of the full pipeline.
For long-running training experiments on preemptible compute instances, the repository includes a dedicated monitoring notebook that provides non-invasive inspection of training progress without interfering with active processes.
The heartbeat system tracks:
- Training loss trajectory from JSON logs and trainer state files
- Evaluation metrics at configurable step intervals
- Checkpoint inventory across local ephemeral and persistent storage
- Embedding snapshot presence and modification times
- Age reporting for all artifacts in human-readable format
The monitoring system inspects three storage locations:
- Local ephemeral: `/content/runs/t4_*` (active training directory)
- Drive persistent: `/content/drive/MyDrive/wake2vec/runs/t4_*` (synchronized copy)
- Sentry backup: `/content/drive/MyDrive/wake2vec/sentry_backups/t4_*` (safety mirror)
Checkpoints are verified by checking for valid weight files (model.safetensors, pytorch_model.bin, or sharded variants). The system automatically identifies the most recent valid checkpoint suitable for resumption, excluding incomplete or corrupted saves.
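A sketch of that scan; the directory layout follows the storage locations listed above, and the weight-file patterns cover plain and sharded saves:

```python
# Sketch: pick the newest checkpoint-* directory that contains real weights.
import glob
import os

WEIGHT_PATTERNS = ("model.safetensors", "pytorch_model.bin",
                   "model-*-of-*.safetensors", "pytorch_model-*-of-*.bin")

def latest_valid_checkpoint(run_dir):
    candidates = []
    for ckpt in glob.glob(os.path.join(run_dir, "checkpoint-*")):
        step = ckpt.rsplit("-", 1)[-1]
        has_weights = any(glob.glob(os.path.join(ckpt, pat)) for pat in WEIGHT_PATTERNS)
        if step.isdigit() and has_weights:
            candidates.append((int(step), ckpt))
    return max(candidates)[1] if candidates else None
```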
The monitoring notebook is designed for manual execution at user-defined intervals. Typical usage patterns include hourly checks during active training, post-checkpoint verification after save events, and pre-resume validation before launching continuation runs.
The Llama/ directory contains experimental work extending Wake2Vec to larger language models, specifically Meta's Llama 3.1 8B and Llama 3.2 3B architectures.
While the primary Wake2Vec pipeline targets TinyLlama (1.1B parameters) for compute efficiency and rapid iteration, the Llama trials investigate whether the morpheme-aware embedding injection methodology scales to models with substantially larger capacity and more sophisticated language understanding.
Adapting Wake2Vec to Llama models introduced several technical constraints:
Memory limitations: Llama-3.2-1B requires 4-bit quantization via bitsandbytes to fit on Colab T4 GPUs (15GB VRAM). The working configuration uses NF4 quantization with double quantization enabled, allocating 13GB to GPU and 30GB to CPU offload.
Gated model access: Llama models require explicit approval from Meta via Hugging Face, introducing authentication steps in training pipelines.
Library compatibility (Nov 2025 Colab): Default Colab environment (torch 2.8.0, CUDA 12.9) conflicts with bitsandbytes and triton. The working configuration requires explicit downgrade: torch==2.5.1+cu121, triton==3.1.0, bitsandbytes==0.43.3, transformers==4.45.2, accelerate==0.34.2, peft==0.13.2. Runtime restart required after installation.
Gradient checkpointing incompatibility: 4-bit quantized models with LoRA adapters cannot use gradient checkpointing due to interaction between quantization and activation recomputation. This limits batch size options.
Llama-3.2-1B trials use the following configuration:
- Quantization: 4-bit NF4 with double quantization
- Sequence length: 256 tokens
- Batch size: 8 with gradient accumulation of 2 (effective batch: 16)
- Learning rate: 2e-5 (LoRA fine-tune phase)
- Scheduler: Cosine with 10% warmup
- PEFT adapter: LoRA r=8 on q_proj, v_proj, gate_proj, up_proj, down_proj
- Regularization: Weight decay 0.01, max grad norm 1.0, dropout 0.1
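A hedged loading sketch matching this configuration; the memory split mirrors the 13 GB GPU / 30 GB CPU offload noted earlier, and approved access to the gated weights is assumed:

```python
# Sketch: 4-bit NF4 load of a small Llama for the T4 trials.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,   # T4 has no bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",              # gated: requires approved HF access
    quantization_config=bnb,
    device_map="auto",
    max_memory={0: "13GiB", "cpu": "30GiB"},
)
```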
Right now, the implemented and tested parts of this repo are:
- TinyLlama P1 v2:
  - embedding-only fine-tune on Finnegans Wake with a 90/10 train/val split,
  - gradient-masked base vocab, ~44.5k Wake tokens trainable,
  - checkpoint + embedding snapshot infrastructure that survives Colab chaos.
- LLaMA P1 (experimental):
  - 4-bit NF4 quantised Llama-3.2-1B,
  - Wake token injection with spherical init,
  - embedding-only warm-up with full checkpoint mirroring to Drive.
The morpheme-aware initialisation and three-phase protocol are partially implemented.
- Text: James Joyce, Finnegans Wake
- Base model: TinyLlama-1.1B-Chat
- Conceptual inspiration from work on embedding surgery, retrofitting, and lightweight adapter methods
Cite: https://github.com/mahb97/Wake2vec/blob/21469d75c26d40988ec5af8a4358d1796a36fdf0/data/CITATION.cff