A small GPT-style Transformer language model trained to generate short stories.
It uses a ByteLevel BPE tokenizer and a causal self-attention model implemented in PyTorch.
- Pretrained weights + tokenizer (trained by me, 29M parameters): https://huggingface.co/DanielG9/Story-LLM
- Dataset used: https://huggingface.co/datasets/roneneldan/TinyStories
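For readers new to causal self-attention: "causal" just means position *i* may only attend to positions *j ≤ i*, so the model can't peek at future tokens. A minimal sketch of that mask (plain Python for illustration; the repo's PyTorch implementation will differ):

```python
def causal_mask(n):
    """Lower-triangular attention mask: entry [i][j] is True when
    position i is allowed to attend to position j (i.e. j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# For a sequence of length 3, each row unlocks one more position:
print(causal_mask(3))  # → [[True, False, False], [True, True, False], [True, True, True]]
```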
This repo contains everything needed to:
- download and preprocess a TinyStories subset,
- train a small Transformer language model,
- download my pretrained checkpoint + tokenizer from Hugging Face.
Install dependencies:

```bash
pip install -r requirements.txt
```

Then download the pretrained model:

```bash
python code/download_model.py
```

This downloads:

- `checkpoints/model.pt`
- `tokenizer/tokenizer.json`
Generate text:

```bash
python code/run.py
```

You’ll get an interactive prompt: type a prompt and press Enter to generate text.
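Under the hood, generation is autoregressive: the prompt is tokenized, the model predicts the next token, the token is appended, and the loop repeats. A hedged sketch of that loop (the `next_token` callable below is a toy stand-in, not the repo's actual model interface):

```python
def generate(prompt_ids, next_token, max_new_tokens):
    """Autoregressive decoding: repeatedly predict one token id from the
    running sequence and append it."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(next_token(ids))
    return ids

# Toy "model" that returns the last id plus one:
print(generate([1, 2], lambda ids: ids[-1] + 1, 3))  # → [1, 2, 3, 4, 5]
```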
Download the dataset:

```bash
python code/download_dataset.py
```

This saves `data/train.parquet`.
Train the ByteLevel BPE tokenizer:

```bash
python code/tokenizer.py
```

This saves `tokenizer/tokenizer.json`.
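For intuition about what BPE training does, here is a toy sketch of a single merge step (illustrative only; `tokenizer.py` presumably relies on the Hugging Face `tokenizers` library rather than code like this): count adjacent token pairs, pick the most frequent, and merge every occurrence into one token. Repeating this until the vocabulary reaches `VOCAB_SIZE` is the essence of BPE.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("abab")
pair = most_frequent_pair(tokens)      # ('a', 'b') occurs twice
print(merge(tokens, pair))             # → ['ab', 'ab']
```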
Train the model:

```bash
python code/train.py
```

Checkpoints are saved to `checkpoints/model_epoch_*.pt`.
Training resumes automatically only if you place a checkpoint at `checkpoints/model-start.pt`, then run:

```bash
python code/train.py
```

Note: training will raise an error if the checkpoint’s saved `model_config` does not match your current settings in `code/model_params.py`.
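The mismatch guard might look roughly like this (a hedged sketch; the actual check in `train.py` may differ in names and detail):

```python
def check_resume_config(saved_config, current_config):
    """Refuse to resume from a checkpoint whose saved model_config differs
    from the current settings in code/model_params.py."""
    if saved_config != current_config:
        raise ValueError(
            f"checkpoint model_config {saved_config!r} does not match "
            f"current settings {current_config!r}"
        )

# Matching configs pass silently; a mismatch raises ValueError.
check_resume_config({"EMBED_DIM": 512}, {"EMBED_DIM": 512})
```

This fails loudly rather than silently loading weights into a model of the wrong shape, which would otherwise surface as a confusing tensor-size error.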
```
.
├─ code/
│  ├─ download_dataset.py   # Downloads TinyStories and writes data/train.parquet
│  ├─ download_model.py     # Downloads pretrained weights + tokenizer from HF
│  ├─ model.py              # Transformer model
│  ├─ model_params.py       # All hyperparameters and paths
│  ├─ run.py                # Run the model
│  ├─ tokenizer.py          # Trains and saves ByteLevel BPE tokenizer
│  └─ train.py              # Training loop + checkpoint saving
├─ checkpoints/             # Saved model checkpoints (ignored by git)
├─ data/                    # Dataset parquet files (ignored by git)
├─ tokenizer/               # tokenizer.json (ignored by git)
└─ requirements.txt
```

All key settings live in `code/model_params.py`, including:

- tokenizer vocab size (`VOCAB_SIZE`)
- model size (`EMBED_DIM`, `NUM_LAYERS`, `NUM_HEADS`, `MAX_MODEL_CONTEXT`)
- training params (`BATCH_SIZE`, `EPOCHS`, `LEARNING_RATE`, etc.)
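For illustration, a settings module of this shape might look as follows. All values here are hypothetical placeholders, not the repo's actual configuration; check `code/model_params.py` for the real ones.

```python
# Hypothetical example values only — see code/model_params.py for the real settings.
VOCAB_SIZE = 8192           # tokenizer vocabulary size
EMBED_DIM = 512             # embedding / hidden dimension
NUM_LAYERS = 8              # number of Transformer blocks
NUM_HEADS = 8               # attention heads per block (must divide EMBED_DIM)
MAX_MODEL_CONTEXT = 512     # maximum sequence length in tokens
BATCH_SIZE = 64
EPOCHS = 3
LEARNING_RATE = 3e-4
```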