Skip to content

unruli/chibu-chat

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

153 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

chibuchat

chibuchat logo

The best ChatGPT that $100 can buy.

Chibuchat Tips

To get started with a runpod one-click template (affiliate), click here, select 8 x H100 SXM (or you can try NVL or PCIe or A100 [slowest]) and then ssh in and run speedrun as below.

The Runpod template injects the following variables, which you'll want to set via secrets: Runpod Template environment variables

Once ssh'd in, you can install screen and start the whole run:

cd ~/chibu-chat
apt-get update && apt-get install -y screen
export WANDB_RUN=d20
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh
  • To safely detach from the session press: Ctrl+A then D (that's Ctrl+A, release, then press D).
  • To see the logs use tail -f speedrun.log.

Note that:

  • passing WANDB_RUN will set the name for the run and start logging to WANDB, which will use your WANDB_API_KEY if set in env OR prompt for login during startup

Resuming or continuing after an interruption

Reuse your original session, if still running, with screen -r speedrun, or start a new session by re-running the whole run commands above, optionally commenting out lines in speedrun.sh that you do not wish to re-run.

Uploading checkpoints to Hugging Face

python -m scripts.push_to_hf uploads base/mid/sft checkpoints directly from $NANOCHAT_BASE_DIR:

source .venv/bin/activate
export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
# Base/Mid/SFT checkpoints into subfolders of one repo
python -m scripts.push_to_hf --stage base --repo-id unruli/chibu-chat --path-in-repo base/d20
python -m scripts.push_to_hf --stage mid  --repo-id unruli/chibu-chat --path-in-repo mid/d20
python -m scripts.push_to_hf --stage sft  --repo-id unruli/chibu-chat --path-in-repo sft/d20

# Report + Tokenizer folders (use --model-dir explicitly)
# export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
python -m scripts.push_to_hf --model-dir "$NANOCHAT_BASE_DIR/report" \
  --repo-id unruli/chibu-chat --path-in-repo report/latest
python -m scripts.push_to_hf --model-dir "$NANOCHAT_BASE_DIR/tokenizer" \
  --repo-id unruli/chibu-chat --path-in-repo tokenizer/latest

Downloading checkpoints from HuggingFace

To download checkpoints, you first need to run installs:

command -v uv &> /dev/null || curl -LsSf https://astral.sh/uv/install.sh | sh
[ -d ".venv" ] || uv venv
uv sync --extra gpu
source .venv/bin/activate

# Example: grab SFT d20 into the local cache
python -m scripts.pull_from_hf --repo-id unruli/chibu-chat \
  --repo-path sft/d20 --stage sft --target-tag d20

# Tokenizer / report assets
export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
python -m scripts.pull_from_hf --repo-id unruli/chibu-chat \
  --repo-path tokenizer/latest --dest-dir "$NANOCHAT_BASE_DIR/tokenizer"
python -m scripts.pull_from_hf --repo-id unruli/chibu-chat \
  --repo-path report/latest --dest-dir "$NANOCHAT_BASE_DIR/report"

Downloads land in $NANOCHAT_BASE_DIR/{base,mid,chatsft} (or the --dest-dir you provided), so scripts like chat_web see them right away.

Running the chat interface (after training OR downloading checkpoints)

When you're done with training and pushing to hub (or you have just downloaded checkpoints), port 8000 is exposed on the runpod template, so you can run:

source .venv/bin/activate
export NANOCHAT_BASE_DIR="$HOME/.cache/nanochat"
python -m scripts.chat_web

and then access it via https://21kgtegp93ibt7-8000.proxy.runpod.net.

About ChibuChat

This repo is a full-stack implementation of an LLM like ChatGPT in a single, clean, minimal, hackable, dependency-lite codebase. chibuchat is designed to run on a single 8XH100 node via scripts like speedrun.sh, that run the entire pipeline start to end. This includes tokenization, pretraining, finetuning, evaluation, inference, and web serving over a simple UI so that you can talk to your own LLM just like ChatGPT. chibuchat is maintained as its own project direction.

Talk to it

To get a sense of the endpoint of this repo, you can currently find chibuchat d32 hosted on nanochat.karpathy.ai. "d32" means that this model has 32 layers in the Transformer neural network. This model has 1.9 billion parameters, it was trained on 38 billion tokens by simply running the single script run1000.sh, and the total cost of training was ~$800 (about 33 hours training time on 8XH100 GPU node). While today this is enough to outperform GPT-2 of 2019, it falls dramatically short of modern Large Language Models like GPT-5. When talking to these micro models, you'll see that they make a lot of mistakes, they are a little bit naive and silly and they hallucinate a ton, a bit like children. It's kind of amusing. But what makes chibuchat unique is that it is fully yours - fully configurable, tweakable, hackable, and trained by you from start to end. To train and talk to your own, we turn to...

Quick start

The fastest way to feel the magic is to run the speedrun script speedrun.sh, which trains and inferences the $100 tier of chibuchat. On an 8XH100 node at $24/hr, this gives a total run time of about 4 hours. Boot up a new 8XH100 GPU box from your favorite provider (e.g. I use and like Lambda), and kick off the training script:

bash speedrun.sh

Alternatively, since the script runs for 4 hours, I like to launch it like this inside a new screen session speedrun (and also log output to speedrun.log):

screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh

See the screen cheatsheet if you are less familiar. You can watch it go inside the screen session, or detach with Ctrl-a d and tail speedrun.log to view progress. Now wait 4 hours. Once it's done, you can talk to your LLM via the ChatGPT-like web UI. Make sure again that your local uv virtual environment is active (run source .venv/bin/activate), and serve it:

python -m scripts.chat_web

And then visit the URL shown. Make sure to access it correctly, e.g. on Lambda use the public IP of the node you're on, followed by the port, so for example http://209.20.xxx.xxx:8000/, etc. Then talk to your LLM as you'd normally talk to ChatGPT! Get it to write stories or poems. Ask it to tell you who you are to see a hallucination. Ask it why the sky is blue. Or why it's green. The speedrun is a 4e19 FLOPs capability model so it's a bit like talking to a kindergartener :).


image

You can also cat report.md file which appeared in the project directory and contains the "report card" of the run, i.e. a bunch of evaluations and metrics. At the very end, you'll see a summary table, for example:


  • Characters: 333,989
  • Lines: 8,304
  • Files: 44
  • Tokens (approx): 83,497
  • Dependencies (uv.lock lines): 2,004
Metric BASE MID SFT RL
CORE 0.2219 - - -
ARC-Challenge - 0.2875 0.2807 -
ARC-Easy - 0.3561 0.3876 -
GSM8K - 0.0250 0.0455 0.0758
HumanEval - 0.0671 0.0854 -
MMLU - 0.3111 0.3151 -
ChatCORE - 0.0730 0.0884 -

Total wall clock time: 3h51m


(Your table might be missing the RL number by default). For a lot more information around the speedrun script and what to look for and expect, please refer to the walkthrough that I posted in Discussions of the repo: "Introducing chibuchat: The best ChatGPT that $100 can buy".

Bigger models

Unsurprisingly, $100 is not enough to train a highly performant ChatGPT clone. In fact, LLMs are famous for their multi-million dollar capex. For our purposes, I think there are two more scales of interest. First is the ~$300 tier d26 model (i.e. depth=26) that trains in ~12 hours, which slightly outperforms GPT-2 CORE score. Second is the $1000 tier (~41.6 hours), just because it's a nice round number. But both of these are not yet fully supported and therefore not attached here in the master branch yet.

That said, to give a sense, the example changes needed for the speedrun.sh file to train a GPT-2 grade model d26 only involve three changes:

...
# you'll need to download more data shards for pretraining
# get the number of parameters, multiply 20 to get tokens, multiply by 4.8 to get chars,
# divide by 250 million to get number of shards. todo need to improve this...
python -m nanochat.dataset -n 450 &
...
# use --depth to increase model size. to not oom, halve device batch size 32 -> 16:
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --device_batch_size=16
...
# make sure to use the same later during midtraining:
torchrun --standalone --nproc_per_node=8 -m scripts.mid_train -- --device_batch_size=16

That's it! The biggest thing to pay attention to is making sure you have enough data shards to train on (the code will loop and do more epochs over the same training set otherwise, decreasing learning speed a bit), and managing your memory/VRAM, primarily by decreasing the device_batch_size until things fit (the scripts automatically compensate by increasing the number of gradient accumulation loops, simply turning parallel compute to sequential compute).

And a bit more about computing environments that will run chibuchat:

  • The code will run just fine on the Ampere 8XA100 GPU node as well, but a bit slower.
  • All code will run just fine on even a single GPU by omitting torchrun, and will produce ~identical results (code will automatically switch to gradient accumulation), but you'll have to wait 8 times longer.
  • If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for --device_batch_size in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1. Less than that you'll have to know a bit more what you're doing and get more creative.
  • Most of the code is fairly vanilla PyTorch so it should run on anything that supports that - xpu, mps, or etc, but I haven't implemented this out of the box so it might take a bit of tinkering.

Running on CPU / MPS

chibuchat can be run on CPU or on MPS (if you're on Macbook), and will automatically try to detect what device is best to run on. You're not going to get too far without GPUs, but at least you'll be able to run the code paths and maybe train a tiny LLM with some patience. For an example of how to make all the run commands much smaller (feel free to tune!), you can refer to dev/run=cpu.sh file. You'll see that I'm essentially restricting all scripts to train smaller models, to run for shorter number of iterations, etc. This functionality is new, slightly gnarly (touched a lot of code), and was merged in this CPU|MPS PR on Oct 21, 2025.

Customization

To customize your chibuchat, see Guide: infusing identity to your nanochat in Discussions, which describes how you can tune your chibuchat personality through synthetic data generation and mixing that data into midtraining and SFT stages.

Additionally, to add new abilities to chibuchat, see Guide: counting r in strawberry (and how to add abilities generally).

Questions

chibuchat is designed to be short and sweet. One big advantage of this is that we can package up all of the files together and copy paste them to your favorite LLM to ask arbitrary questions. As an example, I like to package up the repo using the files-to-prompt utility like so:

files-to-prompt . -e py -e md -e rs -e html -e toml -e sh --ignore "*target*" --cxml > packaged.txt

This includes all py, rs, html, toml, sh files, excludes the rustbpe/target folder, and chooses the cxml output format. Everything is written to the packaged.txt file, which atm measures ~330KB (i.e. well below ~100K tokens for a state of the art LLM), and ~8K lines of code in 45 files.

Alternatively, I recommend using DeepWiki from Devin/Cognition to ask questions of this repo. In the URL of this repo, simply change github.com to deepwiki.com, and you're off.

Tests

I haven't invested too much here but some tests exist, especially for the tokenizer. Run e.g. as:

python -m pytest tests/test_rustbpe.py -v -s

File structure

.
├── LICENSE
├── README.md
├── dev
│   ├── gen_synthetic_data.py       # Example synthetic data for identity
│   ├── generate_logo.html
│   ├── nanochat.png
│   ├── repackage_data_reference.py # Pretraining data shard generation
│   └── runcpu.sh                   # Small example of how to run on CPU/MPS
├── nanochat
│   ├── __init__.py                 # empty
│   ├── adamw.py                    # Distributed AdamW optimizer
│   ├── checkpoint_manager.py       # Save/Load model checkpoints
│   ├── common.py                   # Misc small utilities, quality of life
│   ├── configurator.py             # A superior alternative to argparse
│   ├── core_eval.py                # Evaluates base model CORE score (DCLM paper)
│   ├── dataloader.py               # Tokenizing Distributed Data Loader
│   ├── dataset.py                  # Download/read utils for pretraining data
│   ├── engine.py                   # Efficient model inference with KV Cache
│   ├── execution.py                # Allows the LLM to execute Python code as tool
│   ├── gpt.py                      # The GPT nn.Module Transformer
│   ├── logo.svg
│   ├── loss_eval.py                # Evaluate bits per byte (instead of loss)
│   ├── muon.py                     # Distributed Muon optimizer
│   ├── report.py                   # Utilities for writing the chibuchat report
│   ├── tokenizer.py                # BPE Tokenizer wrapper in style of GPT-4
│   └── ui.html                     # HTML/CSS/JS for chibuchat frontend
├── pyproject.toml
├── run1000.sh                      # Train the ~$800 chibuchat d32
├── rustbpe                         # Custom Rust BPE tokenizer trainer
│   ├── Cargo.lock
│   ├── Cargo.toml
│   ├── README.md                   # see for why this even exists
│   └── src
│       └── lib.rs
├── scripts
│   ├── base_eval.py                # Base model: calculate CORE score
│   ├── base_loss.py                # Base model: calculate bits per byte, sample
│   ├── base_train.py               # Base model: train
│   ├── chat_cli.py                 # Chat model (SFT/Mid): talk to over CLI
│   ├── chat_eval.py                # Chat model (SFT/Mid): eval tasks
│   ├── chat_rl.py                  # Chat model (SFT/Mid): reinforcement learning
│   ├── chat_sft.py                 # Chat model: train SFT
│   ├── chat_web.py                 # Chat model (SFT/Mid): talk to over WebUI
│   ├── mid_train.py                # Chat model: midtraining
│   ├── tok_eval.py                 # Tokenizer: evaluate compression rate
│   └── tok_train.py                # Tokenizer: train it
├── speedrun.sh                     # Train the ~$100 chibuchat d20
├── tasks
│   ├── arc.py                      # Multiple choice science questions
│   ├── common.py                   # TaskMixture | TaskSequence
│   ├── customjson.py               # Make Task from arbitrary jsonl convos
│   ├── gsm8k.py                    # 8K Grade School Math questions
│   ├── humaneval.py                # Misnomer; Simple Python coding task
│   ├── mmlu.py                     # Multiple choice questions, broad topics
│   ├── smoltalk.py                 # Conglomerate dataset of SmolTalk from HF
│   └── spellingbee.py              # Task teaching model to spell/count letters
├── tests
│   └── test_engine.py
│   └── test_rustbpe.py
└── uv.lock

License

MIT

About

The best ChatGPT that $100 can buy.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 81.6%
  • Shell 10.7%
  • HTML 4.2%
  • Rust 3.5%