Mini-SGLang is designed as a distributed system to handle Large Language Model (LLM) inference efficiently. It consists of several independent processes working together.
- API Server: The entry point for users. It provides an OpenAI-compatible API (e.g., `/v1/chat/completions`) to receive prompts and return generated text.
- Tokenizer Worker: Converts input text into numbers (tokens) that the model can understand.
- Detokenizer Worker: Converts the numbers (tokens) generated by the model back into human-readable text.
- Scheduler Worker: The core worker process. In a multi-GPU setup, there is one Scheduler Worker for each GPU (referred to as a TP Rank). It manages the computation and resource allocation for that specific GPU.
The components communicate using ZeroMQ (ZMQ) for control messages and NCCL (via torch.distributed) for heavy tensor data exchange between GPUs.
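The control-plane side of this split can be sketched as a serialize/deserialize round-trip of the kind a ZMQ socket would carry. This is a minimal illustration only: the `TokenizedRequest` dataclass and its fields are hypothetical stand-ins, and plain `pickle` is used here for brevity (the real message types live in `minisgl.message`):

```python
import pickle
from dataclasses import dataclass

# Hypothetical control message; the field names are illustrative,
# not Mini-SGLang's actual message schema.
@dataclass
class TokenizedRequest:
    req_id: str
    token_ids: list[int]

def to_wire(msg) -> bytes:
    """Serialize a message into the bytes a ZMQ socket.send() would carry."""
    return pickle.dumps(msg)

def from_wire(raw: bytes):
    """Deserialize a message received via ZMQ socket.recv()."""
    return pickle.loads(raw)

msg = TokenizedRequest(req_id="req-0", token_ids=[1, 15043, 2])
assert from_wire(to_wire(msg)) == msg
```

Heavy per-token tensor traffic between GPUs stays off this path entirely; only small control messages like the one above travel over ZMQ.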
Request Lifecycle:
- User sends a request to the API Server.
- API Server forwards it to the Tokenizer.
- Tokenizer converts text to tokens and sends them to the Scheduler (Rank 0).
- Scheduler (Rank 0) broadcasts the request to all other Schedulers (if using multiple GPUs).
- All Schedulers schedule the request and trigger their local Engine to compute the next token.
- Scheduler (Rank 0) collects the output token and sends it to the Detokenizer.
- Detokenizer converts the token to text and sends it back to the API Server.
- API Server streams the result back to the User.
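The lifecycle above can be condensed into a single-process sketch, with one function per worker role. The toy vocabulary and one-token "model" are stand-ins for illustration only, not Mini-SGLang code:

```python
# Toy vocabulary; real systems use a trained tokenizer (e.g. from Hugging Face).
VOCAB = {"hello": 1, "world": 2}
INV_VOCAB = {v: k for k, v in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    """Tokenizer Worker: text -> token ids."""
    return [VOCAB[w] for w in text.split()]

def schedule_and_generate(tokens: list[int]) -> int:
    """Scheduler + Engine: toy 'model' that emits the token after the last input."""
    return tokens[-1] + 1

def detokenize(token: int) -> str:
    """Detokenizer Worker: token id -> text."""
    return INV_VOCAB[token]

def handle_request(prompt: str) -> str:
    """API Server: drives one pass through the pipeline."""
    tokens = tokenize(prompt)
    next_token = schedule_and_generate(tokens)
    return detokenize(next_token)

print(handle_request("hello"))  # -> "world"
```

In the real system each of these steps runs in a separate process and the hand-offs happen over ZMQ, but the data flow is the same.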
The source code is located in python/minisgl. Here is a breakdown of the modules for developers:
- `minisgl.core`: Provides the core dataclasses `Req` and `Batch`, which represent request state; the `Context` class, which holds the global state of the inference context; and `SamplingParams`, which holds the sampling parameters provided by users.
- `minisgl.distributed`: Provides the all-reduce and all-gather interfaces for tensor parallelism, and the dataclass `DistributedInfo`, which holds the TP information for a TP worker.
- `minisgl.layers`: Implements basic building blocks for assembling LLMs with TP support, including linear, layernorm, embedding, RoPE, etc. They share common base classes defined in `minisgl.layers.base`.
- `minisgl.models`: Implements LLM models, including Llama and Qwen3. Also defines utilities for loading weights from Hugging Face and sharding weights.
- `minisgl.attention`: Provides the attention backend interface and implements the `flashattention` and `flashinfer` backends. They are called by `AttentionLayer` and use metadata stored in `Context`.
- `minisgl.kvcache`: Provides the KV cache pool and KV cache manager interfaces, and implements `MHAKVCache`, `NaiveCacheManager`, and `RadixCacheManager`.
- `minisgl.utils`: Provides a collection of utilities, including logger setup and wrappers around ZMQ.
- `minisgl.engine`: Implements the `Engine` class, a TP worker in a single process. It manages the model, context, KV cache, attention backend, and CUDA graph replay.
- `minisgl.message`: Defines the messages exchanged (over ZMQ) between the API server, tokenizer, detokenizer, and scheduler. All message types support automatic serialization and deserialization.
- `minisgl.scheduler`: Implements the `Scheduler` class, which runs on each TP worker process and manages the corresponding `Engine`. The rank-0 scheduler receives messages from the tokenizer, communicates with the schedulers on the other TP workers, and sends messages to the detokenizer.
- `minisgl.server`: Defines the CLI arguments and `launch_server`, which starts all the subprocesses of Mini-SGLang. Also implements a FastAPI server in `minisgl.server.api_server` acting as a frontend, providing endpoints such as `/v1/chat/completions`.
- `minisgl.tokenizer`: Implements the `tokenize_worker` function, which handles tokenization and detokenization requests.
- `minisgl.llm`: Provides the `LLM` class as a convenient Python interface to the Mini-SGLang system.
- `minisgl.kernel`: Implements custom CUDA kernels, using `tvm-ffi` for Python bindings and a JIT interface.
- `minisgl.benchmark`: Benchmark utilities.
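The idea behind a radix-style KV cache manager such as `RadixCacheManager` is that requests sharing a token prefix can reuse the KV entries already computed for that prefix. A toy sketch of the prefix-matching technique, using a plain token trie (the real manager compresses shared edges and manages eviction; this is an illustration, not its API):

```python
# Toy prefix cache: a trie over token ids. match_prefix() reports how many
# leading tokens of a new request already have cached KV entries.
class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens: list[int]) -> None:
        """Record the token sequence of a finished request."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens: list[int]) -> int:
        """Return the number of leading tokens already in the cache."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                      # KV cached for an earlier request
assert cache.match_prefix([1, 2, 3, 9]) == 3    # 3 positions can be reused
assert cache.match_prefix([5, 6]) == 0          # no shared prefix
```

The matched prefix length tells the scheduler how many positions can skip prefill computation, which is what makes shared-prompt workloads (e.g. chat templates) cheap.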
