
# Structure of Mini-SGLang

## System Architecture

Mini-SGLang is designed as a distributed system to handle Large Language Model (LLM) inference efficiently. It consists of several independent processes working together.

### Key Components

- **API Server**: The entry point for users. It exposes an OpenAI-compatible API (e.g., `/v1/chat/completions`) to receive prompts and return generated text.
- **Tokenizer Worker**: Converts input text into the numeric tokens the model consumes.
- **Detokenizer Worker**: Converts the tokens generated by the model back into human-readable text.
- **Scheduler Worker**: The core worker process. In a multi-GPU setup there is one Scheduler Worker per GPU (referred to as a TP rank), managing the computation and resource allocation for that GPU.
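To make the division of labor concrete, here is a hypothetical sketch of the four-component topology, using `multiprocessing` queues in place of Mini-SGLang's real ZMQ sockets. All names and the toy character-level tokenizer are illustrative, not the actual code:

```python
# Hypothetical sketch of the process topology described above, with
# multiprocessing queues standing in for Mini-SGLang's ZMQ sockets.
import multiprocessing as mp

def tokenize(text: str) -> list[int]:
    # Toy stand-in for real tokenization.
    return [ord(c) for c in text]

def detokenize(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)

def tokenizer_worker(from_api, to_scheduler) -> None:
    to_scheduler.put(tokenize(from_api.get()))

def scheduler_worker(from_tokenizer, to_detokenizer) -> None:
    # Pretend the engine echoed the tokens back as its "generation".
    to_detokenizer.put(from_tokenizer.get())

def detokenizer_worker(from_scheduler, to_api) -> None:
    to_api.put(detokenize(from_scheduler.get()))

if __name__ == "__main__":
    q_in, q_tok, q_out, q_text = (mp.Queue() for _ in range(4))
    workers = [
        mp.Process(target=tokenizer_worker, args=(q_in, q_tok)),
        mp.Process(target=scheduler_worker, args=(q_tok, q_out)),
        mp.Process(target=detokenizer_worker, args=(q_out, q_text)),
    ]
    for w in workers:
        w.start()
    q_in.put("Hello, Mini-SGLang!")   # the API server forwards the prompt
    print(q_text.get())               # → Hello, Mini-SGLang!
    for w in workers:
        w.join()
```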

## Data Flow

The components communicate using ZeroMQ (ZMQ) for control messages and NCCL (via torch.distributed) for heavy tensor data exchange between GPUs.
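The control-message path can be sketched with pyzmq. A PUSH/PULL pair over an in-process transport stands in here; Mini-SGLang's actual socket types, endpoints, and payloads may differ:

```python
# Minimal sketch of a ZMQ control channel, assuming pyzmq is installed.
import zmq

ctx = zmq.Context.instance()
receiver = ctx.socket(zmq.PULL)
receiver.bind("inproc://control")     # e.g., the scheduler side
sender = ctx.socket(zmq.PUSH)
sender.connect("inproc://control")    # e.g., the tokenizer side

# Control messages are small Python objects; heavy tensors never travel
# this path. They move between GPUs via NCCL instead.
sender.send_pyobj({"req_id": 1, "token_ids": [101, 2023, 102]})
msg = receiver.recv_pyobj()
print(msg["req_id"])                  # → 1
```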

*(Process overview diagram)*

### Request Lifecycle

  1. User sends a request to the API Server.
  2. API Server forwards it to the Tokenizer.
  3. Tokenizer converts text to tokens and sends them to the Scheduler (Rank 0).
  4. Scheduler (Rank 0) broadcasts the request to all other Schedulers (if using multiple GPUs).
  5. All Schedulers schedule the request and trigger their local Engine to compute the next token.
  6. Scheduler (Rank 0) collects the output token and sends it to the Detokenizer.
  7. Detokenizer converts the token to text and sends it back to the API Server.
  8. API Server streams the result back to the User.
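Steps 4 through 6 can be modeled in miniature. This toy uses threads as stand-ins for the per-GPU scheduler processes and queues in place of the NCCL broadcast; only the control flow (rank 0 broadcasts, every rank steps its engine, rank 0 reports) matches the description above:

```python
# Toy model of steps 4-6: rank 0 broadcasts the request, all ranks run
# one engine step, and only rank 0 emits the output token.
import queue
import threading

TP_SIZE = 4

def engine_step(token_id: int) -> int:
    # Stand-in for the model forward pass. After the tensor-parallel
    # all-reduce, every rank arrives at the same next token.
    return token_id + 1

def scheduler(rank: int, inbox: queue.Queue, outbox: queue.Queue) -> None:
    token = inbox.get()                # receive the broadcast request
    next_token = engine_step(token)    # step the local engine
    if rank == 0:                      # only rank 0 reports the output
        outbox.put(next_token)

inboxes = [queue.Queue() for _ in range(TP_SIZE)]
outbox = queue.Queue()
threads = [
    threading.Thread(target=scheduler, args=(r, inboxes[r], outbox))
    for r in range(TP_SIZE)
]
for t in threads:
    t.start()
for box in inboxes:                    # model rank 0's broadcast
    box.put(41)
for t in threads:
    t.join()
result = outbox.get()
print(result)                          # → 42
```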

## Code Organization (`minisgl` Package)

The source code is located in `python/minisgl`. Here is a breakdown of the modules for developers:

- `minisgl.core`: Provides the core dataclasses `Req` and `Batch`, which represent request state; the `Context` class, which holds the global state of the inference context; and the `SamplingParams` class, which holds the sampling parameters provided by users.
- `minisgl.distributed`: Provides the all-reduce and all-gather interface for tensor parallelism, and the `DistributedInfo` dataclass, which holds the TP information for a TP worker.
- `minisgl.layers`: Implements the basic building blocks for assembling LLMs with TP support, including linear, layernorm, embedding, RoPE, etc. They share common base classes defined in `minisgl.layers.base`.
- `minisgl.models`: Implements LLM models, including Llama and Qwen3. Also defines utilities for loading weights from Hugging Face and sharding them.
- `minisgl.attention`: Defines the attention backend interface and implements FlashAttention and FlashInfer backends. They are called by `AttentionLayer` and use metadata stored in `Context`.
- `minisgl.kvcache`: Defines the KV cache pool and KV cache manager interfaces, and implements `MHAKVCache`, `NaiveCacheManager`, and `RadixCacheManager`.
- `minisgl.utils`: Provides a collection of utilities, including logger setup and wrappers around `zmq`.
- `minisgl.engine`: Implements the `Engine` class, a TP worker running in a single process. It manages the model, context, KV cache, attention backend, and CUDA graph replay.
- `minisgl.message`: Defines the messages exchanged (over ZMQ) between the API server, tokenizer, detokenizer, and scheduler. All message types support automatic serialization and deserialization.
- `minisgl.scheduler`: Implements the `Scheduler` class, which runs on each TP worker process and manages the corresponding `Engine`. The rank-0 scheduler receives messages from the tokenizer, communicates with the schedulers on the other TP workers, and sends messages to the detokenizer.
- `minisgl.server`: Defines the CLI arguments and `launch_server`, which starts all the subprocesses of Mini-SGLang. Also implements a FastAPI server in `minisgl.server.api_server` acting as a frontend, providing endpoints such as `/v1/chat/completions`.
- `minisgl.tokenizer`: Implements the `tokenize_worker` function, which handles tokenization and detokenization requests.
- `minisgl.llm`: Provides the `LLM` class as a Python interface for interacting with the Mini-SGLang system easily.
- `minisgl.kernel`: Implements custom CUDA kernels, with `tvm-ffi` providing the Python bindings and JIT interface.
- `minisgl.benchmark`: Benchmark utilities.
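The `minisgl.message` idea of dataclass messages with automatic serialization can be sketched as follows. `TokenizedRequest` and its fields are hypothetical names, and `pickle` is used for brevity; the real serialization scheme may differ:

```python
# Illustrative sketch: a dataclass message serialized for ZMQ transport.
import pickle
from dataclasses import dataclass

@dataclass
class TokenizedRequest:
    # Hypothetical message from the tokenizer to the rank-0 scheduler.
    req_id: int
    token_ids: list[int]

msg = TokenizedRequest(req_id=7, token_ids=[1, 15043, 2])
wire = pickle.dumps(msg)        # bytes that would travel over the socket
restored = pickle.loads(wire)
print(restored == msg)          # → True
```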