A comprehensive distributed GPU cluster management system combining profiling, model storage, and orchestration capabilities.
gswarm is an integrated platform for managing GPU clusters, providing:
- GPU Profiling: Multi-node GPU monitoring and performance analysis
- Model Management: Distributed model storage, deployment, and serving
- Data Pooling: Efficient data management across nodes
- Task Orchestration: Queue-based asynchronous task execution
The system uses a host-client architecture where a central host node coordinates operations across multiple client nodes, enabling unified management of your entire GPU infrastructure.
- Monitor GPU utilization and memory usage across multiple machines
- Track PCIe bandwidth (GPU-DRAM) and NVLink (GPU-GPU) connections
- Configurable sampling frequency with JSON output
- Built on nvitop for accurate GPU metrics
- Fault tolerance with automatic reconnection
- Session recovery after crashes
- Distributed model storage across disk, DRAM, and GPU memory
- Automatic model deployment and serving
- Cross-node model transfer and replication
- Support for multiple model frameworks (vLLM, Transformers, TGI)
- Real-time model status tracking
- Distributed data chunk management
- Automatic data migration between devices
- Reference counting and garbage collection
- Transparent cross-node data access
- Support for model inputs/outputs chaining
- Asynchronous task execution with priorities
- Dependency management and resource conflict detection
- Parallel execution of independent tasks
- Automatic retry with exponential backoff
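The retry behavior in the list above is not specified further in this README. As a rough illustration of what exponential backoff generally looks like, the sketch below computes a delay per retry and wraps a callable with it. The names `backoff_delays` and `retry`, the base delay, the growth factor, and the cap are all hypothetical and are not taken from gswarm's code.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0,
                   cap: float = 60.0) -> Iterator[float]:
    """Yield one delay per retry: base, base*factor, ... never above `cap`."""
    for attempt in range(retries):
        yield min(cap, base * factor ** attempt)


def retry(task: Callable[[], T], retries: int = 3, base: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `task`; on failure wait with growing delays, up to `retries` retries."""
    delays = list(backoff_delays(retries, base))
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            # Illustrative only: a real queue would likely retry just
            # transient errors rather than every exception.
            if attempt == retries:
                raise
            sleep(delays[attempt])
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. Production systems often add random jitter to the delays so that many clients do not retry in lockstep.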
- Python 3.8 or higher
- NVIDIA GPUs with installed drivers
- Network connectivity between cluster nodes
# Clone the repository
git clone https://github.com/yourusername/gswarm.git
cd gswarm
# Install the package
pip install .
# Start host with both profiling and model management
gswarm host start --port 8090 --http-port 8091 --model-port 9010
On each GPU machine:
# Connect client with resilient mode
gswarm client connect <host-ip>:8090 --resilient
# Start profiling
gswarm profiler start --name training_run
# Check status
gswarm profiler status
# Stop profiling
gswarm profiler stop --name training_run
# List available models
gswarm model list
# Download a model (on host node)
gswarm model download llama-7b --source huggingface --url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct --node node1 --type llm
# or use hf:// format
gswarm model download llama-7b --source hf://meta-llama/Llama-3.1-8B-Instruct --node node1 --type llm
# Download a model (on a client node; if --node is not specified, the model is downloaded to the local node)
gswarm model download llama-7b --source huggingface --url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct --type llm
# or use hf:// format
gswarm model download llama-7b --source hf://meta-llama/Llama-3.1-8B-Instruct --type llm
# Deploy model to GPU (on client node)
gswarm model move llama-7b --from disk --to gpu0
# When running on the host, you must specify the node ID
gswarm model move llama-7b --from disk --to gpu0 --node node1
# Start model serving (on client node)
# Each model type needs its own serving implementation, provided in model/instance/xxx.py,
# where xxx is the model type; the type selects the inference method.
gswarm model serve llama-7b --device gpu0 --port 8080
# When running on the host, you must specify the node ID
gswarm model serve llama-7b --device gpu0 --port 8080 --node node1
# Check model status
gswarm model status llama-7b
gswarm model status llama-7b --node node1
# Create data chunk
gswarm data create --source s3://bucket/data --device dram
# List data chunks
gswarm data list
# Transfer data to another node
gswarm data transfer chunk-123 --to node2:dram
- Host Node: Central coordinator
  - Model registry management
  - Task orchestration
  - Global resource tracking
  - API gateway
- Client Nodes: Worker nodes
  - Local model storage
  - Model serving
  - GPU profiling
  - Task execution
  - Data pool management
- Communication:
  - gRPC for high-performance metric streaming
  - HTTP REST API for control and management
  - WebSocket for real-time updates
Default ports used by gswarm:
- gRPC Server: 8090 (profiling metrics)
- HTTP API: 8091 (control panel)
- Model API: 9010 (model management)
- Model Services: 8080+ (dynamic allocation)
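When a client cannot reach the host, it can help to confirm that these ports actually accept TCP connections. The following is a minimal sketch using only the Python standard library. The `DEFAULT_PORTS` mapping copies the numbers listed above, while the helper names `port_is_open` and `check_host` are hypothetical and not part of gswarm.

```python
import socket
from typing import Dict

# Default ports as listed above. Model services start at 8080 and are
# allocated dynamically, so they are left out of this fixed mapping.
DEFAULT_PORTS: Dict[str, int] = {
    "grpc": 8090,
    "http_api": 8091,
    "model_api": 9010,
}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_host(host: str, ports: Dict[str, int] = DEFAULT_PORTS) -> Dict[str, bool]:
    """Map each service name to whether its port accepted a connection."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `check_host("<host-ip>")` from a client machine returns a dictionary such as `{"grpc": ..., "http_api": ..., "model_api": ...}`. A successful TCP connect only shows that something is listening on the port, not that it is a healthy gswarm service.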
# Host management
gswarm host start [--port PORT] [--http-port HTTP_PORT] [--model-port MODEL_PORT]
gswarm host stop
gswarm host status
# System overview
gswarm status # Overall system status
gswarm nodes # List all nodes
gswarm health # Health check
# Profiling operations
gswarm profiler start [--name NAME] [--freq FREQ]
gswarm profiler stop [--name NAME]
gswarm profiler status
gswarm profiler sessions # List all sessions
gswarm profiler recover # Recover crashed sessions
# Analysis
gswarm profiler analyze --data <file.json> --plot <output.pdf>
# Model management
gswarm model list [--location LOCATION]
gswarm model info <model_name>
gswarm model register <model_name> --type TYPE --source URL
# Model operations
gswarm model download <model_name> [--device DEVICE]
gswarm model move <model_name> --from SOURCE --to DEST [--keep-source]
gswarm model copy <model_name> --from SOURCE --to DEST
gswarm model delete <model_name> --device DEVICE
# Model serving
gswarm model serve <model_name> --device DEVICE [--port PORT]
gswarm model stop <model_name>
gswarm model services # List all running services
# Data pool management
gswarm data list [--device DEVICE]
gswarm data create --source SOURCE --device DEVICE
gswarm data info <chunk_id>
gswarm data move <chunk_id> --to DEVICE
gswarm data transfer <chunk_id> --to NODE:DEVICE
gswarm data delete <chunk_id>
# Task queue management
gswarm queue status
gswarm queue tasks [--status STATUS]
gswarm queue cancel <task_id>
gswarm queue history [--limit N]
# List models
GET /api/v1/models
# Get model info
GET /api/v1/models/{model_name}
# Register model
POST /api/v1/models
# Download model
POST /api/v1/models/{model_name}/download
# Move model
POST /api/v1/models/{model_name}/move
# Start serving
POST /api/v1/services
# Get service status
GET /api/v1/services/{service_id}/status
# List data chunks
GET /api/v1/data
# Create data chunk
POST /api/v1/data
# Get chunk info
GET /api/v1/data/{chunk_id}
# Move data
POST /api/v1/data/{chunk_id}/move
# Transfer data
POST /api/v1/data/{chunk_id}/transfer
# Get queue status
GET /api/v1/queue
# Get task details
GET /api/v1/queue/tasks/{task_id}
# Cancel task
POST /api/v1/queue/tasks/{task_id}/cancel
# Get history
GET /api/v1/queue/history
~/.gswarm/config.yaml
cluster:
  host: "master.cluster.local"
  port: 8090
profiling:
  default_frequency: 1000
  enable_bandwidth: true
  enable_nvlink: false
models:
  storage_path: "/data/models"
  cache_size: "100GB"
queue:
  max_concurrent_tasks: 4
  task_timeout: 3600
  retry_count: 3
nodes:
  - name: "node1"
    address: "192.168.1.101"
    capabilities:
      gpus: ["gpu0", "gpu1"]
      storage:
        disk: 1000000000000
        dram: 64000000000
  - name: "node2"
    address: "192.168.1.102"
    capabilities:
      gpus: ["gpu0"]
      storage:
        disk: 500000000000
        dram: 32000000000
name: "distributed-deployment"
description: "Deploy model across multiple nodes"
actions:
  # Download model to primary node
  - action_id: "download"
    action_type: "download"
    model_name: "llama-7b"
    target_device: "node1:disk"
  # Replicate to other nodes
  - action_id: "replicate_node2"
    action_type: "copy"
    model_name: "llama-7b"
    source_device: "node1:disk"
    target_device: "node2:disk"
    dependencies: ["download"]
  # Load models to GPUs
  - action_id: "load_gpu_node1"
    action_type: "move"
    model_name: "llama-7b"
    source_device: "node1:disk"
    target_device: "node1:gpu0"
    dependencies: ["download"]
  - action_id: "load_gpu_node2"
    action_type: "move"
    model_name: "llama-7b"
    source_device: "node2:disk"
    target_device: "node2:gpu0"
    dependencies: ["replicate_node2"]
  # Start services
  - action_id: "serve_node1"
    action_type: "serve"
    model_name: "llama-7b"
    device: "node1:gpu0"
    port: 8080
    dependencies: ["load_gpu_node1"]
  - action_id: "serve_node2"
    action_type: "serve"
    model_name: "llama-7b"
    device: "node2:gpu0"
    port: 8081
    dependencies: ["load_gpu_node2"]
name: "ml-pipeline"
description: "Process data through multiple models"
actions:
  # Prepare input data
  - action_id: "load_data"
    action_type: "data_create"
    source: "s3://bucket/input"
    target_device: "node1:dram"
  # First model processing
  - action_id: "model1_process"
    action_type: "inference"
    model_name: "preprocessor"
    input_data: "${load_data.chunk_id}"
    output_device: "node1:dram"
    dependencies: ["load_data"]
  # Transfer intermediate data
  - action_id: "transfer_data"
    action_type: "data_transfer"
    data_id: "${model1_process.output}"
    target_device: "node2:dram"
    dependencies: ["model1_process"]
  # Second model processing
  - action_id: "model2_process"
    action_type: "inference"
    model_name: "classifier"
    input_data: "${transfer_data.chunk_id}"
    output_device: "node2:dram"
    dependencies: ["transfer_data"]
# System health
gswarm health
# Node-specific health
gswarm node status node1
# Service health
gswarm model service-health llama-7b
Logs are stored in ~/.gswarm/logs/:
- host.log: Host node logs
- client-<node>.log: Client node logs
- profiler.log: Profiling session logs
- model.log: Model operation logs
- Connection Issues
  - Check firewall rules for ports 8090-8091, 9010-9011
  - Verify network connectivity between nodes
  - Use the --resilient flag for automatic reconnection
- Model Download Failures
  - Check internet connectivity
  - Verify HuggingFace token if needed
  - Check disk space on target device
- GPU Memory Issues
  - Monitor GPU memory with gswarm profiler
  - Use model quantization for large models
  - Distribute model across multiple GPUs
- Task Queue Blockage
  - Check task dependencies with gswarm queue tasks
  - Look for resource conflicts
  - Cancel stuck tasks with gswarm queue cancel
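For the task-queue case, one thing that can be checked offline is whether a workflow's `dependencies` entries are consistent. The sketch below takes actions shaped like the workflow examples in this README (dictionaries with `action_id` and an optional `dependencies` list) and reports a valid execution order, any dependencies that name no known action, and any actions that can never start because of a cycle or a missing dependency. It uses Kahn's algorithm. The function name `check_dependencies` is hypothetical, and this is not a description of how gswarm validates workflows internally.

```python
from collections import deque
from typing import Dict, List, Tuple


def check_dependencies(actions: List[dict]) -> Tuple[List[str], List[str], List[str]]:
    """Return (execution_order, unknown_dependencies, stuck_action_ids)."""
    deps: Dict[str, List[str]] = {
        a["action_id"]: list(a.get("dependencies", [])) for a in actions
    }
    # Dependencies that do not match any action_id in the workflow.
    unknown = sorted({d for ds in deps.values() for d in ds if d not in deps})

    # Number of unfinished dependencies per action. Unknown ones are counted
    # too, so an action waiting on a missing dependency stays stuck.
    remaining = {aid: len(set(ds)) for aid, ds in deps.items()}
    dependents: Dict[str, List[str]] = {}
    for aid, ds in deps.items():
        for d in set(ds):
            dependents.setdefault(d, []).append(aid)

    # Kahn's algorithm: repeatedly release actions with nothing left to wait on.
    ready = deque(aid for aid, n in remaining.items() if n == 0)
    order: List[str] = []
    while ready:
        aid = ready.popleft()
        order.append(aid)
        for child in dependents.get(aid, []):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    done = set(order)
    stuck = [aid for aid in deps if aid not in done]
    return order, unknown, stuck
```

A non-empty `stuck` list with an empty `unknown` list points to a dependency cycle. A non-empty `unknown` list usually means a misspelled `action_id`.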
If you're migrating from separate gswarm-profiler and gswarm-model:
- Back up existing data:
  cp -r ~/.gswarm_profiler_data ~/.gswarm_profiler_data.backup
  cp -r ~/.gswarm_model_data ~/.gswarm_model_data.backup
- Update CLI commands:
  - gsprof → gswarm profiler
  - gsmodel → gswarm model
- Update API endpoints:
  - Model APIs now use the /api/v1/ prefix
  - Same ports are maintained for compatibility
See the Migration Guide for detailed instructions.
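The CLI renames above can be applied to existing scripts mechanically. The sketch below rewrites `gsprof` and `gsmodel` invocations in a piece of text. It only swaps the command name and assumes that subcommands and flags carry over unchanged, which this README does not confirm, so the result should be checked against the Migration Guide. The function name `migrate_commands` is hypothetical.

```python
import re

# Old command name -> new command prefix, as listed in the migration notes.
RENAMES = {
    "gsprof": "gswarm profiler",
    "gsmodel": "gswarm model",
}

# Match the old names only as standalone words, so identifiers such as
# "gsprof_data" or "gsmodel-old" are left untouched.
_PATTERN = re.compile(
    r"(?<![\w-])(" + "|".join(map(re.escape, RENAMES)) + r")(?![\w-])"
)


def migrate_commands(text: str) -> str:
    """Replace standalone gsprof/gsmodel command names with their gswarm form."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], text)
```

For example, `migrate_commands("gsprof start --name run1")` yields `gswarm profiler start --name run1`. Note that a path such as `/usr/bin/gsprof` would also be rewritten, so review the output before saving it back to a script.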
# Run all tests
pytest
# Run specific test suite
pytest tests/test_profiler.py
pytest tests/test_model.py
pytest tests/test_queue.py
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
MIT License - see LICENSE file for details
- Built on nvitop for GPU monitoring
- Inspired by distributed computing frameworks
- Thanks to all contributors
- Kubernetes operator for cluster deployment
- Web UI for cluster management
- Advanced scheduling algorithms
- Model optimization toolkit
- Integration with popular ML frameworks
- Multi-cloud support
For more information, see the documentation.