diff --git a/README.md b/README.md
index eb5a2540c..6fccda4a5 100644
--- a/README.md
+++ b/README.md
@@ -113,13 +113,16 @@ Open http://localhost:7860 (Gradio) or http://localhost:8001 (API).
 ### 💡 Which Model Should I Choose?
-| Your GPU VRAM | Recommended LM Model | Backend | Notes |
-|---------------|---------------------|---------|-------|
-| **≤6GB** | None (DiT only) | — | LM disabled by default; INT8 quantization + full CPU offload |
-| **6-8GB** | `acestep-5Hz-lm-0.6B` | `pt` | Lightweight LM with PyTorch backend |
-| **8-16GB** | `acestep-5Hz-lm-0.6B` / `1.7B` | `vllm` | 0.6B for 8-12GB, 1.7B for 12-16GB |
-| **16-24GB** | `acestep-5Hz-lm-1.7B` | `vllm` | 4B available on 20GB+; no offload needed on 20GB+ |
-| **≥24GB** | `acestep-5Hz-lm-4B` | `vllm` | Best quality, all models fit without offload |
+| Your GPU VRAM | Recommended DiT | Recommended LM Model | Backend | Notes |
+|---------------|----------------|---------------------|---------|-------|
+| **≤6GB** | 2B turbo | None (DiT only) | — | LM disabled by default; INT8 quantization + full CPU offload |
+| **6-8GB** | 2B turbo | `acestep-5Hz-lm-0.6B` | `pt` | Lightweight LM with PyTorch backend |
+| **8-16GB** | 2B turbo/sft | `acestep-5Hz-lm-0.6B` / `1.7B` | `vllm` | 0.6B for 8-12GB, 1.7B for 12-16GB |
+| **16-20GB** | 2B sft or XL turbo | `acestep-5Hz-lm-1.7B` | `vllm` | XL requires CPU offload below 20GB |
+| **20-24GB** | XL turbo/sft | `acestep-5Hz-lm-1.7B` | `vllm` | XL fits without offload; 4B LM available |
+| **≥24GB** | XL sft (or xl-base for extract/lego/complete) | `acestep-5Hz-lm-4B` | `vllm` | Best quality, all models fit without offload |
+
+> **XL (4B) models** (`acestep-v15-xl-*`) offer higher audio quality with ~9GB VRAM for weights (vs ~4.7GB for 2B). They require ≥12GB VRAM (with offload + quantization) or ≥20GB (without offload). All LM models are fully compatible with XL.
 The UI automatically selects the best configuration for your GPU.
 All settings (LM model, backend, offloading, quantization) are tier-aware and pre-configured.
@@ -244,6 +247,16 @@ See also the **LoRA Training** tab in Gradio UI for one-click training, or [Grad
 | `acestep-v15-turbo` | ✅ | ✅ | ❌ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |
 | `acestep-v15-turbo-rl` | ✅ | ✅ | ✅ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | To be released |
+### XL (4B) DiT Models
+
+> XL models use a larger 4B-parameter DiT decoder (~9GB bf16) for higher audio quality. They require ≥12GB VRAM (with offload + quantization) or ≥20GB (without offload). All LM models are fully compatible.
+
+| DiT Model | Pre-Training | SFT | RL | CFG | Step | Refer audio | Text2Music | Cover | Repaint | Extract | Lego | Complete | Quality | Diversity | Fine-Tunability | Hugging Face |
+|-----------|:------------:|:---:|:--:|:---:|:----:|:-----------:|:----------:|:-----:|:-------:|:-------:|:----:|:--------:|:-------:|:---------:|:---------------:|--------------|
+| `acestep-v15-xl-base` | ✅ | ❌ | ❌ | ✅ | 50 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | High | High | Easy | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-base) |
+| `acestep-v15-xl-sft` | ✅ | ✅ | ❌ | ✅ | 50 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Easy | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-sft) |
+| `acestep-v15-xl-turbo` | ✅ | ✅ | ❌ | ❌ | 8 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | Very High | Medium | Medium | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-turbo) |
+
 ### LM Models
 | LM Model | Pretrain from | Pre-Training | SFT | RL | CoT metas | Query rewrite | Audio Understanding | Composition Capability | Copy Melody | Hugging Face |
diff --git a/docs/en/GPU_COMPATIBILITY.md b/docs/en/GPU_COMPATIBILITY.md
index cbe266ac7..c2de671c8 100644
--- a/docs/en/GPU_COMPATIBILITY.md
+++ b/docs/en/GPU_COMPATIBILITY.md
@@ -4,16 +4,18 @@ ACE-Step 1.5 automatically adapts to your GPU's available VRAM, adjusting genera
 ## GPU Tier Configuration
-| VRAM | Tier | LM Models | Recommended LM | Backend | Max Duration (LM / No LM) | Max Batch (LM / No LM) | Offload | Quantization |
-|------|------|-----------|-----------------|---------|----------------------------|-------------------------|---------|--------------|
-| ≤4GB | Tier 1 | None | — | pt | 4 min / 6 min | 1 / 1 | CPU + DiT | INT8 |
-| 4-6GB | Tier 2 | None | — | pt | 8 min / 10 min | 1 / 1 | CPU + DiT | INT8 |
-| 6-8GB | Tier 3 | 0.6B | 0.6B | pt | 8 min / 10 min | 2 / 2 | CPU + DiT | INT8 |
-| 8-12GB | Tier 4 | 0.6B | 0.6B | vllm | 8 min / 10 min | 2 / 4 | CPU + DiT | INT8 |
-| 12-16GB | Tier 5 | 0.6B, 1.7B | 1.7B | vllm | 8 min / 10 min | 4 / 4 | CPU | INT8 |
-| 16-20GB | Tier 6a | 0.6B, 1.7B | 1.7B | vllm | 8 min / 10 min | 4 / 8 | CPU | INT8 |
-| 20-24GB | Tier 6b | 0.6B, 1.7B, 4B | 1.7B | vllm | 8 min / 8 min | 8 / 8 | None | None |
-| ≥24GB | Unlimited | All (0.6B, 1.7B, 4B) | 4B | vllm | 10 min / 10 min | 8 / 8 | None | None |
+| VRAM | Tier | XL (4B) DiT | LM Models | Recommended LM | Backend | Max Duration (LM / No LM) | Max Batch (LM / No LM) | Offload | Quantization |
+|------|------|:-----------:|-----------|-----------------|---------|----------------------------|-------------------------|---------|--------------|
+| ≤4GB | Tier 1 | ❌ | None | — | pt | 4 min / 6 min | 1 / 1 | CPU + DiT | INT8 |
+| 4-6GB | Tier 2 | ❌ | None | — | pt | 8 min / 10 min | 1 / 1 | CPU + DiT | INT8 |
+| 6-8GB | Tier 3 | ❌ | 0.6B | 0.6B | pt | 8 min / 10 min | 2 / 2 | CPU + DiT | INT8 |
+| 8-12GB | Tier 4 | ❌ | 0.6B | 0.6B | vllm | 8 min / 10 min | 2 / 4 | CPU + DiT | INT8 |
+| 12-16GB | Tier 5 | ⚠️ | 0.6B, 1.7B | 1.7B | vllm | 8 min / 10 min | 4 / 4 | CPU | INT8 |
+| 16-20GB | Tier 6a | ✅ (offload) | 0.6B, 1.7B | 1.7B | vllm | 8 min / 10 min | 4 / 8 | CPU | INT8 |
+| 20-24GB | Tier 6b | ✅ | 0.6B, 1.7B, 4B | 1.7B | vllm | 8 min / 8 min | 8 / 8 | None | None |
+| ≥24GB | Unlimited | ✅ | All (0.6B, 1.7B, 4B) | 4B | vllm | 10 min / 10 min | 8 / 8 | None | None |
+
+> **XL (4B) DiT column**: ❌ = not supported, ⚠️ = marginal (offload + quantization required, reduced batch; works on 12-16GB with aggressive offload), ✅ (offload) = supported with CPU offload, ✅ = fully supported. XL models use ~9GB VRAM for weights (vs ~4.7GB for 2B). All LM models are compatible with XL.
 ### Column Descriptions
@@ -75,8 +77,8 @@ The detection happens at startup in `acestep/gpu_config.py` (`is_legacy_cuda_gpu
 1. **Very Low VRAM (≤6GB)**: Use DiT-only mode without LM initialization. INT8 quantization and full CPU offload are mandatory. VAE decode may fall back to CPU automatically.
 2. **Low VRAM (6-8GB)**: The 0.6B LM model can be used with `pt` backend. Keep offload enabled.
 3. **Medium VRAM (8-16GB)**: Use the 0.6B or 1.7B LM model. `vllm` backend works well on Tier 4+.
-4. **High VRAM (16-24GB)**: Enable larger LM models (1.7B recommended). Quantization becomes optional on 20GB+.
-5. **Very High VRAM (≥24GB)**: All models fit without offloading or quantization. Use 4B LM for best quality.
+4. **High VRAM (16-24GB)**: Enable larger LM models (1.7B recommended). Quantization becomes optional on 20GB+. XL (4B) DiT models are supported — with offload on 16GB, without offload on 20GB+.
+5. **Very High VRAM (≥24GB)**: All models fit without offloading or quantization. Use XL DiT + 4B LM for best quality.
 ## Debug Mode: Simulating Different GPU Configurations
diff --git a/docs/en/INSTALL.md b/docs/en/INSTALL.md
index c054838c8..668c5dd83 100644
--- a/docs/en/INSTALL.md
+++ b/docs/en/INSTALL.md
@@ -650,9 +650,14 @@ python -m acestep.model_downloader --all # Download all models
 # Main model (vae, Qwen3-Embedding-0.6B, acestep-v15-turbo, acestep-5Hz-lm-1.7B)
 huggingface-cli download ACE-Step/Ace-Step1.5 --local-dir ./checkpoints
-# Optional models
+# Optional LM models
 huggingface-cli download ACE-Step/acestep-5Hz-lm-0.6B --local-dir ./checkpoints/acestep-5Hz-lm-0.6B
 huggingface-cli download ACE-Step/acestep-5Hz-lm-4B --local-dir ./checkpoints/acestep-5Hz-lm-4B
+
+# XL (4B) DiT models - require ≥12GB VRAM (with offload)
+huggingface-cli download ACE-Step/acestep-v15-xl-base --local-dir ./checkpoints/acestep-v15-xl-base
+huggingface-cli download ACE-Step/acestep-v15-xl-sft --local-dir ./checkpoints/acestep-v15-xl-sft
+huggingface-cli download ACE-Step/acestep-v15-xl-turbo --local-dir ./checkpoints/acestep-v15-xl-turbo
 ```
 ### Available Models
@@ -667,6 +672,9 @@ huggingface-cli download ACE-Step/acestep-5Hz-lm-4B --local-dir ./checkpoints/ac
 | acestep-v15-turbo-shift1 | Turbo DiT with shift1 | [Link](https://huggingface.co/ACE-Step/acestep-v15-turbo-shift1) |
 | acestep-v15-turbo-shift3 | Turbo DiT with shift3 | [Link](https://huggingface.co/ACE-Step/acestep-v15-turbo-shift3) |
 | acestep-v15-turbo-continuous | Turbo DiT with continuous shift (1-5) | [Link](https://huggingface.co/ACE-Step/acestep-v15-turbo-continuous) |
+| **acestep-v15-xl-base** | XL (4B) Base DiT — higher quality, ≥12GB VRAM | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-base) |
+| **acestep-v15-xl-sft** | XL (4B) SFT DiT — higher quality, ≥12GB VRAM | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-sft) |
+| **acestep-v15-xl-turbo** | XL (4B) Turbo DiT — higher quality, ≥12GB VRAM | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-turbo) |
 ---
@@ -674,13 +682,14 @@ huggingface-cli download ACE-Step/acestep-5Hz-lm-4B --local-dir ./checkpoints/ac
 ACE-Step automatically adapts to your GPU's VRAM. The UI pre-configures all settings (LM model, backend, offloading, quantization) based on your detected GPU tier:
-| Your GPU VRAM | Recommended LM Model | Backend | Notes |
-|---------------|---------------------|---------|-------|
-| **≤6GB** | None (DiT only) | — | LM disabled by default; INT8 quantization + full CPU offload |
-| **6-8GB** | `acestep-5Hz-lm-0.6B` | `pt` | Lightweight LM with PyTorch backend |
-| **8-16GB** | `0.6B` / `1.7B` | `vllm` | 0.6B for 8-12GB, 1.7B for 12-16GB |
-| **16-24GB** | `acestep-5Hz-lm-1.7B` | `vllm` | 4B available on 20GB+; no offload on 20GB+ |
-| **≥24GB** | `acestep-5Hz-lm-4B` | `vllm` | Best quality, all models fit without offload |
+| Your GPU VRAM | Recommended DiT | Recommended LM Model | Backend | Notes |
+|---------------|----------------|---------------------|---------|-------|
+| **≤6GB** | 2B turbo | None (DiT only) | — | LM disabled; INT8 quantization + full CPU offload |
+| **6-8GB** | 2B turbo | `acestep-5Hz-lm-0.6B` | `pt` | Lightweight LM with PyTorch backend |
+| **8-16GB** | 2B turbo/sft | `0.6B` / `1.7B` | `vllm` | 0.6B for 8-12GB, 1.7B for 12-16GB |
+| **16-20GB** | 2B sft or XL turbo | `acestep-5Hz-lm-1.7B` | `vllm` | XL requires CPU offload below 20GB |
+| **20-24GB** | XL turbo/sft | `acestep-5Hz-lm-1.7B` | `vllm` | XL fits without offload; 4B LM available |
+| **≥24GB** | XL sft (or xl-base for extract/lego/complete) | `acestep-5Hz-lm-4B` | `vllm` | Best quality, all models fit without offload |
 > 📖 For detailed GPU compatibility information (tier table, duration limits, batch sizes, adaptive UI defaults, memory optimization), see [GPU Compatibility Guide](GPU_COMPATIBILITY.md).
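The `huggingface-cli` commands above can also be scripted. A minimal sketch using `huggingface_hub.snapshot_download` (the library behind the CLI); the helper name and directory layout are illustrative, mirroring the documented `./checkpoints/<model-name>` convention:

```python
# Scripted alternative to the `huggingface-cli download` commands above.
# Repo IDs and target directories mirror the documented checkpoint layout;
# treat this as a sketch, not an official downloader.
from pathlib import Path

XL_DIT_MODELS = [
    "acestep-v15-xl-base",   # all tasks incl. extract/lego/complete
    "acestep-v15-xl-sft",    # highest quality, tunable CFG
    "acestep-v15-xl-turbo",  # 8-step, fastest XL variant
]


def download_xl_models(checkpoints_dir: str = "./checkpoints") -> list[Path]:
    """Fetch each XL DiT checkpoint into <checkpoints_dir>/<model-name>."""
    # Imported lazily so the module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    targets = []
    for name in XL_DIT_MODELS:
        local_dir = Path(checkpoints_dir) / name
        snapshot_download(repo_id=f"ACE-Step/{name}", local_dir=str(local_dir))
        targets.append(local_dir)
    return targets
```

Calling `download_xl_models()` mirrors the three CLI commands above; pass a different `checkpoints_dir` to place the checkpoints elsewhere.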
diff --git a/docs/en/Tutorial.md b/docs/en/Tutorial.md
index 49d03a445..8cdef731a 100644
--- a/docs/en/Tutorial.md
+++ b/docs/en/Tutorial.md
@@ -188,7 +188,7 @@ Based on your hardware:
 With a planning scheme, you still need to choose an executor. DiT is the core of ACE-Step 1.5—it handles various tasks and decides how to interpret LM-generated codes.
-We've open-sourced **4 Turbo models**, **1 SFT model**, and **1 Base model**.
+We've open-sourced **4 Turbo models**, **1 SFT model**, and **1 Base model** — plus their **XL (4B)** counterparts for higher audio quality.
 #### Turbo Series (Recommended for Daily Use)
@@ -250,6 +250,18 @@ This greatly expands **customization and playability**—train a model unique to
 > For the detailed LoRA training guide, see the [LoRA Training Tutorial](./LoRA_Training_Tutorial.md). You can also use the "LoRA Training" tab in Gradio UI for one-click training.
+
+#### XL (4B) Models
+
+XL models use a larger 4B-parameter DiT decoder for higher audio quality. They come in the same three variants (base, sft, turbo) and behave identically — just with better generation quality. **All LM models (0.6B / 1.7B / 4B) are fully compatible with XL.**
+
+**Requirements:** XL models need ~9GB VRAM for weights (vs ~4.7GB for 2B). Minimum 12GB VRAM with offload + quantization, 20GB+ recommended.
+
+| XL Model (full name) | Steps | CFG | VRAM | Notes |
+|----------------------|:-----:|:---:|:----:|-------|
+| `acestep-v15-xl-turbo` | 8 | ❌ | ≥12GB | Fast + high quality, best daily driver on 20GB+ GPUs |
+| `acestep-v15-xl-sft` | 50 | ✅ | ≥12GB | Highest quality, tunable CFG |
+| `acestep-v15-xl-base` | 50 | ✅ | ≥12GB | All tasks including extract/lego/complete |
+
 #### DiT Selection Summary
 | Model | Steps | CFG | Speed | Exclusive Tasks | Recommended Scenarios |
@@ -257,6 +269,9 @@ This greatly expands **customization and playability**—train a model unique to
 | `turbo` (default) | 8 | ❌ | ⚡⚡⚡ | — | Daily use, rapid iteration |
 | `sft` | 50 | ✅ | ⚡ | — | Pursuing details, like tuning |
 | `base` | 50 | ✅ | ⚡ | extract, lego, complete | Special tasks, large-scale fine-tuning |
+| **`acestep-v15-xl-turbo`** | 8 | ❌ | ⚡⚡ | — | Best daily driver on 20GB+ GPUs |
+| **`acestep-v15-xl-sft`** | 50 | ✅ | ⚡ | — | Highest quality, ≥12GB VRAM |
+| **`acestep-v15-xl-base`** | 50 | ✅ | ⚡ | extract, lego, complete | All tasks with higher quality, ≥12GB VRAM |
 ### Combination Strategies
diff --git a/docs/ja/GPU_COMPATIBILITY.md b/docs/ja/GPU_COMPATIBILITY.md
index e1862296b..ac4c9e14f 100644
--- a/docs/ja/GPU_COMPATIBILITY.md
+++ b/docs/ja/GPU_COMPATIBILITY.md
@@ -4,16 +4,18 @@ ACE-Step 1.5 は GPU の VRAM に自動的に適応し、生成時間の制限
 ## GPU ティア構成
-| VRAM | ティア | LM モデル | 推奨 LM | バックエンド | 最大時間 (LM有 / LM無) | 最大バッチ (LM有 / LM無) | オフロード | 量子化 |
-|------|--------|-----------|---------|-------------|------------------------|--------------------------|------------|--------|
-| ≤4GB | Tier 1 | なし | — | pt | 4分 / 6分 | 1 / 1 | CPU + DiT | INT8 |
-| 4-6GB | Tier 2 | なし | — | pt | 8分 / 10分 | 1 / 1 | CPU + DiT | INT8 |
-| 6-8GB | Tier 3 | 0.6B | 0.6B | pt | 8分 / 10分 | 2 / 2 | CPU + DiT | INT8 |
-| 8-12GB | Tier 4 | 0.6B | 0.6B | vllm | 8分 / 10分 | 2 / 4 | CPU + DiT | INT8 |
-| 12-16GB | Tier 5 | 0.6B, 1.7B | 1.7B | vllm | 8分 / 10分 | 4 / 4 | CPU | INT8 |
-| 16-20GB | Tier 6a | 0.6B, 1.7B | 1.7B | vllm | 8分 / 10分 | 4 / 8 | CPU | INT8 |
-| 20-24GB | Tier 6b | 0.6B, 1.7B, 4B | 1.7B | vllm | 8分 / 8分 | 8 / 8 | なし | なし |
-| ≥24GB | 無制限 | 全モデル (0.6B, 1.7B, 4B) | 4B | vllm | 10分 / 10分 | 8 / 8 | なし | なし |
+| VRAM | ティア | XL (4B) DiT | LM モデル | 推奨 LM | バックエンド | 最大時間 (LM有 / LM無) | 最大バッチ (LM有 / LM無) | オフロード | 量子化 |
+|------|--------|:-----------:|-----------|---------|-------------|------------------------|--------------------------|------------|--------|
+| ≤4GB | Tier 1 | ❌ | なし | — | pt | 4分 / 6分 | 1 / 1 | CPU + DiT | INT8 |
+| 4-6GB | Tier 2 | ❌ | なし | — | pt | 8分 / 10分 | 1 / 1 | CPU + DiT | INT8 |
+| 6-8GB | Tier 3 | ❌ | 0.6B | 0.6B | pt | 8分 / 10分 | 2 / 2 | CPU + DiT | INT8 |
+| 8-12GB | Tier 4 | ❌ | 0.6B | 0.6B | vllm | 8分 / 10分 | 2 / 4 | CPU + DiT | INT8 |
+| 12-16GB | Tier 5 | ⚠️ | 0.6B, 1.7B | 1.7B | vllm | 8分 / 10分 | 4 / 4 | CPU | INT8 |
+| 16-20GB | Tier 6a | ✅ (オフロード) | 0.6B, 1.7B | 1.7B | vllm | 8分 / 10分 | 4 / 8 | CPU | INT8 |
+| 20-24GB | Tier 6b | ✅ | 0.6B, 1.7B, 4B | 1.7B | vllm | 8分 / 8分 | 8 / 8 | なし | なし |
+| ≥24GB | 無制限 | ✅ | 全モデル (0.6B, 1.7B, 4B) | 4B | vllm | 10分 / 10分 | 8 / 8 | なし | なし |
+
+> **XL (4B) DiT 列**: ❌ = 非対応, ⚠️ = 限定的(オフロード + 量子化が必要、12-16GBでは積極的なオフロードで動作可能), ✅ (オフロード) = CPUオフロードで対応, ✅ = 完全対応。XLモデルの重みは約9GB(bf16)、2Bは約4.7GB。すべてのLMモデルがXLと互換性があります。
 ### 列の説明
diff --git a/docs/ja/INSTALL.md b/docs/ja/INSTALL.md
index 0e647b4f1..e7f9953f5 100644
--- a/docs/ja/INSTALL.md
+++ b/docs/ja/INSTALL.md
@@ -531,13 +531,14 @@ huggingface-cli download ACE-Step/acestep-5Hz-lm-4B --local-dir ./checkpoints/ac
 ACE-Step は GPU の VRAM に自動適応します。UI は検出された GPU ティアに基づいてすべての設定(LM モデル、バックエンド、オフロード、量子化)を事前構成します:
-| GPU VRAM | 推奨 LM モデル | バックエンド | 備考 |
-|----------|---------------|-------------|------|
-| **≤6GB** | なし(DiTのみ) | — | LM はデフォルトで無効;INT8 量子化 + 完全 CPU オフロード |
-| **6-8GB** | `acestep-5Hz-lm-0.6B` | `pt` | 軽量 LM、PyTorch バックエンド |
-| **8-16GB** | `0.6B` / `1.7B` | `vllm` | 8-12GB は 0.6B、12-16GB は 1.7B |
-| **16-24GB** | `acestep-5Hz-lm-1.7B` | `vllm` | 20GB+ で 4B 利用可能;20GB+ でオフロード不要 |
-| **≥24GB** | `acestep-5Hz-lm-4B` | `vllm` | 最高品質、すべてのモデルがオフロードなしで動作 |
+| GPU VRAM | 推奨 DiT | 推奨 LM モデル | バックエンド | 備考 |
+|----------|---------|---------------|-------------|------|
+| **≤6GB** | 2B turbo | なし(DiTのみ) | — | LM はデフォルトで無効;INT8 量子化 + 完全 CPU オフロード |
+| **6-8GB** | 2B turbo | `acestep-5Hz-lm-0.6B` | `pt` | 軽量 LM、PyTorch バックエンド |
+| **8-16GB** | 2B turbo/sft | `0.6B` / `1.7B` | `vllm` | 8-12GB は 0.6B、12-16GB は 1.7B |
+| **16-20GB** | 2B sft または XL turbo | `acestep-5Hz-lm-1.7B` | `vllm` | XL は 20GB 未満で CPU オフロードが必要 |
+| **20-24GB** | XL turbo/sft | `acestep-5Hz-lm-1.7B` | `vllm` | XL はオフロード不要;4B LM 利用可能 |
+| **≥24GB** | XL sft(extract/lego/complete には xl-base) | `acestep-5Hz-lm-4B` | `vllm` | 最高品質、すべてのモデルがオフロードなしで動作 |
 > 📖 GPU 互換性の詳細(ティアテーブル、時間制限、バッチサイズ、アダプティブ UI デフォルト、メモリ最適化)は [GPU 互換性ガイド](GPU_COMPATIBILITY.md) を参照してください。
diff --git a/docs/ja/Tutorial.md b/docs/ja/Tutorial.md
index daa2082f9..e1f17b175 100644
--- a/docs/ja/Tutorial.md
+++ b/docs/ja/Tutorial.md
@@ -250,6 +250,18 @@ Baseは**タスクの集大成**で、SFTとTurboを超える3つの独占タス
 > LoRA訓練の詳細ガイドについては、[LoRA トレーニングチュートリアル](./LoRA_Training_Tutorial.md)を参照してください。Gradio UIの「LoRA Training」タブからワンクリックで訓練することもできます。
+
+#### XL (4B) モデル
+
+XLモデルは、より大きな4BパラメータのDiTデコーダを使用し、生成品質が向上します。標準モデルと同じ3つのバリアント(base、sft、turbo)で、動作は完全に同じです。**すべてのLMモデル(0.6B / 1.7B / 4B)がXLと完全互換です。**
+
+**必要条件:** XLモデルの重みは約9GB(bf16)、2Bは約4.7GB。最小12GB VRAM(オフロード + 量子化が必要)、20GB+推奨。
+
+| XLモデル(フルネーム) | ステップ | CFG | VRAM | 備考 |
+|----------------------|:-------:|:---:|:----:|------|
+| `acestep-v15-xl-turbo` | 8 | ❌ | ≥12GB | 高速 + 高品質、20GB+ GPUの最適な日常選択 |
+| `acestep-v15-xl-sft` | 50 | ✅ | ≥12GB | 最高品質、CFG調整可能 |
+| `acestep-v15-xl-base` | 50 | ✅ | ≥12GB | extract/lego/completeを含む全タスク対応 |
+
 #### DiT選択のまとめ
 | モデル | ステップ | CFG | 速度 | 独占タスク | 推奨シナリオ |
@@ -257,6 +269,9 @@ Baseは**タスクの集大成**で、SFTとTurboを超える3つの独占タス
 | `turbo`(デフォルト) | 8 | ❌ | ⚡⚡⚡ | — | 日常使用、迅速な反復 |
 | `sft` | 50 | ✅ | ⚡ | — | 詳細を追求、調整が好き |
 | `base` | 50 | ✅ | ⚡ | extract, lego, complete | 特殊タスク、大規模微調整 |
+| **`acestep-v15-xl-turbo`** | 8 | ❌ | ⚡⚡ | — | 20GB+ GPUの最適な日常選択 |
+| **`acestep-v15-xl-sft`** | 50 | ✅ | ⚡ | — | 最高品質、≥12GB VRAM |
+| **`acestep-v15-xl-base`** | 50 | ✅ | ⚡ | extract, lego, complete | より高品質な全タスク対応、≥12GB VRAM |
 ### 組み合わせ戦略
diff --git a/docs/ko/GPU_COMPATIBILITY.md b/docs/ko/GPU_COMPATIBILITY.md
index 0ceb9b6fe..82eb581f9 100644
--- a/docs/ko/GPU_COMPATIBILITY.md
+++ b/docs/ko/GPU_COMPATIBILITY.md
@@ -4,16 +4,18 @@ ACE-Step 1.5는 GPU의 사용 가능한 VRAM에 자동으로 적응하여 생성
 ## GPU 티어 구성
-| VRAM | 티어 | LM 모델 | 추천 LM | 백엔드 | 최대 길이 (LM 사용 / 미사용) | 최대 배치 (LM 사용 / 미사용) | 오프로드 | 양자화 |
-|------|------|---------|---------|--------|------------------------------|------------------------------|----------|--------|
-| ≤4GB | 티어 1 | 없음 | — | pt | 4분 / 6분 | 1 / 1 | CPU + DiT | INT8 |
-| 4-6GB | 티어 2 | 없음 | — | pt | 8분 / 10분 | 1 / 1 | CPU + DiT | INT8 |
-| 6-8GB | 티어 3 | 0.6B | 0.6B | pt | 8분 / 10분 | 1 / 2 | CPU + DiT | INT8 |
-| 8-12GB | 티어 4 | 0.6B | 0.6B | vllm | 8분 / 10분 | 2 / 4 | CPU + DiT | INT8 |
-| 12-16GB | 티어 5 | 0.6B, 1.7B | 1.7B | vllm | 8분 / 10분 | 2 / 4 | CPU | INT8 |
-| 16-20GB | 티어 6a | 0.6B, 1.7B | 1.7B | vllm | 8분 / 10분 | 4 / 8 | CPU | INT8 |
-| 20-24GB | 티어 6b | 0.6B, 1.7B, 4B | 1.7B | vllm | 8분 / 8분 | 4 / 8 | 없음 | 없음 |
-| ≥24GB | 제한 없음 | 전체 (0.6B, 1.7B, 4B) | 4B | vllm | 10분 / 10분 | 8 / 8 | 없음 | 없음 |
+| VRAM | 티어 | XL (4B) DiT | LM 모델 | 추천 LM | 백엔드 | 최대 길이 (LM 사용 / 미사용) | 최대 배치 (LM 사용 / 미사용) | 오프로드 | 양자화 |
+|------|------|:-----------:|---------|---------|--------|------------------------------|------------------------------|----------|--------|
+| ≤4GB | 티어 1 | ❌ | 없음 | — | pt | 4분 / 6분 | 1 / 1 | CPU + DiT | INT8 |
+| 4-6GB | 티어 2 | ❌ | 없음 | — | pt | 8분 / 10분 | 1 / 1 | CPU + DiT | INT8 |
+| 6-8GB | 티어 3 | ❌ | 0.6B | 0.6B | pt | 8분 / 10분 | 2 / 2 | CPU + DiT | INT8 |
+| 8-12GB | 티어 4 | ❌ | 0.6B | 0.6B | vllm | 8분 / 10분 | 2 / 4 | CPU + DiT | INT8 |
+| 12-16GB | 티어 5 | ⚠️ | 0.6B, 1.7B | 1.7B | vllm | 8분 / 10분 | 4 / 4 | CPU | INT8 |
+| 16-20GB | 티어 6a | ✅ (오프로드) | 0.6B, 1.7B | 1.7B | vllm | 8분 / 10분 | 4 / 8 | CPU | INT8 |
+| 20-24GB | 티어 6b | ✅ | 0.6B, 1.7B, 4B | 1.7B | vllm | 8분 / 8분 | 8 / 8 | 없음 | 없음 |
+| ≥24GB | 제한 없음 | ✅ | 전체 (0.6B, 1.7B, 4B) | 4B | vllm | 10분 / 10분 | 8 / 8 | 없음 | 없음 |
+
+> **XL (4B) DiT 열**: ❌ = 미지원, ⚠️ = 제한적 (오프로드 + 양자화 필요, 12-16GB에서 적극적 오프로드로 동작 가능), ✅ (오프로드) = CPU 오프로드로 지원, ✅ = 완전 지원. XL 모델 가중치 약 9GB (bf16), 2B는 약 4.7GB. 모든 LM 모델이 XL과 호환됩니다.
 ### 열 설명
diff --git a/docs/ko/Tutorial.md b/docs/ko/Tutorial.md
index b1336f0df..047ce47eb 100644
--- a/docs/ko/Tutorial.md
+++ b/docs/ko/Tutorial.md
@@ -231,6 +231,29 @@ Base 모델은 **모든 작업의 마스터**이며, SFT나 Turbo에는 없는 3
 또한 Base 모델은 **가소성이 가장 높습니다.** 대규모 미세 조정이 필요한 경우 Base 모델로 실험을 시작하는 것이 좋습니다.
+
+#### XL (4B) 모델
+
+XL 모델은 더 큰 4B 파라미터 DiT 디코더를 사용하여 생성 품질이 향상됩니다. 표준 모델과 동일한 세 가지 변형(base, sft, turbo)으로 제공되며, 동작은 완전히 동일합니다. **모든 LM 모델(0.6B / 1.7B / 4B)이 XL과 완전 호환됩니다.**
+
+**하드웨어 요구사항:** XL 모델 가중치 약 9GB (bf16), 2B는 약 4.7GB. 최소 12GB VRAM (오프로드 + 양자화 필요), 20GB+ 권장.
+
+| XL 모델 (전체 이름) | 스텝 | CFG | VRAM | 비고 |
+|--------------------|:----:|:---:|:----:|------|
+| `acestep-v15-xl-turbo` | 8 | ❌ | ≥12GB | 빠르고 고품질, 20GB+ GPU의 최적 일상 선택 |
+| `acestep-v15-xl-sft` | 50 | ✅ | ≥12GB | 최고 품질, CFG 조정 가능 |
+| `acestep-v15-xl-base` | 50 | ✅ | ≥12GB | extract/lego/complete 포함 모든 작업 지원 |
+
+#### DiT 선택 요약
+
+| 모델 | 스텝 | CFG | 속도 | 독점 작업 | 추천 시나리오 |
+|------|:----:|:---:|:----:|----------|------------|
+| `turbo` (기본) | 8 | ❌ | ⚡⚡⚡ | — | 일상 사용, 빠른 반복 |
+| `sft` | 50 | ✅ | ⚡ | — | 디테일 추구, 조정 선호 |
+| `base` | 50 | ✅ | ⚡ | extract, lego, complete | 특수 작업, 대규모 미세 조정 |
+| **`acestep-v15-xl-turbo`** | 8 | ❌ | ⚡⚡ | — | 20GB+ GPU의 최적 일상 선택 |
+| **`acestep-v15-xl-sft`** | 50 | ✅ | ⚡ | — | 최고 품질, ≥12GB VRAM |
+| **`acestep-v15-xl-base`** | 50 | ✅ | ⚡ | extract, lego, complete | 더 높은 품질의 전체 작업 지원, ≥12GB VRAM |
+
 ---
 ## 코끼리 가이드하기: 무엇을 제어할 수 있나요?
diff --git a/docs/zh/GPU_COMPATIBILITY.md b/docs/zh/GPU_COMPATIBILITY.md
index cb448b181..02254d6a5 100644
--- a/docs/zh/GPU_COMPATIBILITY.md
+++ b/docs/zh/GPU_COMPATIBILITY.md
@@ -4,16 +4,18 @@ ACE-Step 1.5 会自动适配您的 GPU 显存大小,相应调整生成时长
 ## GPU 分级配置
-| 显存 | 等级 | LM 模型 | 推荐 LM | 后端 | 最大时长 (有LM / 无LM) | 最大批次 (有LM / 无LM) | 卸载策略 | 量化 |
-|------|------|---------|---------|------|------------------------|------------------------|----------|------|
-| ≤4GB | Tier 1 | 无 | — | pt | 4分 / 6分 | 1 / 1 | CPU + DiT | INT8 |
-| 4-6GB | Tier 2 | 无 | — | pt | 8分 / 10分 | 1 / 1 | CPU + DiT | INT8 |
-| 6-8GB | Tier 3 | 0.6B | 0.6B | pt | 8分 / 10分 | 2 / 2 | CPU + DiT | INT8 |
-| 8-12GB | Tier 4 | 0.6B | 0.6B | vllm | 8分 / 10分 | 2 / 4 | CPU + DiT | INT8 |
-| 12-16GB | Tier 5 | 0.6B, 1.7B | 1.7B | vllm | 8分 / 10分 | 4 / 4 | CPU | INT8 |
-| 16-20GB | Tier 6a | 0.6B, 1.7B | 1.7B | vllm | 8分 / 10分 | 4 / 8 | CPU | INT8 |
-| 20-24GB | Tier 6b | 0.6B, 1.7B, 4B | 1.7B | vllm | 8分 / 8分 | 8 / 8 | 无 | 无 |
-| ≥24GB | 无限制 | 全部 (0.6B, 1.7B, 4B) | 4B | vllm | 10分 / 10分 | 8 / 8 | 无 | 无 |
+| 显存 | 等级 | XL (4B) DiT | LM 模型 | 推荐 LM | 后端 | 最大时长 (有LM / 无LM) | 最大批次 (有LM / 无LM) | 卸载策略 | 量化 |
+|------|------|:-----------:|---------|---------|------|------------------------|------------------------|----------|------|
+| ≤4GB | Tier 1 | ❌ | 无 | — | pt | 4分 / 6分 | 1 / 1 | CPU + DiT | INT8 |
+| 4-6GB | Tier 2 | ❌ | 无 | — | pt | 8分 / 10分 | 1 / 1 | CPU + DiT | INT8 |
+| 6-8GB | Tier 3 | ❌ | 0.6B | 0.6B | pt | 8分 / 10分 | 2 / 2 | CPU + DiT | INT8 |
+| 8-12GB | Tier 4 | ❌ | 0.6B | 0.6B | vllm | 8分 / 10分 | 2 / 4 | CPU + DiT | INT8 |
+| 12-16GB | Tier 5 | ⚠️ | 0.6B, 1.7B | 1.7B | vllm | 8分 / 10分 | 4 / 4 | CPU | INT8 |
+| 16-20GB | Tier 6a | ✅ (卸载) | 0.6B, 1.7B | 1.7B | vllm | 8分 / 10分 | 4 / 8 | CPU | INT8 |
+| 20-24GB | Tier 6b | ✅ | 0.6B, 1.7B, 4B | 1.7B | vllm | 8分 / 8分 | 8 / 8 | 无 | 无 |
+| ≥24GB | 无限制 | ✅ | 全部 (0.6B, 1.7B, 4B) | 4B | vllm | 10分 / 10分 | 8 / 8 | 无 | 无 |
+
+> **XL (4B) DiT 列**: ❌ = 不支持, ⚠️ = 勉强可用(需卸载 + 量化,12-16GB 可通过激进卸载运行),✅ (卸载) = 需 CPU 卸载,✅ = 完全支持。XL 模型权重约 9GB(bf16),2B 约 4.7GB。所有 LM 模型均兼容 XL。
 ### 列说明
diff --git a/docs/zh/INSTALL.md b/docs/zh/INSTALL.md
index 6ca1d4a11..8d266e26a 100644
--- a/docs/zh/INSTALL.md
+++ b/docs/zh/INSTALL.md
@@ -531,13 +531,14 @@ huggingface-cli download ACE-Step/acestep-5Hz-lm-4B --local-dir ./checkpoints/ac
 ACE-Step 会自动适配你的 GPU 显存。UI 会根据检测到的 GPU 等级预配置所有设置(LM 模型、后端、卸载、量化):
-| GPU 显存 | 推荐 LM 模型 | 后端 | 说明 |
-|----------|--------------|------|------|
-| **≤6GB** | 无(仅 DiT) | — | 默认禁用 LM;INT8 量化 + 完全 CPU 卸载 |
-| **6-8GB** | `acestep-5Hz-lm-0.6B` | `pt` | 轻量 LM,PyTorch 后端 |
-| **8-16GB** | `0.6B` / `1.7B` | `vllm` | 8-12GB 用 0.6B,12-16GB 用 1.7B |
-| **16-24GB** | `acestep-5Hz-lm-1.7B` | `vllm` | 20GB+ 可用 4B;20GB+ 无需卸载 |
-| **≥24GB** | `acestep-5Hz-lm-4B` | `vllm` | 最佳质量,所有模型无需卸载 |
+| GPU 显存 | 推荐 DiT | 推荐 LM 模型 | 后端 | 说明 |
+|----------|---------|--------------|------|------|
+| **≤6GB** | 2B turbo | 无(仅 DiT) | — | 默认禁用 LM;INT8 量化 + 完全 CPU 卸载 |
+| **6-8GB** | 2B turbo | `acestep-5Hz-lm-0.6B` | `pt` | 轻量 LM,PyTorch 后端 |
+| **8-16GB** | 2B turbo/sft | `0.6B` / `1.7B` | `vllm` | 8-12GB 用 0.6B,12-16GB 用 1.7B |
+| **16-20GB** | 2B sft 或 XL turbo | `acestep-5Hz-lm-1.7B` | `vllm` | XL 在 20GB 以下需要 CPU 卸载 |
+| **20-24GB** | XL turbo/sft | `acestep-5Hz-lm-1.7B` | `vllm` | XL 无需卸载;可用 4B LM |
+| **≥24GB** | XL sft(或 xl-base 用于 extract/lego/complete) | `acestep-5Hz-lm-4B` | `vllm` | 最佳质量,所有模型无需卸载 |
 > 📖 详细 GPU 兼容性信息(等级表、时长限制、批量大小、自适应 UI 默认设置、显存优化),请参阅 [GPU 兼容性指南](GPU_COMPATIBILITY.md)。
diff --git a/docs/zh/Tutorial.md b/docs/zh/Tutorial.md
index 723b13cd3..cbac383b6 100644
--- a/docs/zh/Tutorial.md
+++ b/docs/zh/Tutorial.md
@@ -245,6 +245,18 @@ Base 是**任务的集大成者**,比 SFT 和 Turbo 多出三个独占任务
 > 关于 LoRA 训练的详细指南,请参阅 [LoRA 训练教程](./LoRA_Training_Tutorial.md)。你也可以使用 Gradio UI 中的「LoRA Training」标签页进行一键训练。
+
+#### XL (4B) 模型
+
+XL 模型使用更大的 4B 参数 DiT 解码器,生成质量更高。提供与标准模型相同的三个变体(base、sft、turbo),行为完全一致。**所有 LM 模型(0.6B / 1.7B / 4B)均完全兼容 XL。**
+
+**硬件要求:** XL 模型权重约 9GB(bf16),2B 约 4.7GB。最低 12GB 显存(需卸载 + 量化),推荐 20GB+。
+
+| XL 模型(全名) | 步数 | CFG | 显存 | 说明 |
+|-----------------|:----:|:---:|:----:|------|
+| `acestep-v15-xl-turbo` | 8 | ❌ | ≥12GB | 快速 + 高质量,20GB+ GPU 的最佳日常选择 |
+| `acestep-v15-xl-sft` | 50 | ✅ | ≥12GB | 最高质量,可调 CFG |
+| `acestep-v15-xl-base` | 50 | ✅ | ≥12GB | 支持所有任务(extract/lego/complete) |
+
 #### DiT 选择总结
 | 模型 | 步数 | CFG | 速度 | 独占任务 | 推荐场景 |
@@ -252,6 +264,9 @@ Base 是**任务的集大成者**,比 SFT 和 Turbo 多出三个独占任务
 | `turbo`(默认) | 8 | ❌ | ⚡⚡⚡ | — | 日常使用,快速迭代 |
 | `sft` | 50 | ✅ | ⚡ | — | 追求细节,喜欢调参 |
 | `base` | 50 | ✅ | ⚡ | extract, lego, complete | 特殊任务,大规模微调 |
+| **`acestep-v15-xl-turbo`** | 8 | ❌ | ⚡⚡ | — | 20GB+ GPU 的最佳日常选择 |
+| **`acestep-v15-xl-sft`** | 50 | ✅ | ⚡ | — | 最高质量,≥12GB 显存 |
+| **`acestep-v15-xl-base`** | 50 | ✅ | ⚡ | extract, lego, complete | 更高质量的全任务支持,≥12GB 显存 |
 ### 组合搭配