From e4858990d84297bd4f2e17c5f409891ad4e55324 Mon Sep 17 00:00:00 2001
From: xiaompen <xiaompen@smci355-ccs-aus-n13-13.cs-aus.dcgpu>
Date: Thu, 22 Jan 2026 21:34:58 -0500
Subject: [PATCH 1/5] docs: add Primus CLI unified entry ROCm tech blog

---
 .../primus_cli_unified_entry_rocm.md.md       | 168 ++++++++++++++++++
 1 file changed, 168 insertions(+)
 create mode 100644 docs/tech_blogs/primus_cli_unified_entry_rocm.md.md
diff --git a/docs/tech_blogs/primus_cli_unified_entry_rocm.md.md b/docs/tech_blogs/primus_cli_unified_entry_rocm.md.md
new file mode 100644
index 000000000..1c801e0b9
--- /dev/null
+++ b/docs/tech_blogs/primus_cli_unified_entry_rocm.md.md
@@ -0,0 +1,168 @@
+<!---
+Copyright (c) 2025 Advanced Micro Devices, Inc. (AMD)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+--->
+
+## Introduction
+
+Training large-scale models on modern GPU clusters involves far more than model code alone.
+Engineers routinely deal with environment validation, performance benchmarking, distributed launch configuration, and multi-node debugging—often before a single training step is executed.
+
+As these workflows grow more complex, a recurring challenge emerges: **the lack of a consistent entry point**.
+Different scripts are used for training, benchmarks, and diagnostics; execution semantics vary between local runs, containers, and Slurm jobs; and small environment differences can lead to non-reproducible behavior.
+
+Primus CLI was designed to address this problem by providing **a unified entry point for training-related workflows on ROCm**, while preserving consistent execution semantics across environments.
+
+This post focuses on the design ideas behind Primus CLI and how they translate into practical benefits for large-scale training workflows.
+
+---
+
+## The problem: fragmented execution paths
+
+In many training setups, the workflow evolves organically:
+
+- one script for local debugging
+- another wrapper for container execution
+- a separate Slurm launcher for multi-node runs
+- additional scripts for benchmarks and environment checks
+
+Over time, these scripts diverge. Flags drift, environment variables differ, and assumptions become implicit.
+The result is a workflow that works, but is difficult to reason about, reproduce, or extend.
+
+Primus CLI approaches this problem from a different angle: instead of adding another wrapper, it treats training, benchmarking, and diagnostics as variations of the same execution model.
+
+---
+
+## A unified entry, not a monolithic tool
+
+On the surface, Primus CLI exposes a single command family with structured subcommands:
+
+```bash
+primus-cli <runtime> [runtime options] -- <task> [task options]
+```
+
+This structure allows different workflows—training, benchmarks, preflight checks—to share the same mental model, without forcing them into a single rigid pipeline.
+
+Examples:
+
+```bash
+# Training
+primus-cli direct -- train pretrain --config exp.yaml
+
+# Benchmarks
+primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192
+
+# Environment inspection
+primus-cli direct -- preflight --host --gpu --network
+```
+
+The goal is not to hide complexity, but to make it predictable.
+
+---
+
+## Why preserving the execution path matters
+
+In large-scale training systems, many failures and performance regressions are not caused by model code itself, but by subtle differences in how jobs are launched across environments.
+
+When local runs, container executions, and Slurm jobs follow different execution paths, debugging becomes guesswork. Performance numbers become difficult to compare, and fixes that work in one environment may not generalize to others.
+
+By preserving a single execution path and varying only the runtime preparation layer, Primus CLI makes training behavior easier to reason about and results easier to trust.
+
+---
+
+## Preserving execution semantics across environments
+
+A core design goal of Primus CLI is that the task after `--` means the same thing in every environment.
+
+Whether a job is launched locally, inside a container, or on a multi-node Slurm cluster, only the runtime preparation layer before `--` changes.
+
+```bash
+# Local
+primus-cli direct -- benchmark gemm -M 4096 -N 4096 -K 4096
+
+# Container
+primus-cli container --image rocm/primus:v25.10 -- benchmark gemm -M 4096 -N 4096 -K 4096
+
+# Slurm
+primus-cli slurm srun -N 2 -- benchmark gemm -M 16384 -N 16384 -K 16384
+```
+
+This consistency simplifies debugging and helps avoid the classic “works locally but not on the cluster” scenario.
+
+---
+
+## Designed for growth and experimentation
+
+Training workflows rarely stay static. New benchmarks are added, diagnostics evolve, and fine-tuning or post-training stages become part of the workflow.
+
+Primus CLI is designed to accommodate this evolution:
+
+- new tasks can be added as subcommands
+- benchmarks can grow independently
+- workflow-specific logic can be composed without rewriting launchers
+
+This modular approach allows the CLI surface to expand while keeping the core execution model stable.
+
+---
+
+## The CLI as part of the training system
+
+In practice, the CLI is not just a launcher—it is part of the training system itself.
+
+Decisions made at the CLI layer affect reproducibility, debuggability, and how quickly new workflows can be introduced. Treating the CLI as a first-class component helps prevent training systems from accumulating brittle glue code over time.
+
+---
+
+## Python-first, HPC-aware
+
+Primus CLI uses Python as the orchestration layer, which integrates naturally with configuration files, training frameworks, and developer tooling.
+
+At the same time, it treats HPC realities—such as Slurm scheduling and multi-node execution—as first-class concerns rather than special cases.
+The result is a workflow that feels natural in Python environments, yet remains practical on large clusters.
+
+---
+
+## What this enables in practice
+
+In day-to-day usage, a unified entry point translates into tangible benefits:
+
+- reproducibility across local, container, and cluster runs
+- simpler debugging due to consistent logs and execution paths
+- lower onboarding cost for new users and contributors
+- less glue code maintained outside the core workflow
+
+Teams can focus more on training behavior and performance, and less on maintaining launch scripts.
+
+---
+
+## Who this is for
+
+Primus CLI is designed for teams that:
+
+- run training workflows across multiple environments
+- care about reproducibility and debuggability at scale
+- want to evolve training pipelines without rewriting launch logic
+
+---
+
+## Closing
+
+Primus CLI aims to provide a reliable and consistent entry point for training workflows on ROCm.
+By unifying training, benchmarking, and environment inspection under a single execution model, it reduces operational friction while remaining compatible with large-scale, Slurm-based HPC environments.

From 70dcb0c654ee9800ed3aaed7cd3d4681cd7efd0c Mon Sep 17 00:00:00 2001
From: xiaompen <xiaompen@smci355-ccs-aus-n13-13.cs-aus.dcgpu>
Date: Thu, 22 Jan 2026 21:38:18 -0500
Subject: [PATCH 2/5] docs: add Primus CLI design philosophy tech blogs (EN/ZH)

---
 ...cli-design-philosophy-and-advantages.en.md | 137 ++++++++++++++++++
 ...cli-design-philosophy-and-advantages.zh.md | 135 +++++++++++++++++
 2 files changed, 272 insertions(+)
 create mode 100644 docs/tech_blogs/primus-cli-design-philosophy-and-advantages.en.md
 create mode 100644 docs/tech_blogs/primus-cli-design-philosophy-and-advantages.zh.md

diff --git a/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.en.md b/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.en.md
new file mode 100644
index 000000000..a6cd59ae1
--- /dev/null
+++ b/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.en.md
@@ -0,0 +1,137 @@
+---
+title: "Primus CLI: Design Philosophy and Advantages"
+date: "2026-01-23"
+tags: ["Primus", "CLI", "ROCm", "LLM Training", "HPC", "Slurm", "Developer Tools"]
+---
+
+## Why a unified CLI matters for large-scale training
+
+As large-scale model training stacks evolve, one persistent problem remains: launching an experiment reliably is often harder than writing the training code. The complexity shows up in environment differences (local vs container vs Slurm), distributed settings, GPU/network topology, and a growing set of “side tasks” (benchmarks, preflight checks, diagnostics).
+
+Primus CLI was built to address this problem by providing **a unified, consistent entry point** that consolidates training, benchmarking, and environment checks into one command structure—while keeping the execution path consistent across environments.
+
+This post focuses on **design philosophy** and **practical advantages**. For usage details, see `docs/cli/README.md` and the full guide `docs/cli/PRIMUS-CLI-GUIDE.md`.
+
+## Design principles
+
+### 1) Unified entry, unified mental model
+
+In many training codebases, training, benchmarks, and preflight checks are launched via different scripts with different flags and different environment assumptions. Primus CLI unifies those workflows under a single CLI that is organized via subcommands.
+
+Examples:
+
+```bash
+primus-cli direct -- train posttrain --config exp.yaml
+primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192 --dtype bf16
+primus-cli direct -- preflight --host --gpu --network
+```
+
+Why this helps:
+
+- One command family to remember
+- Less duplicated “glue” logic across scripts
+- Lower onboarding friction for new users
+
+### 2) Preserved execution path across environments
+
+A core goal of Primus CLI is to **keep the execution entry consistent across environments**. Whether you run locally, in a container, or via Slurm, Primus keeps the same task semantics and code path, only changing the runtime preparation layer.
+
+Primus supports three execution modes (as documented in `docs/cli/README.md`):
+
+- **Direct**: quick validation, local development
+- **Container**: environment isolation and reproducibility
+- **Slurm**: multi-node distributed execution on HPC clusters
+
+The command structure stays stable:
+
+```bash
+# Local
+primus-cli direct -- benchmark gemm -M 4096 -N 4096 -K 4096
+
+# Container
+primus-cli container --image rocm/primus:v25.10 -- benchmark gemm -M 4096 -N 4096 -K 4096
+
+# Slurm
+primus-cli slurm srun -N 2 -- benchmark gemm -M 16384 -N 16384 -K 16384
+```
+
+Why this helps:
+
+- No diverging “local script” vs “cluster script”
+- Easier debugging (same entry path, similar logs)
+- Reduced environment pollution and fewer “works on my machine” issues
+
+### 3) Modular and extensible by design
+
+Primus CLI is designed to be extended without destabilizing the core. New tasks can be added as additional subcommands or suites, without rewriting the launcher or duplicating wrappers.
+
+In practice, this keeps the CLI core stable while allowing the tooling surface to grow with new needs (new benchmarks, new diagnostics, new training workflows).
+
+### 4) Python-first, Slurm-friendly
+
+Primus uses Python as the orchestration language (good fit for YAML configs, framework integration, and tooling), while keeping Slurm workflows first-class. In `runner/`, the runtime-specific launchers encapsulate environment preparation and scheduling details; the task semantics remain consistent.
+
+## How the architecture maps to the repository
+
+At a high level, Primus CLI follows a three-layer structure:
+
+### Runtime layer: direct / container / slurm
+
+The `runner/` directory contains the entrypoints and launchers that implement environment-specific behavior while preserving the same task structure. For example:
+
+- `runner/primus-cli`
+- `runner/primus-cli-direct.sh`
+- `runner/primus-cli-container.sh`
+- `runner/primus-cli-slurm.sh`
+- `runner/primus-cli-slurm-entry.sh`
+
+### Hook / patch layer: workflow composition without intrusion
+
+Training workflows often require pre/post steps (preflight checks, dependency installation, checkpoint preparation, hotfixes). Primus supports a hook/patch mechanism so these steps can be composed without modifying training code.
+
+This also helps keep behavior consistent across environments, because hooks are executed as part of the same preserved entry path.
+
+### Task execution layer: train / benchmark / preflight / analyze
+
+The task layer implements what users care about: training, micro-benchmarks, preflight checks, and analysis tools. It stays focused on “what to do,” while the runtime layer focuses on “where/how to run.”
+
+## Practical advantages
+
+- **Lower cognitive load**: one command family for multiple workflows
+- **Higher reproducibility**: stable semantics across local/container/Slurm
+- **Better debuggability**: fewer divergent code paths, more consistent logs
+- **Less glue code**: hooks/patches capture common pre/post steps
+- **Safer extensibility**: add new capabilities without rewriting the core
+
+## Example workflows
+
+### Training
+
+```bash
+primus-cli direct -- train posttrain --config examples/megatron_bridge/configs/MI355X/qwen3_8b_sft_posttrain.yaml
+```
+
+### Benchmarks
+
+```bash
+primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192 --dtype bf16
+primus-cli direct -- benchmark rccl --op allreduce --num-bytes 1048576
+```
+
+### Preflight checks
+
+```bash
+primus-cli direct -- preflight --host --gpu --network
+```
+
+## Roadmap (directional)
+
+- More backends and workflow types (beyond current training backends)
+- A more unified “training + fine-tuning” command surface
+- Diagnostics and auto-tuning tools (topology, RCCL tuning, profiling/reporting)
+- Curated reproducible examples (“recipes”) for popular models and clusters
+
+## Closing
+
+Primus CLI aims to be the most reliable entry point for AMD GPU training workflows by hiding environment complexity behind a unified interface—without sacrificing HPC realities like Slurm scheduling and multi-node debugging.
+
diff --git a/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.zh.md b/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.zh.md
new file mode 100644
index 000000000..b71c8bd7c
--- /dev/null
+++ b/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.zh.md
@@ -0,0 +1,135 @@
+---
+title: "Primus CLI：设计理念与优势"
+date: "2026-01-23"
+tags: ["Primus", "CLI", "ROCm", "大模型训练", "HPC", "Slurm", "工程效率"]
+---
+
+## 为什么需要一个统一的训练入口
+
+大规模模型训练的难点，往往不在训练代码本身，而在“如何把一次实验稳定跑起来”：本机、容器、Slurm 集群之间的环境差异；分布式参数与拓扑差异；以及越来越多的“配套工作”（benchmarks、preflight 检查、诊断分析、热修复等）。
+
+Primus CLI 的目标就是解决这个问题：提供一个**统一且一致的命令入口**，把训练、benchmark、环境检查等工作流收敛到一个结构化的 CLI 中，同时尽可能保证在不同运行环境下走**同一条执行路径**。
+
+本文重点介绍 Primus CLI 的**设计理念**与**工程优势**。具体用法可参考 `docs/cli/README.md` 与完整指南 `docs/cli/PRIMUS-CLI-GUIDE.md`。
+
+## 设计理念
+
+### 1）统一入口，统一心智模型
+
+传统大规模训练项目里，训练、benchmark、preflight 往往由不同脚本启动；参数风格不一致、环境变量设置方式各不相同，最终会带来较高的使用与维护成本。Primus CLI 用子命令体系把这些入口统一起来。
+
+示例：
+
+```bash
+primus-cli direct -- train posttrain --config exp.yaml
+primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192 --dtype bf16
+primus-cli direct -- preflight --host --gpu --network
+```
+
+带来的收益：
+
+- 命令结构清晰、易记
+- 减少脚本间重复的“胶水逻辑”
+- 新同学上手更快
+
+### 2）跨环境保持同一路径（Preserved Execution Path）
+
+Primus CLI 的核心目标之一是：在本机、容器、Slurm 等不同运行环境下，尽可能保持同一条任务执行链路，只把差异收敛到运行时准备层。
+
+Primus 支持三种执行模式（见 `docs/cli/README.md`）：
+
+- **Direct**：本机/快速验证/调试
+- **Container**：隔离环境、保证依赖一致性
+- **Slurm**：HPC 集群多节点调度与分布式执行
+
+命令结构保持一致：
+
+```bash
+# 本机
+primus-cli direct -- benchmark gemm -M 4096 -N 4096 -K 4096
+
+# 容器
+primus-cli container --image rocm/primus:v25.10 -- benchmark gemm -M 4096 -N 4096 -K 4096
+
+# Slurm
+primus-cli slurm srun -N 2 -- benchmark gemm -M 16384 -N 16384 -K 16384
+```
+
+带来的收益：
+
+- 避免“本机脚本”和“集群脚本”分叉
+- 调试更容易复现（入口一致、日志路径更统一）
+- 更少环境污染与不确定性
+
+### 3）模块化、可扩展
+
+Primus CLI 的设计强调“核心稳定 + 能力可插拔”。新增任务（例如新的 benchmark suite、新的诊断工具、新的训练流程）应尽量以独立模块形式加入，而不是修改核心入口并复制/分叉已有逻辑。
+
+### 4）Python 优先，同时 Slurm 友好
+
+Primus 在编排层采用 Python（便于处理 YAML 配置、对接训练框架与工具链），同时对 Slurm 保持一等公民支持。在 `runner/` 下，运行时 launcher 把调度与环境准备的差异封装起来，让任务层只关心“做什么”。
+
+## 架构如何映射到代码仓库
+
+从工程结构上，Primus CLI 可以理解为三层：
+
+### Runtime 层：direct / container / slurm
+
+`runner/` 目录包含统一入口与不同运行时 launcher，例如：
+
+- `runner/primus-cli`
+- `runner/primus-cli-direct.sh`
+- `runner/primus-cli-container.sh`
+- `runner/primus-cli-slurm.sh`
+- `runner/primus-cli-slurm-entry.sh`
+
+这些脚本负责“在哪跑/怎么调度/怎么准备环境”，并把任务参数交给任务层执行。
+
+### Hook / Patch 层：把前后置流程从训练代码剥离
+
+大规模训练往往需要复杂的前后置步骤：安装依赖、准备 checkpoint、环境检查、热修复等。Primus 通过 hook/patch 机制把这类步骤做成可组合的流水线，从而减少对训练代码的侵入，也减少脚本复制粘贴。
+
+### Task 层：train / benchmark / preflight / analyze
+
+任务层负责具体的训练、benchmark、preflight、分析工具等逻辑；运行时层负责把任务以一致方式落到 direct/container/slurm 中执行。
+
+## 工程优势总结
+
+- **学习成本低**：统一入口覆盖多个工作流
+- **可复现性更强**：跨环境尽量保持同一路径
+- **更易调试**：减少分叉路径带来的日志/行为差异
+- **减少胶水代码**：hook/patch 复用通用前后置步骤
+- **扩展更安全**：新增能力不需要改核心入口
+
+## 使用示例
+
+### 训练
+
+```bash
+primus-cli direct -- train posttrain --config examples/megatron_bridge/configs/MI355X/qwen3_8b_sft_posttrain.yaml
+```
+
+### Benchmarks
+
+```bash
+primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192 --dtype bf16
+primus-cli direct -- benchmark rccl --op allreduce --num-bytes 1048576
+```
+
+### Preflight 环境检查
+
+```bash
+primus-cli direct -- preflight --host --gpu --network
+```
+
+## Roadmap（方向性）
+
+- 支持更多后端与任务类型
+- 更统一的训练与微调任务面（pretrain / sft / rlhf 等）
+- 诊断与自动调优（拓扑采集、通信调参、profiling 报告）
+- 沉淀可复现的模型与集群“配方化”示例
+
+## 结语
+
+Primus CLI 试图把训练工程中最昂贵的不确定性（环境差异、入口分叉、脚本重复、难以复现）系统性地下沉到可维护的架构中，让用户用一个一致的命令结构完成从开发到生产、从单机到多节点的迁移与复现。
+

From ca5bbffccea65c7fc1f5624678609ee788493a5a Mon Sep 17 00:00:00 2001
From: xiaoming-amd <xiaoming.peng@amd.com>
Date: Thu, 22 Jan 2026 21:43:06 -0500
Subject: [PATCH 3/5] docs: rename Primus CLI ROCm tech blog to drop duplicate .md extension

---
 ..._unified_entry_rocm.md.md => primus_cli_unified_entry_rocm.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename docs/tech_blogs/{primus_cli_unified_entry_rocm.md.md => primus_cli_unified_entry_rocm.md} (100%)

diff --git a/docs/tech_blogs/primus_cli_unified_entry_rocm.md.md b/docs/tech_blogs/primus_cli_unified_entry_rocm.md
similarity index 100%
rename from docs/tech_blogs/primus_cli_unified_entry_rocm.md.md
rename to docs/tech_blogs/primus_cli_unified_entry_rocm.md

From 23ca2c7738863247ff616f5868f24923e6d8c0bf Mon Sep 17 00:00:00 2001
From: xiaoming-amd <xiaoming.peng@amd.com>
Date: Thu, 22 Jan 2026 21:45:03 -0500
Subject: [PATCH 4/5] docs: remove Primus CLI design philosophy tech blogs (EN/ZH)

---
 ...cli-design-philosophy-and-advantages.en.md | 137 ------------------
 ...cli-design-philosophy-and-advantages.zh.md | 135 -----------------
 2 files changed, 272 deletions(-)
 delete mode 100644 docs/tech_blogs/primus-cli-design-philosophy-and-advantages.en.md
 delete mode 100644 docs/tech_blogs/primus-cli-design-philosophy-and-advantages.zh.md

diff --git a/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.en.md b/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.en.md
deleted file mode 100644
index a6cd59ae1..000000000
--- a/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.en.md
+++ /dev/null
@@ -1,137 +0,0 @@
----
-title: "Primus CLI: Design Philosophy and Advantages"
-date: "2026-01-23"
-tags: ["Primus", "CLI", "ROCm", "LLM Training", "HPC", "Slurm", "Developer Tools"]
----
-
-## Why a unified CLI matters for large-scale training
-
-As large-scale model training stacks evolve, one persistent problem remains: launching an experiment reliably is often harder than writing the training code. The complexity shows up in environment differences (local vs container vs Slurm), distributed settings, GPU/network topology, and a growing set of “side tasks” (benchmarks, preflight checks, diagnostics).
-
-Primus CLI was built to address this problem by providing **a unified, consistent entry point** that consolidates training, benchmarking, and environment checks into one command structure—while keeping the execution path consistent across environments.
-
-This post focuses on **design philosophy** and **practical advantages**. For usage details, see `docs/cli/README.md` and the full guide `docs/cli/PRIMUS-CLI-GUIDE.md`.
-
-## Design principles
-
-### 1) Unified entry, unified mental model
-
-In many training codebases, training, benchmarks, and preflight checks are launched via different scripts with different flags and different environment assumptions. Primus CLI unifies those workflows under a single CLI that is organized via subcommands.
-
-Examples:
-
-```bash
-primus-cli direct -- train posttrain --config exp.yaml
-primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192 --dtype bf16
-primus-cli direct -- preflight --host --gpu --network
-```
-
-Why this helps:
-
-- One command family to remember
-- Less duplicated “glue” logic across scripts
-- Lower onboarding friction for new users
-
-### 2) Preserved execution path across environments
-
-A core goal of Primus CLI is to **keep the execution entry consistent across environments**. Whether you run locally, in a container, or via Slurm, Primus keeps the same task semantics and code path, only changing the runtime preparation layer.
-
-Primus supports three execution modes (as documented in `docs/cli/README.md`):
-
-- **Direct**: quick validation, local development
-- **Container**: environment isolation and reproducibility
-- **Slurm**: multi-node distributed execution on HPC clusters
-
-The command structure stays stable:
-
-```bash
-# Local
-primus-cli direct -- benchmark gemm -M 4096 -N 4096 -K 4096
-
-# Container
-primus-cli container --image rocm/primus:v25.10 -- benchmark gemm -M 4096 -N 4096 -K 4096
-
-# Slurm
-primus-cli slurm srun -N 2 -- benchmark gemm -M 16384 -N 16384 -K 16384
-```
-
-Why this helps:
-
-- No diverging “local script” vs “cluster script”
-- Easier debugging (same entry path, similar logs)
-- Reduced environment pollution and fewer “works on my machine” issues
-
-### 3) Modular and extensible by design
-
-Primus CLI is designed to be extended without destabilizing the core. New tasks can be added as additional subcommands or suites, without rewriting the launcher or duplicating wrappers.
-
-In practice, this keeps the CLI core stable while allowing the tooling surface to grow with new needs (new benchmarks, new diagnostics, new training workflows).
-
-### 4) Python-first, Slurm-friendly
-
-Primus uses Python as the orchestration language (good fit for YAML configs, framework integration, and tooling), while keeping Slurm workflows first-class. In `runner/`, the runtime-specific launchers encapsulate environment preparation and scheduling details; the task semantics remain consistent.
-
-## How the architecture maps to the repository
-
-At a high level, Primus CLI follows a three-layer structure:
-
-### Runtime layer: direct / container / slurm
-
-The `runner/` directory contains the entrypoints and launchers that implement environment-specific behavior while preserving the same task structure. For example:
-
-- `runner/primus-cli`
-- `runner/primus-cli-direct.sh`
-- `runner/primus-cli-container.sh`
-- `runner/primus-cli-slurm.sh`
-- `runner/primus-cli-slurm-entry.sh`
-
-### Hook / patch layer: workflow composition without intrusion
-
-Training workflows often require pre/post steps (preflight checks, dependency installation, checkpoint preparation, hotfixes). Primus supports a hook/patch mechanism so these steps can be composed without modifying training code.
-
-This also helps keep behavior consistent across environments, because hooks are executed as part of the same preserved entry path.
-
-### Task execution layer: train / benchmark / preflight / analyze
-
-The task layer implements what users care about: training, micro-benchmarks, preflight checks, and analysis tools. It stays focused on “what to do,” while the runtime layer focuses on “where/how to run.”
-
-## Practical advantages
-
-- **Lower cognitive load**: one command family for multiple workflows
-- **Higher reproducibility**: stable semantics across local/container/Slurm
-- **Better debuggability**: fewer divergent code paths, more consistent logs
-- **Less glue code**: hooks/patches capture common pre/post steps
-- **Safer extensibility**: add new capabilities without rewriting the core
-
-## Example workflows
-
-### Training
-
-```bash
-primus-cli direct -- train posttrain --config examples/megatron_bridge/configs/MI355X/qwen3_8b_sft_posttrain.yaml
-```
-
-### Benchmarks
-
-```bash
-primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192 --dtype bf16
-primus-cli direct -- benchmark rccl --op allreduce --num-bytes 1048576
-```
-
-### Preflight checks
-
-```bash
-primus-cli direct -- preflight --host --gpu --network
-```
-
-## Roadmap (directional)
-
-- More backends and workflow types (beyond current training backends)
-- A more unified “training + fine-tuning” command surface
-- Diagnostics and auto-tuning tools (topology, RCCL tuning, profiling/reporting)
-- Curated reproducible examples (“recipes”) for popular models and clusters
-
-## Closing
-
-Primus CLI aims to be the most reliable entry point for AMD GPU training workflows by hiding environment complexity behind a unified interface—without sacrificing HPC realities like Slurm scheduling and multi-node debugging.
-
diff --git a/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.zh.md b/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.zh.md
deleted file mode 100644
index b71c8bd7c..000000000
--- a/docs/tech_blogs/primus-cli-design-philosophy-and-advantages.zh.md
+++ /dev/null
@@ -1,135 +0,0 @@
----
-title: "Primus CLI：设计理念与优势"
-date: "2026-01-23"
-tags: ["Primus", "CLI", "ROCm", "大模型训练", "HPC", "Slurm", "工程效率"]
----
-
-## 为什么需要一个统一的训练入口
-
-大规模模型训练的难点，往往不在训练代码本身，而在“如何把一次实验稳定跑起来”：本机、容器、Slurm 集群之间的环境差异；分布式参数与拓扑差异；以及越来越多的“配套工作”（benchmarks、preflight 检查、诊断分析、热修复等）。
-
-Primus CLI 的目标就是解决这个问题：提供一个**统一且一致的命令入口**，把训练、benchmark、环境检查等工作流收敛到一个结构化的 CLI 中，同时尽可能保证在不同运行环境下走**同一条执行路径**。
-
-本文重点介绍 Primus CLI 的**设计理念**与**工程优势**。具体用法可参考 `docs/cli/README.md` 与完整指南 `docs/cli/PRIMUS-CLI-GUIDE.md`。
-
-## 设计理念
-
-### 1）统一入口，统一心智模型
-
-传统大规模训练项目里，训练、benchmark、preflight 往往由不同脚本启动；参数风格不一致、环境变量设置方式各不相同，最终会带来较高的使用与维护成本。Primus CLI 用子命令体系把这些入口统一起来。
-
-示例：
-
-```bash
-primus-cli direct -- train posttrain --config exp.yaml
-primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192 --dtype bf16
-primus-cli direct -- preflight --host --gpu --network
-```
-
-带来的收益：
-
-- 命令结构清晰、易记
-- 减少脚本间重复的“胶水逻辑”
-- 新同学上手更快
-
-### 2）跨环境保持同一路径（Preserved Execution Path）
-
-Primus CLI 的核心目标之一是：在本机、容器、Slurm 等不同运行环境下，尽可能保持同一条任务执行链路，只把差异收敛到运行时准备层。
-
-Primus 支持三种执行模式（见 `docs/cli/README.md`）：
-
-- **Direct**：本机/快速验证/调试
-- **Container**：隔离环境、保证依赖一致性
-- **Slurm**：HPC 集群多节点调度与分布式执行
-
-命令结构保持一致：
-
-```bash
-# 本机
-primus-cli direct -- benchmark gemm -M 4096 -N 4096 -K 4096
-
-# 容器
-primus-cli container --image rocm/primus:v25.10 -- benchmark gemm -M 4096 -N 4096 -K 4096
-
-# Slurm
-primus-cli slurm srun -N 2 -- benchmark gemm -M 16384 -N 16384 -K 16384
-```
-
-带来的收益：
-
-- 避免“本机脚本”和“集群脚本”分叉
-- 调试更容易复现（入口一致、日志路径更统一）
-- 更少环境污染与不确定性
-
-### 3）模块化、可扩展
-
-Primus CLI 的设计强调“核心稳定 + 能力可插拔”。新增任务（例如新的 benchmark suite、新的诊断工具、新的训练流程）应尽量以独立模块形式加入，而不是修改核心入口并复制/分叉已有逻辑。
-
-### 4）Python 优先，同时 Slurm 友好
-
-Primus 在编排层采用 Python（便于处理 YAML 配置、对接训练框架与工具链），同时对 Slurm 保持一等公民支持。在 `runner/` 下，运行时 launcher 把调度与环境准备的差异封装起来，让任务层只关心“做什么”。
-
-## 架构如何映射到代码仓库
-
-从工程结构上，Primus CLI 可以理解为三层：
-
-### Runtime 层：direct / container / slurm
-
-`runner/` 目录包含统一入口与不同运行时 launcher，例如：
-
-- `runner/primus-cli`
-- `runner/primus-cli-direct.sh`
-- `runner/primus-cli-container.sh`
-- `runner/primus-cli-slurm.sh`
-- `runner/primus-cli-slurm-entry.sh`
-
-这些脚本负责“在哪跑/怎么调度/怎么准备环境”，并把任务参数交给任务层执行。
-
-### Hook / Patch 层：把前后置流程从训练代码剥离
-
-大规模训练往往需要复杂的前后置步骤：安装依赖、准备 checkpoint、环境检查、热修复等。Primus 通过 hook/patch 机制把这类步骤做成可组合的流水线，从而减少对训练代码的侵入，也减少脚本复制粘贴。
-
-### Task 层：train / benchmark / preflight / analyze
-
-任务层负责具体的训练、benchmark、preflight、分析工具等逻辑；运行时层负责把任务以一致方式落到 direct/container/slurm 中执行。
-
-## 工程优势总结
-
-- **学习成本低**：统一入口覆盖多个工作流
-- **可复现性更强**：跨环境尽量保持同一路径
-- **更易调试**：减少分叉路径带来的日志/行为差异
-- **减少胶水代码**：hook/patch 复用通用前后置步骤
-- **扩展更安全**：新增能力不需要改核心入口
-
-## 使用示例
-
-### 训练
-
-```bash
-primus-cli direct -- train posttrain --config examples/megatron_bridge/configs/MI355X/qwen3_8b_sft_posttrain.yaml
-```
-
-### Benchmarks
-
-```bash
-primus-cli direct -- benchmark gemm -M 8192 -N 8192 -K 8192 --dtype bf16
-primus-cli direct -- benchmark rccl --op allreduce --num-bytes 1048576
-```
-
-### Preflight 环境检查
-
-```bash
-primus-cli direct -- preflight --host --gpu --network
-```
-
-## Roadmap（方向性）
-
-- 支持更多后端与任务类型
-- 更统一的训练与微调任务面（pretrain / sft / rlhf 等）
-- 诊断与自动调优（拓扑采集、通信调参、profiling 报告）
-- 沉淀可复现的模型与集群“配方化”示例
-
-## 结语
-
-Primus CLI 试图把训练工程中最昂贵的不确定性（环境差异、入口分叉、脚本重复、难以复现）系统性地下沉到可维护的架构中，让用户用一个一致的命令结构完成从开发到生产、从单机到多节点的迁移与复现。
-

From a917d0b38af8dbdca5bb5d83dc86c2ecd8fdc965 Mon Sep 17 00:00:00 2001
From: xiaoming-amd <xiaoming.peng@amd.com>
Date: Thu, 22 Jan 2026 21:51:38 -0500
Subject: [PATCH 5/5] docs: add front matter to Primus CLI ROCm tech blog

---
 docs/tech_blogs/primus_cli_unified_entry_rocm.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/tech_blogs/primus_cli_unified_entry_rocm.md b/docs/tech_blogs/primus_cli_unified_entry_rocm.md
index 1c801e0b9..acefd8db8 100644
--- a/docs/tech_blogs/primus_cli_unified_entry_rocm.md
+++ b/docs/tech_blogs/primus_cli_unified_entry_rocm.md
@@ -1,3 +1,9 @@
+---
+title: "Primus CLI: A Unified Entry Point for Training on ROCm"
+date: "2026-01-23"
+tags: ["ROCm", "LLM Training", "HPC", "Slurm", "Developer Tools"]
+---
+
 <!---
 Copyright (c) 2025 Advanced Micro Devices, Inc. (AMD)