88 changes: 88 additions & 0 deletions skyrl-train/docs/LOGGING_IMPROVEMENTS.md
@@ -0,0 +1,88 @@
# Logging Improvements for SkyRL

## Goal
Separate infrastructure logs from training progress logs:
- **Training progress** (config, steps, rewards, metrics, timings) → stdout
- **Infrastructure logs** (vLLM engine, model loading, KV cache) → log file only
- **SKYRL_LOG_LEVEL=DEBUG** → show all logs on stdout (for debugging)

## Implementation

### Approach: OS-level stdout/stderr redirection with `os.dup2`

The solution redirects stdout/stderr at the OS level for specific Ray actors (vLLM engines, workers), while leaving the training entrypoint unredirected so training progress reaches stdout.
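
Why OS-level? Reassigning `sys.stdout` in Python only reroutes Python-level writes; vLLM's native layers and any subprocesses write directly to file descriptors 1 and 2 and would still reach the driver. A minimal, self-contained sketch of the distinction (log path hypothetical):

```python
import os
import sys

# os.dup2 swaps the OS-level descriptor itself, so Python writes, native
# (C/C++/CUDA) writes, and subprocess output are all redirected, whereas
# `sys.stdout = f` would only catch Python-level writes.
log = open("/tmp/demo-infra.log", "a", buffering=1)  # hypothetical path
os.dup2(log.fileno(), sys.stdout.fileno())  # fd 1 now points at the file
os.dup2(log.fileno(), sys.stderr.fileno())  # fd 2 as well
print("lands in the log file, not the terminal")
os.system("echo 'subprocess output lands there too'")
```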

### Files Modified

1. **`skyrl_train/env_vars.py`** - Added environment variables:
- `SKYRL_LOG_DIR` - Base directory for logs (default: `/tmp/skyrl-logs`)
- `SKYRL_LOG_LEVEL` - Log level; DEBUG shows all logs on stdout

2. **`skyrl_train/utils/ray_logging.py`** (new file) - Helper module:
```python
import os
import sys

def redirect_actor_output_to_file():
    """Redirect stdout/stderr to the log file at the OS level."""
    log_file = os.getenv("SKYRL_LOG_FILE")
    if log_file:
        log_fd = open(log_file, "a", buffering=1)
        os.dup2(log_fd.fileno(), sys.stdout.fileno())
        os.dup2(log_fd.fileno(), sys.stderr.fileno())
```

3. **`skyrl_train/utils/utils.py`**:
- `initialize_ray()` - Sets up the log file path and passes it to workers via `runtime_env`
- `configure_ray_worker_logging()` - Only configures loguru (no redirect here)

4. **`skyrl_train/inference_engines/vllm/vllm_engine.py`**:
- `VLLMInferenceEngine.__init__()` - Calls `redirect_actor_output_to_file()`
- `AsyncVLLMInferenceEngine.__init__()` - Calls `redirect_actor_output_to_file()`

5. **`skyrl_train/workers/worker.py`**:
- `BaseWorker.__init__()` - Calls `redirect_actor_output_to_file()` (the shared call pattern is sketched just after this list)
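
As referenced above, a hedged sketch of the call pattern these `__init__` methods share (the actor class below is a hypothetical stand-in; the real call sites are in the diffs that follow):

```python
import ray


@ray.remote
class NoisyInfraActor:  # hypothetical stand-in for a vLLM engine or worker
    def __init__(self):
        # Redirect before any noisy initialization so Ray never forwards
        # this actor's output to the driver's stdout.
        from skyrl_train.utils.ray_logging import redirect_actor_output_to_file

        redirect_actor_output_to_file()
        print("goes to SKYRL_LOG_FILE, not the driver")
```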

### How It Works

1. `initialize_ray()` creates log directory and sets `SKYRL_LOG_FILE` env var
2. `log_to_driver=True` allows Ray to forward actor output to driver stdout
3. vLLM engines call `redirect_actor_output_to_file()` in `__init__` → logs go to file instead of being forwarded
4. Workers call `redirect_actor_output_to_file()` in `__init__` → logs go to file
5. `skyrl_entrypoint` does NOT redirect → training progress reaches stdout (a condensed sketch of this handoff follows)
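
A condensed sketch of the handoff described in steps 1-5 (run name hypothetical); the real wiring appears in the `initialize_ray()` diff below:

```python
import ray

# Driver side: choose the infra log path and propagate it to every actor
# via the Ray runtime environment. log_to_driver=True keeps forwarding on,
# so the unredirected entrypoint still prints training progress.
log_file = "/tmp/skyrl-logs/my-run/infra.log"  # hypothetical run name
ray.init(
    runtime_env={"env_vars": {"SKYRL_LOG_FILE": log_file}},
    log_to_driver=True,
)
# Actor side: infra actors read SKYRL_LOG_FILE in __init__ and dup2 their
# fds 1/2 onto it; skyrl_entrypoint skips this, so its output reaches stdout.
```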

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `SKYRL_LOG_DIR` | `/tmp/skyrl-logs` | Base directory for log files |
| `SKYRL_LOG_FILE` | (auto-set) | Full path to infra.log (set by initialize_ray) |
| `SKYRL_LOG_LEVEL` | `INFO` | Log level; `DEBUG` shows all logs on stdout |

## Results

### Before
- 300+ lines on stdout including vLLM model loading, KV cache allocation, tensor parallel setup, etc.

### After
- **Stdout**: Config dump, dataset loading, training progress (steps, metrics, rewards)
- **Log file** (`/tmp/skyrl-logs/{run_name}/infra.log`): vLLM engine logs, worker initialization, model loading

### Known Limitations

1. **Ray `(raylet)` logs still appear on stdout** - These are Ray system logs emitted before our actors start; they amount to only a few lines and are far less noisy than the vLLM output.

2. **Ray dedup messages** - Ray may still print informational "repeated Nx across cluster" deduplication messages.

## Usage

```bash
# Normal run - training progress on stdout, infra logs to file
bash examples/gsm8k/run_gsm8k.sh

# Debug mode - all logs on stdout
SKYRL_LOG_LEVEL=DEBUG bash examples/gsm8k/run_gsm8k.sh

# Custom log directory
SKYRL_LOG_DIR=/path/to/logs bash examples/gsm8k/run_gsm8k.sh

# View infrastructure logs
tail -f /tmp/skyrl-logs/{run_name}/infra.log
```
20 changes: 20 additions & 0 deletions skyrl-train/skyrl_train/env_vars.py
@@ -49,3 +49,23 @@

See https://github.com/ray-project/ray/issues/56697 for details on why this is needed.
"""

# ─────────────────────────────────────────────────────────────────────────────
# Logging
# ─────────────────────────────────────────────────────────────────────────────

SKYRL_LOG_LEVEL = os.environ.get("SKYRL_LOG_LEVEL", "INFO").upper()
"""
Log level for SkyRL. Controls log filtering and stdout verbosity.

- INFO (default): Training progress on stdout, infrastructure logs to file only
- DEBUG: All logs (including vLLM, Ray, workers) shown on stdout
- Also used by loguru for log level filtering (ERROR, WARNING, etc.)
"""

SKYRL_LOG_DIR = os.environ.get("SKYRL_LOG_DIR", "/tmp/skyrl-logs")
"""
Base directory for SkyRL log files (default: /tmp/skyrl-logs).

Infrastructure logs are written to: {SKYRL_LOG_DIR}/{run_name}/infra.log
"""
10 changes: 10 additions & 0 deletions skyrl-train/skyrl_train/inference_engines/vllm/vllm_engine.py
@@ -236,6 +236,11 @@ class VLLMInferenceEngine(BaseVLLMInferenceEngine):
    """Synchronous VLLM engine."""

    def __init__(self, *args, **kwargs):
        # Redirect vLLM output to log file (prevents polluting driver stdout)
        from skyrl_train.utils.ray_logging import redirect_actor_output_to_file

        redirect_actor_output_to_file()

        super().__init__(*args, **kwargs)
        self._weight_loader = VLLMWeightLoader(self.llm, is_async=False)

@@ -345,6 +350,11 @@ class AsyncVLLMInferenceEngine(BaseVLLMInferenceEngine):
    """Asynchronous VLLM engine."""

    def __init__(self, *args, **kwargs):
        # Redirect vLLM output to log file (prevents polluting driver stdout)
        from skyrl_train.utils.ray_logging import redirect_actor_output_to_file

        redirect_actor_output_to_file()

        super().__init__(*args, **kwargs)
        self._weight_loader = VLLMWeightLoader(self.llm, is_async=True)

27 changes: 27 additions & 0 deletions skyrl-train/skyrl_train/utils/ray_logging.py
@@ -0,0 +1,27 @@
"""
Helper to redirect Ray actor stdout/stderr to log file.

This prevents infrastructure logs from polluting the driver's stdout,
allowing only training progress to be displayed to the user.
"""

import os
import sys


def redirect_actor_output_to_file():
    """
    Redirect stdout and stderr to log file to prevent Ray from forwarding to driver.

    Call this at the very start of any Ray actor/remote function where you want
    to suppress output from appearing on the driver's stdout. The output will
    instead be written to the log file specified by SKYRL_LOG_FILE.

    Note: Do NOT call this in skyrl_entrypoint() - training progress should
    go to stdout.
    """
    log_file = os.getenv("SKYRL_LOG_FILE")
    if log_file:
        log_fd = open(log_file, "a", buffering=1)  # noqa: SIM115
        os.dup2(log_fd.fileno(), sys.stdout.fileno())
        os.dup2(log_fd.fileno(), sys.stderr.fileno())
Comment on lines +12 to +27
Contributor (severity: high):

The current implementation unconditionally redirects stdout/stderr if SKYRL_LOG_FILE is set. However, the documentation and PR description state that setting SKYRL_LOG_LEVEL=DEBUG should show all logs on stdout. To align with this behavior, you should avoid redirection when the log level is 'DEBUG'.

```python
def redirect_actor_output_to_file():
    """
    Redirect stdout and stderr to log file to prevent Ray from forwarding to driver.

    Call this at the very start of any Ray actor/remote function where you want
    to suppress output from appearing on the driver's stdout. The output will
    instead be written to the log file specified by SKYRL_LOG_FILE.

    Note: Do NOT call this in skyrl_entrypoint() - training progress should
    go to stdout.
    """
    from skyrl_train.env_vars import SKYRL_LOG_LEVEL

    if SKYRL_LOG_LEVEL == "DEBUG":
        return

    log_file = os.getenv("SKYRL_LOG_FILE")
    if log_file:
        log_fd = open(log_file, "a", buffering=1)  # noqa: SIM115
        os.dup2(log_fd.fileno(), sys.stdout.fileno())
        os.dup2(log_fd.fileno(), sys.stderr.fileno())
```

42 changes: 37 additions & 5 deletions skyrl-train/skyrl_train/utils/utils.py
@@ -17,7 +17,14 @@
    placement_group_table,
)

from skyrl_train.env_vars import SKYRL_LD_LIBRARY_PATH_EXPORT, SKYRL_RAY_PG_TIMEOUT_IN_S, SKYRL_PYTHONPATH_EXPORT
from skyrl_train.env_vars import (
    SKYRL_LD_LIBRARY_PATH_EXPORT,
    SKYRL_LOG_DIR,
    SKYRL_LOG_LEVEL,
    SKYRL_PYTHONPATH_EXPORT,
    SKYRL_RAY_PG_TIMEOUT_IN_S,
)
from pathlib import Path


class Timer:
@@ -595,9 +602,14 @@ def prepare_runtime_environment(cfg: DictConfig) -> dict[str, str]:

def configure_ray_worker_logging() -> None:
"""
In Ray workers, stderr/stdout are not TTYs, so Loguru disables color.
This method forces color and formatting (e.g., bold) and routes stdlib `logging`
through Loguru so third-party logs match formatting
Configure logging for Ray workers.

This method:
1. Forces color and formatting for Loguru (even without TTY)
2. Routes stdlib logging through Loguru

Note: This does NOT redirect stdout/stderr. For infra actors (vLLM, workers),
call redirect_actor_output_to_file() separately in their __init__.
"""
level_name = os.getenv("LOG_LEVEL", "INFO").upper()

@@ -640,8 +652,28 @@ def initialize_ray(cfg: DictConfig):
    """
    from .ppo_utils import sync_registries

    # Determine if we should show all logs on stdout (DEBUG mode)
    verbose_logging = SKYRL_LOG_LEVEL == "DEBUG"

    # Suppress Ray backend logs unless in verbose mode
    if not verbose_logging:
        os.environ["RAY_BACKEND_LOG_LEVEL"] = "fatal"

    # Set up log file for infrastructure logs
    log_dir = Path(SKYRL_LOG_DIR) / cfg.trainer.run_name
    log_dir.mkdir(parents=True, exist_ok=True)
    log_file = str(log_dir / "infra.log")
Comment on lines +663 to +665
Contributor (severity: security-high):

The initialize_ray function constructs the log_dir and log_file paths from SKYRL_LOG_DIR and cfg.trainer.run_name without validation. This creates a path traversal vulnerability: a malicious run_name (e.g., ../../etc/cron.daily) could lead to arbitrary file creation in sensitive system directories. Since log content can be influenced by model outputs, this poses a significant risk, including potential arbitrary code execution if the file lands in an executable location (such as cron.daily). To remediate, resolve the constructed path and verify it stays within the intended base directory.

Additionally, the current implementation sets up log file redirection regardless of SKYRL_LOG_LEVEL. When SKYRL_LOG_LEVEL is DEBUG, an empty log file is still created and a potentially misleading message is printed. The log file setup should be conditional on SKYRL_LOG_LEVEL not being DEBUG.
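
A minimal sketch of the suggested containment check (helper name hypothetical): resolve the combined path and reject any `run_name` that escapes the base directory:

```python
from pathlib import Path


def resolve_run_log_dir(base_dir: str, run_name: str) -> Path:
    """Hypothetical helper: confine {base_dir}/{run_name} inside base_dir."""
    base = Path(base_dir).resolve()
    log_dir = (base / run_name).resolve()
    # Reject run names like "../../etc/cron.daily" that resolve outside base.
    if not log_dir.is_relative_to(base):  # Python 3.9+
        raise ValueError(f"run_name escapes log directory: {run_name!r}")
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir
```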

    os.environ["SKYRL_LOG_FILE"] = log_file

    env_vars = prepare_runtime_environment(cfg)
    ray.init(runtime_env={"env_vars": env_vars})
    # Pass log file path to workers so they can redirect their output
    env_vars["SKYRL_LOG_FILE"] = log_file

    # log_to_driver=True allows training progress from skyrl_entrypoint to reach stdout.
    # Infrastructure logs (vLLM, workers) are redirected to log file via os.dup2 in their init.
    ray.init(runtime_env={"env_vars": env_vars}, log_to_driver=True)

    logger.info(f"Infrastructure logs will be written to: {log_file}")

    # create the named ray actors for the registries to make available to all workers
    sync_registries()
5 changes: 5 additions & 0 deletions skyrl-train/skyrl_train/workers/worker.py
@@ -89,6 +89,11 @@ def __init__(
        self.record_memory = record_memory
        if record_memory:
            torch.cuda.memory._record_memory_history()

        # Redirect worker output to log file (infra logs shouldn't pollute driver stdout)
        from skyrl_train.utils.ray_logging import redirect_actor_output_to_file

        redirect_actor_output_to_file()
        configure_ray_worker_logging()

    def get_node_local_rank(self):