
Improve userTransfer.php for autonomous operation #202

@MagnaCapax

Executive Summary

userTransfer.php is a critical migration tool that has caused multiple incidents because it lacks observability, pre-flight checks, and support for autonomous operation. This issue proposes a comprehensive redesign based on the six documented failure modes listed below.

Documented Failure Modes

From sysadmin/memory/lessons/ — these are real incidents:

1. Silent Auth Failures (Fuckup Level: HIGH)

Incident: The eyfqehxs migration burned 31 passes (hours) with zero bytes transferred because the password didn't match. Nobody noticed until the customer complained days later.
Root cause: No pre-flight auth check. Script optimistically starts all passes assuming auth works.

2. Password Coordination Failures

Incident: Source lockdown changed the password without updating PMSS_USER_TRANSFER_PASSWORD to match. Transfer silently failed.
Root cause: No validation that the provided password actually works before starting.

3. Unmonitored Transfers

Incident: Screen session started and forgotten. Transfer stalled, nobody knew for days.
Root cause: No progress reporting, no health checks, no notifications.

4. Source Not Locked Down Properly

Incident: Tried removing .rtorrentExecuteRun — ineffective. Cron watchdog restarted rtorrent within 2 minutes. Data kept changing during sync.
Root cause: Lockdown logic not integrated into userTransfer. Operator must know to rename .rtorrent.rc.

5. Disk Space Exhaustion

Incident: Transfer started without checking if target had enough space. Failed mid-transfer.
Root cause: No pre-flight disk space check.

6. No Resume Capability

Incident: Transfer interrupted, had to restart from scratch.
Root cause: Each pass is independent; no stateful checkpointing.
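
One way to make passes stateful is a small checkpoint file recording the last pass that finished cleanly, so a restarted run can skip ahead. This is a minimal sketch in Python rather than the tool's PHP; the file layout and the names `save_checkpoint` and `load_checkpoint` are hypothetical, not part of userTransfer.php.

```python
import json
import os
import tempfile


def save_checkpoint(path, user, last_completed_pass):
    """Atomically record the last pass that finished cleanly."""
    state = {"user": user, "last_completed_pass": last_completed_pass}
    directory = os.path.dirname(path) or "."
    # Write to a temp file in the same directory, then rename over the
    # target so a crash never leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp, path)


def load_checkpoint(path, user):
    """Return the pass number to resume from (1 if no usable checkpoint)."""
    try:
        with open(path) as handle:
            state = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return 1
    if state.get("user") != user:
        return 1  # checkpoint belongs to a different migration
    return int(state["last_completed_pass"]) + 1
```

A transfer loop would call `save_checkpoint` after each pass and start from `load_checkpoint(...)` when resuming.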

Design Principles for Ideal Tool

For Agent Operation (Primary User)

  1. Fire-and-forget with confidence — Start once, get notification on completion/failure
  2. Observable — Structured status file that agents can poll
  3. Self-healing — Retry transient failures (network blips, temp disk full)
  4. Clear failure signals — Exit immediately on unrecoverable errors (auth failure)
  5. Predictable — Estimate completion time based on data size and transfer speed
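
Principles 3 and 4 pull in opposite directions, so the retry logic has to tell the two failure classes apart. A minimal sketch in Python (not the tool's PHP), assuming transient failures surface as `OSError` and that a hypothetical `UnrecoverableError` marks failures such as bad auth:

```python
import time


class UnrecoverableError(Exception):
    """Hypothetical marker for failures retrying cannot fix (e.g. bad auth)."""


def run_with_retries(step, max_attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call step() until it succeeds, backing off exponentially between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except UnrecoverableError:
            raise  # fail immediately; no point burning more passes
        except OSError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the transient error
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the delay can be replaced in tests.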

For Customer Experience

  1. Fast — Maximize bandwidth (compression disabled, direct paths)
  2. Minimal disruption — Atomic cutover when ready
  3. Data integrity — Verify all files transferred correctly
  4. Communication — Provide progress updates for customer-facing replies

For Operator

  1. Hands-off — No babysitting required
  2. Alerting — Only notified on problems via webhook/Discord
  3. Audit trail — Full structured log of what happened
  4. Easy restart — Resume from checkpoint if interrupted

Proposed Architecture

Phase 1: Pre-Flight (BLOCKING)

userTransfer.php --preflight USER SOURCE

Before any data transfer:

  1. Auth check — Verify SSH/password works with a simple ls command
  2. Disk space — Compare source size vs target free space
  3. Network connectivity — Verify route to source
  4. Source status — Check if rtorrent/qbittorrent running (warn if source not locked)
  5. Target status — Verify user exists, home writable

Exit with clear error if ANY pre-flight fails. No wasted passes.
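
The blocking behaviour can be sketched as a list of checks that stops at the first failure. This is an illustrative Python sketch, not the PHP implementation: exit code 2 for auth comes from this issue, while exit code 1 for other failures, the 5% headroom, and the function names are assumptions.

```python
import shutil
import sys

EXIT_OK = 0
EXIT_PREFLIGHT_FAILED = 1  # assumed code for non-auth failures
EXIT_AUTH_FAILED = 2  # distinct code so agents can tell auth apart


def check_disk_space(source_bytes, target_path, headroom=0.05):
    """Return an error string if target_path cannot hold source_bytes, else None."""
    free = shutil.disk_usage(target_path).free
    needed = int(source_bytes * (1 + headroom))
    if free < needed:
        return f"need {needed} bytes on {target_path}, only {free} free"
    return None


def run_preflight(checks):
    """Run (name, callable) checks in order; stop at the first failure."""
    for name, check in checks:
        error = check()
        if error is not None:
            print(f"preflight failed: {name}: {error}", file=sys.stderr)
            return EXIT_AUTH_FAILED if name == "auth" else EXIT_PREFLIGHT_FAILED
    return EXIT_OK
```

Each check returns `None` on success or a message on failure, so an auth check (for example a remote `ls` over SSH) plugs in as just another callable.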

Phase 2: Source Lockdown (Optional, Integrated)

userTransfer.php --lockdown-source USER SOURCE

Integrated lockdown that actually works:

  1. SSH to source
  2. Rename .rtorrent.rc → .rtorrent.rc.migration-disabled
  3. Rename .qbittorrentEnable → .qbittorrentEnable.migration-disabled
  4. Kill running torrent processes
  5. Wait 2+ minutes
  6. Verify no respawn
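
The lockdown steps above could be sketched as follows, in Python rather than the tool's PHP. The process names, the assumption that commands run in the user's home directory on the source, and the `run_remote` callable are all hypothetical; only the file names and the 2+ minute wait come from this issue.

```python
import shlex
import time

SUFFIX = ".migration-disabled"
CONFIGS = [".rtorrent.rc", ".qbittorrentEnable"]
# Assumed process names; adjust to whatever the source hosts actually run.
PROCESSES = ["rtorrent", "qbittorrent-nox"]


def lockdown_commands(user):
    """Build the remote shell commands for steps 2-4 (run in the user's home)."""
    commands = []
    for name in CONFIGS:
        # Rename only if present, so running lockdown twice is harmless.
        commands.append(f"if [ -e {name} ]; then mv {name} {name}{SUFFIX}; fi")
    for proc in PROCESSES:
        # pkill exits non-zero when nothing matched; that is not an error here.
        commands.append(f"pkill -u {shlex.quote(user)} -x {proc} || true")
    return commands


def verify_no_respawn(user, run_remote, wait_seconds=150, sleep=time.sleep):
    """Wait past the watchdog interval, then confirm nothing came back.

    run_remote is a hypothetical callable that runs a command on the source
    and returns its exit status; pgrep exits 0 when a process matches.
    """
    sleep(wait_seconds)
    for proc in PROCESSES:
        if run_remote(f"pgrep -u {shlex.quote(user)} -x {proc}") == 0:
            return False
    return True
```

Waiting 150 seconds before checking gives the cron watchdog described in failure mode 4 a chance to fire, which is the case the verification step exists to catch.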

Phase 3: Transfer with Progress

During transfer, maintain status file:

// /var/run/pmss/userTransfer-USER.json
{
  "user": "eyfqehxs",
  "source": "le4-0-105-210funnelout.pulsedmedia.com",
  "target": "le4-0-104-131symmetra.pulsedmedia.com",
  "started": "2026-02-04T05:09:00Z",
  "status": "running",
  "phase": "main",
  "pass": 5,
  "total_passes": 31,
  "bytes_transferred": 15000000000000,
  "bytes_total": 36000000000000,
  "percent": 41.7,
  "eta_seconds": 7200,
  "last_update": "2026-02-04T07:30:00Z",
  "errors": []
}

Agent can poll this file. No log parsing required.
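
The derived fields `percent` and `eta_seconds` could be computed from the byte counters and elapsed time. A rough Python sketch (not the tool's PHP), assuming the ETA is based on the average rate since the start; `build_status` is a hypothetical name and only a subset of the fields is shown:

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_status(user, started, bytes_transferred, bytes_total, now=None):
    """Derive percent and eta_seconds from byte counters and elapsed time."""
    now = now or datetime.now(timezone.utc)
    elapsed = (now - started).total_seconds()
    percent = round(100 * bytes_transferred / bytes_total, 1) if bytes_total else 0.0
    # Average rate so far; an ETA of None means "not enough data yet".
    rate = bytes_transferred / elapsed if elapsed > 0 else 0
    eta = round((bytes_total - bytes_transferred) / rate) if rate > 0 else None
    return {
        "user": user,
        "started": started.strftime(TIME_FORMAT),
        "bytes_transferred": bytes_transferred,
        "bytes_total": bytes_total,
        "percent": percent,
        "eta_seconds": eta,
        "last_update": now.strftime(TIME_FORMAT),
    }
```

An average-rate estimate is the simplest choice; a moving average over recent passes may track speed changes better.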

Phase 4: Transfer Completion

On completion:

  1. Run final sync passes
  2. Run post-setup (permissions, ruTorrent rename)
  3. Verify data integrity (optional checksum mode)
  4. Restore source configs (optional, for rollback capability)
  5. Write completion status
  6. Send webhook notification (if configured)

Phase 5: Notification

userTransfer.php --webhook https://discord.webhook.url USER SOURCE

On completion/failure, POST structured payload:

{
  "event": "migration_complete",
  "user": "eyfqehxs",
  "source": "funnelout",
  "target": "symmetra",
  "duration_seconds": 14400,
  "bytes_transferred": 36000000000000,
  "status": "success"
}
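
Sending that payload is a plain JSON POST. A minimal Python sketch using only the standard library (the tool itself is PHP; `build_webhook_request` is a hypothetical helper):

```python
import json
import urllib.request


def build_webhook_request(url, payload):
    """Prepare a JSON POST; send with urllib.request.urlopen(req, timeout=10)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Discord's webhook endpoint generally expects its own message fields such as `content`, so this payload would probably need wrapping before Discord accepts it; a generic webhook receiver can take it as is.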

CLI Interface

Usage:
  userTransfer.php [OPTIONS] LOCAL_USER REMOTE_HOST
  userTransfer.php --preflight LOCAL_USER REMOTE_HOST
  userTransfer.php --status LOCAL_USER
  userTransfer.php --abort LOCAL_USER

Options:
  --preflight           Run pre-flight checks only, no transfer
  --lockdown-source     Lock down source before transfer (rename configs, kill processes)
  --unlock-source       Restore source configs after transfer
  --status              Show current transfer status (JSON)
  --abort               Abort running transfer gracefully
  --webhook URL         Send completion/failure notification to URL
  --json-log            Output structured JSON logs (for agent parsing)
  --no-sleep            Disable sleep between passes (for fast internal transfers)
  --verify              Verify checksums after transfer (slow but safe)
  --resume              Resume from last checkpoint
  --password-from-stdin Read password from stdin (more secure than env)
  --daemon              Daemonize after starting (no screen needed)

Status File Location

/var/run/pmss/userTransfer-{USER}.json    # Current status
/var/run/pmss/userTransfer-{USER}.pid     # PID file for daemon mode
/var/log/pmss/userTransfer-{USER}.log     # Human-readable log
/var/log/pmss/userTransfer-{USER}.jsonl   # Machine-readable log (JSON lines)
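
With a PID file in place, --abort could plausibly be implemented by signalling the daemon and letting it stop between passes. A Python sketch of that idea follows; the use of SIGTERM and the names `AbortFlag` and `run_passes` are assumptions, not a description of the existing tool.

```python
import signal


class AbortFlag:
    """Set by the signal handler, checked by the transfer loop between passes."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        self.requested = True


def install(flag):
    # SIGTERM is what a plain `kill PID` sends.
    signal.signal(signal.SIGTERM, flag.handle)


def run_passes(passes, flag):
    """Run callables in order, stopping cleanly once an abort was requested."""
    completed = 0
    for run_pass in passes:
        if flag.requested:
            break
        run_pass()
        completed += 1
    return completed
```

Checking the flag only between passes means an abort waits for the current pass to finish, which trades responsiveness for never leaving a pass half done.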

Agent Workflow

# 1. Pre-flight (fails fast if anything wrong)
php /scripts/util/userTransfer.php --preflight eyfqehxs funnelout

# 2. Start transfer with webhook notification
php /scripts/util/userTransfer.php --daemon --webhook "$DISCORD_URL" \
  --lockdown-source eyfqehxs funnelout

# 3. Agent can check progress anytime
jq '.percent' /var/run/pmss/userTransfer-eyfqehxs.json

# 4. Webhook fires on completion — agent doesn't need to poll
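
A webhook cannot fire if the process hangs or is killed, so an agent may still want a cheap staleness check on the status file. A minimal Python sketch, assuming the `status` and `last_update` fields proposed in this issue and an arbitrary 15 minute threshold; `is_stalled` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone


def is_stalled(status, now=None, max_age=timedelta(minutes=15)):
    """True if a running transfer has not updated its status file recently."""
    if status.get("status") != "running":
        return False
    now = now or datetime.now(timezone.utc)
    last = datetime.strptime(status["last_update"], "%Y-%m-%dT%H:%M:%SZ")
    return now - last.replace(tzinfo=timezone.utc) > max_age
```

An agent would load the JSON status file, pass the parsed dict to `is_stalled`, and alert when it returns `True`.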

Minimum Viable Implementation (Phase 1)

If the full redesign is too much, at minimum implement:

  1. --preflight flag — Auth + disk space check before transfer
  2. Exit code 2 for auth failure — Distinct from other errors
  3. Status file — Simple JSON with current pass/percent
  4. --abort flag — Graceful shutdown

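These four changes would have caught the auth, password, monitoring, and disk space incidents (failure modes 1, 2, 3, and 5). The lockdown and resume failures (4 and 6) still need --lockdown-source and --resume.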

Related Lessons

  • sysadmin/memory/lessons/infrastructure/20260201-migration-eyfqehxs-multiple-fuckups.md
  • sysadmin/memory/lessons/infrastructure/20260204-usertransfer-stalled-migration-fuckup.md
  • sysadmin/memory/lessons/infrastructure/20260201-usertransfer-password-must-match-source.md
  • sysadmin/memory/lessons/operations/20260201-migration-source-lockdown-correct-method.md
  • sysadmin/memory/lessons/operations/20260201-penny-migration-lessons.md
  • sysadmin/memory/lessons/customer/20260201-migration-email-must-mention-transfer-time.md
  • sysadmin/memory/lessons/pmss/20260131-usertransfer-php-must-run-from-target-server.md
