
streamdiffusion sdxl - Error in monitor loop (stuck worker) #835

@eliteprox

Description

Describe the bug

I found a livepeer/ai-runner:live-app-streamdiffusion-sdxl container that had been running for several hours while printing the following logs:

Traceback (most recent call last):
  File "/app/app/live/process/process_guardian.py", line 271, in _monitor_loop
    last_error = self.process.get_last_error()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/app/live/process/process.py", line 449, in get_last_error
    last_error = self.error_queue.get_nowait()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/comfystream/lib/python3.11/multiprocessing/queues.py", line 135, in get_nowait
    return self.get(False)
           ^^^^^^^^^^^^^^^
  File "/workspace/miniconda3/envs/comfystream/lib/python3.11/multiprocessing/queues.py", line 100, in get
    raise ValueError(f"Queue {self!r} is closed")
ValueError: Queue <multiprocessing.queues.Queue object at 0x75bb4b35ca10> is closed
Stack (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/app/live/infer.py", line 227, in <module>
    asyncio.run(
  File "/workspace/miniconda3/envs/comfystream/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/workspace/miniconda3/envs/comfystream/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/workspace/miniconda3/envs/comfystream/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/workspace/miniconda3/envs/comfystream/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/workspace/miniconda3/envs/comfystream/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/workspace/miniconda3/envs/comfystream/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/app/app/live/process/process_guardian.py", line 340, in _monitor_loop
    logging.exception("Error in monitor loop", stack_info=True)
timestamp=2025-11-02 04:23:35 level=ERROR location=process_guardian.py:340:_monitor_loop gateway_request_id=43f4f6e9 manifest_id=ef783006 stream_id=aiJobTesterStream-1762013066117295814 message=Error in monitor loop
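The immediate failure is get_last_error() calling get_nowait() on a multiprocessing queue that has already been closed. A minimal sketch of one possible guard, assuming the queue is a standard multiprocessing.Queue as shown in the traceback (the try/except handling is an assumption about a fix, not the current process.py implementation):

```python
import queue


def get_last_error(error_queue):
    """Return the most recent worker error, or None if there is none.

    error_queue is assumed to be a multiprocessing.Queue, as in the
    traceback above. A closed queue raises ValueError from get(), which
    is exactly the crash seen in _monitor_loop.
    """
    try:
        return error_queue.get_nowait()
    except queue.Empty:
        return None  # no pending error
    except ValueError:
        # Queue was closed (e.g. after the worker process shut down);
        # report "no error" instead of propagating into the monitor loop.
        return None
```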

Reproduction steps

Running go-livepeer v0.8.8 with livepeer/ai-runner:live-app-streamdiffusion-sdxl on a dual-4090 system.

Docker inspect:
"Image": "sha256:f51676ce8332dbad414b9e3daa66a5f5a797e6d54682601d8d910c151a5e1748",
"Image": "livepeer/ai-runner:live-app-streamdiffusion-sdxl",

Commit: a201f99

Expected behaviour

_monitor_loop should not fail, and the /health endpoint should have reported the error state so that the container could be restarted.
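
A hypothetical sketch of how the guardian could record a monitor-loop failure and surface it through a health check (names such as GuardianSketch, last_monitor_error, and health() are illustrative, not the actual process_guardian.py API):

```python
import asyncio
import logging


class GuardianSketch:
    """Illustrative stand-in for the process guardian's monitoring logic."""

    def __init__(self, process):
        self.process = process
        self.last_monitor_error: Exception | None = None

    async def _monitor_loop(self):
        while True:
            try:
                last_error = self.process.get_last_error()
                if last_error is not None:
                    self.last_monitor_error = last_error
            except Exception as exc:
                # Record the failure instead of only logging it, so the
                # health check below can report an error state and the
                # orchestrator can restart the container.
                logging.exception("Error in monitor loop", stack_info=True)
                self.last_monitor_error = exc
            await asyncio.sleep(1)

    def health(self) -> dict:
        # What a /health handler could return once the loop has failed.
        if self.last_monitor_error is not None:
            return {"status": "ERROR", "detail": str(self.last_monitor_error)}
        return {"status": "OK"}
```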

Severity

None

Screenshots / Live demo link

No response

OS

None

Running on

None

AI-worker version

No response

Additional context

No response
