We introduce 📺 TVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop TVTheseus, a foundation model specialized for TV navigation.
- 🕹️ TVWorld-N is an offline interactive TV navigation environment for evaluating agents' topology-aware planning under focus-based remote-control, supporting both textual and visual goals. Operating purely on static graph assets, it is fully replayable and deployment-free (e.g., no VMs/emulators), and enables millisecond-level interaction, avoiding the instability and overhead of online GUI benchmarks.
- 🎯 TVWorld-G evaluates focus-aware grounding by requiring the agent to localize the currently highlighted element within the global screen layout using bounding-box annotations, directly reflecting the focus-based nature of TV control.
- 🤖 TVTheseus is a foundation model trained with the Topology-Aware Training framework, designed for robust and generalizable TV control by leveraging structured UI topology and focus-driven interaction.
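To make the graph-based abstraction concrete, here is a minimal sketch of how an offline, focus-based TV environment like TVWorld-N can be modeled: screens and focus positions are graph nodes, remote-control keys are edges, and navigation reduces to search over a static graph, which is what makes replay millisecond-fast and deployment-free. The toy layout and all names below are illustrative, not the actual TVWorld-N API.

```python
from collections import deque

# Toy focus graph: (screen, focused_element) -> {remote_key: next_state}
# (illustrative only; real TVWorld graphs are built from recorded TV UIs)
GRAPH = {
    ("home", "apps"):        {"RIGHT": ("home", "settings"), "OK": ("apps", "netflix")},
    ("home", "settings"):    {"LEFT": ("home", "apps"), "OK": ("settings", "network")},
    ("apps", "netflix"):     {"BACK": ("home", "apps")},
    ("settings", "network"): {"BACK": ("home", "settings")},
}

def plan_keys(start, goal):
    """BFS over the focus graph: shortest remote-key sequence from start to goal."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, keys = frontier.popleft()
        if state == goal:
            return keys
        for key, nxt in GRAPH.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, keys + [key]))
    return None  # goal unreachable from start

print(plan_keys(("home", "apps"), ("settings", "network")))  # ['RIGHT', 'OK']
```

Because the environment is a static graph, every episode is exactly replayable and a step is a dictionary lookup rather than a VM or emulator round-trip.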
We open-source our complete pipeline to support further research in this area. All code, models, and datasets are publicly available:
| 🤗 Model | 🤗 Graph Resources | 🤗 Benchmark |
|---|---|---|
| TVTheseus | TVWorld | TVWorld-N & TVWorld-G |
Dependable TV navigation requires TV-use agents to reason over focus-based UI transitions in a goal-directed manner, while remaining robust to navigation errors such as detours and stalled states. We collectively refer to this interaction-level competence as topology awareness. To embed this latent capability into TV-use agents, we introduce a two-stage training approach that first injects topology-aware inductive biases via topology-priming supervised fine-tuning, and then progressively consolidates them through topology-augmented reinforcement learning. Through this training paradigm, we obtain TVTheseus, a foundation model specialized for robust and generalizable TV control.
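The two-stage idea can be illustrated with a toy tabular policy (this is a hedged sketch, not the paper's training code): stage 1 "primes" the policy by imitating shortest-path actions on the navigation graph, standing in for topology-priming supervised fine-tuning, and stage 2 reinforces actions along rollouts that reach the goal, standing in for topology-augmented reinforcement learning. All structures and constants here are illustrative.

```python
import random
from collections import deque, defaultdict

# Toy screen graph: state -> {remote_key: next_state} (illustrative only)
GRAPH = {"home": {"RIGHT": "apps", "DOWN": "settings"},
         "apps": {"OK": "netflix", "BACK": "home"},
         "settings": {"BACK": "home"},
         "netflix": {"BACK": "apps"}}

def shortest_action(state, goal):
    """First key on a BFS shortest path from state to goal (the 'SFT' label)."""
    frontier, seen = deque([(state, None)]), {state}
    while frontier:
        s, first = frontier.popleft()
        if s == goal:
            return first
        for key, nxt in GRAPH[s].items():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, first or key))
    return None

# Policy: preference scores over keys for each (state, goal) pair.
policy = defaultdict(lambda: defaultdict(float))

# Stage 1: topology-priming "SFT" -- copy BFS shortest-path labels into the policy.
for s in GRAPH:
    for g in GRAPH:
        a = shortest_action(s, g)
        if a:
            policy[(s, g)][a] += 1.0

# Stage 2: topology-augmented "RL" -- reinforce actions on rollouts that reach the goal.
rng = random.Random(0)
for _ in range(200):
    s, g = rng.choice(list(GRAPH)), rng.choice(list(GRAPH))
    path, cur = [], s
    for _ in range(5):                      # bounded horizon per episode
        if cur == g:
            break
        keys = list(GRAPH[cur])
        a = max(keys, key=lambda k: (policy[(cur, g)][k], rng.random()))
        path.append((cur, a))
        cur = GRAPH[cur][a]
    if cur == g:                            # sparse reward: 1 if the goal is reached
        for st, a in path:
            policy[(st, g)][a] += 0.1
```

The SFT stage gives the policy a topology-consistent prior, and the RL stage consolidates it from outcome reward only, mirroring the inject-then-consolidate structure described above.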
We recommend following the official VeRL installation guide. Below are the key package versions used in our setup:
```
python == 3.12
CUDA == 12.8
accelerate == 1.11.0
deepspeed == 0.18.2
flash_attn == 2.8.1
flashinfer-python == 0.3.1
ray == 2.51.1
torch == 2.8.0
transformers == 4.57.1
vllm == 0.11.0
xformers == 0.0.32.post1
xgrammar == 0.1.25
```
Please refer to this.
This repository is built on SWIRL. We gratefully acknowledge the open-source projects that made this work possible: VeRL, Qwen2.5-VL, vLLM.
If you find TVWorld useful in your project or research, please use the following BibTeX entry to cite our paper and give us a star. Thanks!
```bibtex
@article{ma2026tvworld,
  title={TVWorld: Foundations for Remote-Control TV Agents},
  author={Ma, Zhantao and Lu, Quanfeng and Zhong, Shuai and Yu, Dahai and Luo, Ping and Ng, Michael K},
  journal={arXiv preprint arXiv:2601.13142},
  year={2026}
}
```