TVTheseus
📖 Paper | 🤗 Model | 📊 Dataset


⭐️ Introduction

We introduce 📺 TVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop TVTheseus, a foundation model specialized for TV navigation.

📕 Core Contributions

  • 🕹️ TVWorld-N is an offline interactive TV navigation environment for evaluating agents' topology-aware planning under focus-based remote-control, supporting both textual and visual goals. Operating purely on static graph assets, it is fully replayable and deployment-free (e.g., no VMs/emulators), and enables millisecond-level interaction, avoiding the instability and overhead of online GUI benchmarks.
  • 🎯 TVWorld-G evaluates focus-aware grounding by requiring the agent to localize the currently highlighted element within the global screen layout using bounding-box annotations, directly reflecting the focus-based nature of TV control.
  • 🤖 TVTheseus is a foundation model trained with the Topology-Aware Training framework, designed for robust and generalizable TV control by leveraging structured UI topology and focus-driven interaction.
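To make the offline, graph-based interaction model concrete, here is a minimal illustrative sketch of a replayable focus-based navigation environment in the spirit of TVWorld-N. All names here (`TVNavEnv`, `step`, the toy graph) are hypothetical and not the repository's actual API; the real environment operates on the released static graph assets.

```python
# Hypothetical sketch of an offline graph-based TV navigation environment.
# Each node is a screen/focus state; each edge is a remote-control action.

class TVNavEnv:
    """Replays focus-based navigation over a static screen graph."""

    def __init__(self, graph, start, goal):
        self.graph = graph  # {state: {action: next_state}}
        self.state = start
        self.goal = goal
        self.steps = 0

    def step(self, action):
        """Apply a remote action (e.g. 'right', 'ok'); an invalid action
        leaves the focus unchanged, i.e. a stalled state."""
        self.state = self.graph[self.state].get(action, self.state)
        self.steps += 1
        done = self.state == self.goal
        return self.state, done


# Toy graph: focus moves along a home row, then 'ok' enters an app.
graph = {
    "home:tile0": {"right": "home:tile1"},
    "home:tile1": {"left": "home:tile0", "ok": "app:main"},
    "app:main": {"back": "home:tile1"},
}
env = TVNavEnv(graph, start="home:tile0", goal="app:main")
for action in ["right", "ok"]:
    state, done = env.step(action)
print(state, done)  # app:main True
```

Because every transition is a dictionary lookup over static assets, interaction is effectively instantaneous, which is what makes millisecond-level, deployment-free evaluation possible.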
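For the grounding side, a standard way to score a predicted bounding box against the annotated box of the currently focused element is intersection-over-union (IoU). The sketch below is illustrative only; the coordinates and threshold conventions are assumptions, not TVWorld-G's exact evaluation code.

```python
# Illustrative focus-grounding check: score a predicted box for the
# highlighted element against the ground-truth annotation with IoU.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) pixel boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Annotated focused element vs. a model prediction (pixel coordinates).
gt = (100, 40, 300, 120)
pred = (110, 50, 310, 130)
print(round(iou(gt, pred), 3))  # 0.711
```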

💫 Open-Source Release

We open-source our complete pipeline to support further research in this area. All code, models, and datasets are publicly available:

  • 🤗 Model: TVTheseus
  • 🤗 Graph Resources: TVWorld
  • 🤗 Benchmark: TVWorld-N & TVWorld-G

📝 Training Pipeline

Dependable TV navigation requires TV-use agents to reason over focus-based UI transitions in a goal-directed manner, while remaining robust to navigation errors such as detours and stalled states. We collectively refer to this interaction-level competence as topology awareness. To embed this latent capability into TV-use agents, we introduce a two-stage training approach that first injects topology-aware inductive biases via topology-priming supervised fine-tuning, and then progressively consolidates them through topology-augmented reinforcement learning. Through this training paradigm, we obtain TVTheseus, a foundation model specialized for robust and generalizable TV control.
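One way to make "topology awareness" operational in the RL stage is to shape rewards with distances over the screen graph: reward steps that shorten the shortest path to the goal, penalize detours, and penalize stalled states where the focus does not move. The sketch below is a hedged illustration of that idea under assumed names and coefficients; it is not the paper's actual reward function.

```python
# Illustrative topology-shaped reward for TV navigation (assumed design,
# not the paper's actual reward): progress toward the goal along the
# screen graph earns reward, detours and stalls are penalized.

from collections import deque


def shortest_dist(graph, src, goal):
    """BFS distance over a {state: [next_state, ...]} adjacency map."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")


def topology_reward(graph, prev_state, new_state, goal):
    """+1 per step of progress toward the goal, -1 per step of detour,
    with an extra penalty when the focus does not move (a stall)."""
    before = shortest_dist(graph, prev_state, goal)
    after = shortest_dist(graph, new_state, goal)
    reward = before - after
    if new_state == prev_state:
        reward -= 0.5  # stall penalty (illustrative coefficient)
    return reward


graph = {"a": ["b"], "b": ["a", "c"], "c": []}
print(topology_reward(graph, "a", "b", "c"))  # progress: 1
print(topology_reward(graph, "b", "a", "c"))  # detour: -1
print(topology_reward(graph, "b", "b", "c"))  # stall: -0.5
```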


🧪 Experiments

🕹️ TVWorld-N

🎯 TVWorld-G


📦 Installation

We recommend following the official VeRL installation guide. Below are the key package versions used in our setup:

python == 3.12
CUDA == 12.8

accelerate == 1.11.0
deepspeed == 0.18.2
flash_attn == 2.8.1
flashinfer-python == 0.3.1
ray == 2.51.1
torch == 2.8.0
transformers == 4.57.1
vllm == 0.11.0
xformers == 0.0.32.post1
xgrammar == 0.1.25
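If it helps, the package versions above can be pinned in one pip command (assuming a Python 3.12 / CUDA 12.8 environment; follow the VeRL guide for platform-specific wheels such as `flash_attn`):

```shell
# Pin the versions listed above; install order and extra index URLs
# may differ on your platform, per the VeRL installation guide.
pip install accelerate==1.11.0 deepspeed==0.18.2 flash_attn==2.8.1 \
    flashinfer-python==0.3.1 ray==2.51.1 torch==2.8.0 \
    transformers==4.57.1 vllm==0.11.0 xformers==0.0.32.post1 \
    xgrammar==0.1.25
```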

🚀 Quick Start

Please refer to the quick-start guide.

🎓 Acknowledgements

This repository is built on SWIRL. We gratefully acknowledge the open-source projects that made this work possible: VeRL, Qwen2.5-VL, vLLM.

🖊️ Citation

If you find TVWorld useful in your project or research, please use the following BibTeX entry to cite our paper and give us a star. Thanks!

@article{ma2026tvworld,
  title={TVWorld: Foundations for Remote-Control TV Agents},
  author={Ma, Zhantao and Lu, Quanfeng and Zhong, Shuai and Yu, Dahai and Luo, Ping and Ng, Michael K},
  journal={arXiv preprint arXiv:2601.13142},
  year={2026}
}
