Agent with dispatcher#131

Open
dongwang218 wants to merge 7 commits into main from agent_with_router
@dongwang218

Why?

For fault tolerance, each actor previously registered with the sink, and the sink used an optimistic timeout to decide when an orchestrator had died. That implementation adds overhead because a suitable timeout is difficult to determine, and it leads to an uneven number of outstanding requests.

How?

For each agent role, we create a dispatcher. Incoming and outgoing orchestrators both go through the corresponding dispatcher, which makes it easy to keep track of the outstanding orchestrators. When an actor is killed and restarted, its outstanding orchestrators can be marked dead quickly.
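
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `RoleDispatcher`, its methods (`dispatch`, `complete`, `on_actor_restart`), and the `orch-*` ids are hypothetical names chosen for the example, and the real dispatcher presumably runs as a Ray actor rather than a plain object.

```python
from dataclasses import dataclass, field


@dataclass
class RoleDispatcher:
    """Hypothetical per-role dispatcher; names are assumptions, not the PR's API."""

    role: str
    # orchestrator id -> name of the actor instance currently holding it
    outstanding: dict = field(default_factory=dict)
    # orchestrator ids that were lost when their actor was restarted
    dead: set = field(default_factory=set)

    def dispatch(self, orchestrator_id: str, actor_name: str) -> None:
        """Record an incoming orchestrator before it is handed to an actor."""
        self.outstanding[orchestrator_id] = actor_name

    def complete(self, orchestrator_id: str) -> None:
        """Record an outgoing orchestrator: the actor has finished with it."""
        self.outstanding.pop(orchestrator_id, None)

    def on_actor_restart(self, actor_name: str) -> list:
        """Mark everything the restarted actor was holding as dead, with no timeout."""
        lost = [oid for oid, name in self.outstanding.items() if name == actor_name]
        for oid in lost:
            del self.outstanding[oid]
            self.dead.add(oid)
        return lost


dispatcher = RoleDispatcher(role="teacher")
dispatcher.dispatch("orch-1", "teacher_0")
dispatcher.dispatch("orch-2", "teacher_0")
dispatcher.complete("orch-1")                    # orch-1 left the teacher normally
lost = dispatcher.on_actor_restart("teacher_0")  # teacher_0 was killed and restarted
print(lost)
```

Because every orchestrator passes through the dispatcher on the way in and on the way out, the set of outstanding work is known exactly at restart time, so no timeout has to be tuned.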

Test plan

Repeat the test cases in #115.

  • Deploy the model
8 H100x8 nodes, 64 Llama 8B model replicas
matrix deploy_applications --action replace --applications "[{'model_name': '/datasets/pretrained-llms/Llama-3.1-8B-Instruct', 'use_grpc': 'true', 'min_replica': 64, 'model_size': '8B', 'name': '8B'}]"
  • Test coral fault tolerance.
python -m matrix.agents.p2p_agents --config-name=coral_mmlu_pro.yaml max_concurrent_tasks=3000 \
    dataset.cut_off=10000 resources.student_llm.matrix_service=8B resources.teacher_llm.matrix_service=8B \
    resources.extractor_llm.matrix_service=8B output.path=/home/dongwang/checkpoint/agent_exp/rebuttal/coral/coral_200k.jsonl.zst debug=False num_trial=20 \
    dataset.name=json dataset.data_files=/home/dongwang/workspace/ts-interaction/data/mmlu_pro/resplit/train.jsonl \
    dataset.split=train agents.answer_extractor.num_instances=1 agents.teacher.num_instances=1 \
    agents.student.num_instances=1 agents.answer_matcher.num_instances=1 \
    resources.teacher_llm.exec_params.timeout_secs=7200 resources.student_llm.exec_params.timeout_secs=7200 \
    resources.extractor_llm.exec_params.timeout_secs=3600 

conv_err: 0.0000
[2026-03-17 05:22:32,078][__main__][INFO] - agreement: 0.9359
[2026-03-17 05:22:32,078][__main__][INFO] - agreement_correctness: 0.4781
[2026-03-17 05:22:32,079][__main__][INFO] - total_turns: 3.7608
[2026-03-17 05:22:32,079][__main__][INFO] - t_len: 424.5190
[2026-03-17 05:22:32,079][__main__][INFO] - s_len: 251.9076
[2026-03-17 05:22:32,080][__main__][INFO] - conv_len: 1956.0046
2026-03-17T03-56-07_0: 100%|██████████| 200000/200000 [1:26:16<00:00, 38.64task/s]
student+teacher completion tokens 168342691+222858225
  • Repeat the run while killing actors at random
python ~/workspace/github/matrix/matrix/scripts/kill_ray_actor.py random --actor_names='["teacher_0", "student_0", "answer_extractor_0", "answer_matcher_0"]' --namespace 2026-01-29T19-32-13_0 --max_kills 10 --interval 720

 OVERALL SUMMARY
================================================================================
                    num_periods  total_duration  mean_duration  min_duration  max_duration  total_messages
role
answer_extractor_0            2           0.645          0.322         0.074         0.571             424
answer_matcher_0              2           0.000          0.000         0.000         0.000               2
student_0                     1           2.304          2.304         2.304         2.304            1474
teacher_0                     2           4.138          2.069         1.284         2.854            2180

Total dead periods across all roles: 7
Total death messages: 4080
Total dead time: 7.087s

So the actors were killed 7 times, each 12 minutes apart.

conv_err: 0.0000
[2026-03-17 19:01:22,300][__main__][INFO] - agreement: 0.9470
[2026-03-17 19:01:22,301][__main__][INFO] - agreement_correctness: 0.4856
[2026-03-17 19:01:22,301][__main__][INFO] - total_turns: 3.5505
[2026-03-17 19:01:22,302][__main__][INFO] - t_len: 416.8791
[2026-03-17 19:01:22,302][__main__][INFO] - s_len: 242.2364
[2026-03-17 19:01:22,303][__main__][INFO] - conv_len: 1737.1324
[2026-03-17 19:01:22,305][__main__][WARNING] - Dead/lost tasks: 4080
2026-03-17T17-42-32_0: 100%|██████████| 200000/200000 [1:18:43<00:00, 42.35task/s]
student+teacher completion tokens 143449781+196889205
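
As a quick cross-check of the kill-test numbers, the snippet below recomputes the totals from the per-role summary table and relates them to the run size. It assumes the progress bar's 200000 is the total task count and that `--interval 720` is in seconds; the variable names are only for illustration.

```python
# Figures copied from the summary table and logs above.
dead_periods = {
    "answer_extractor_0": 2,
    "answer_matcher_0": 2,
    "student_0": 1,
    "teacher_0": 2,
}
dead_messages = {
    "answer_extractor_0": 424,
    "answer_matcher_0": 2,
    "student_0": 1474,
    "teacher_0": 2180,
}
total_tasks = 200_000      # assumed from the progress bar (200000/200000)
kill_interval_secs = 720   # --interval passed to kill_ray_actor.py

total_periods = sum(dead_periods.values())
total_dead = sum(dead_messages.values())
lost_fraction = total_dead / total_tasks
interval_minutes = kill_interval_secs / 60

print(f"dead periods: {total_periods}")           # 7
print(f"dead/lost tasks: {total_dead}")           # 4080
print(f"fraction of tasks lost: {lost_fraction:.2%}")  # 2.04%
print(f"minutes between kills: {interval_minutes}")    # 12.0
```

The recomputed totals match the "Total death messages" line and the "Dead/lost tasks" warning, which corresponds to roughly 2% of the tasks in the run.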

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 18, 2026