Agent with dispatcher by dongwang218 · Pull Request #131 · facebookresearch/matrix

dongwang218 · 2026-03-18T04:19:09Z

Why ?

For fault tolerance, each actor registered with the sink, and the sink used an optimistic timeout to decide when an orchestrator had died. That implementation adds overhead because the right timeout is hard to determine, and it leads to an uneven number of outstanding requests.

How ?

For each agent role, we create a dispatcher. Incoming and outgoing orchestrators both go through the corresponding dispatcher, which makes it easy to keep track of the outstanding orchestrators. When an actor is killed and restarted, its outstanding orchestrators can be marked dead quickly.
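The bookkeeping described above can be illustrated with a minimal, in-process sketch. This is not the code in this PR: `RoleDispatcher`, its methods, and the `orch-*` ids are hypothetical names chosen for illustration, and the real dispatcher runs as a distributed actor rather than a plain Python object.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RoleDispatcher:
    """Hypothetical per-role dispatcher that tracks in-flight orchestrators."""

    role: str
    # actor name -> ids of orchestrators handed to that actor and not yet returned
    outstanding: dict = field(default_factory=lambda: defaultdict(set))
    # ids of orchestrators declared dead after an actor restart
    dead: list = field(default_factory=list)

    def pick_actor(self, actors):
        """Choose the actor with the fewest outstanding orchestrators."""
        return min(actors, key=lambda a: len(self.outstanding[a]))

    def dispatch(self, actor, orchestrator_id):
        """Record that an orchestrator was sent to an actor (incoming path)."""
        self.outstanding[actor].add(orchestrator_id)

    def complete(self, actor, orchestrator_id):
        """Record that the actor passed the orchestrator on (outgoing path)."""
        self.outstanding[actor].discard(orchestrator_id)

    def on_actor_restart(self, actor):
        """Mark everything the restarted actor still held as dead, with no timeout."""
        lost = sorted(self.outstanding.pop(actor, set()))
        self.dead.extend(lost)
        return lost


d = RoleDispatcher("teacher")
d.dispatch("teacher_0", "orch-1")
d.dispatch("teacher_0", "orch-2")
d.complete("teacher_0", "orch-1")    # orch-1 left the actor normally
lost = d.on_actor_restart("teacher_0")
print(lost)                          # ['orch-2']
```

Because every hand-off passes through the dispatcher, the set of lost orchestrators is known exactly at restart time, and the same counts can be used to balance load across instances of a role.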

Test plan

Repeat the test cases from #115.

Deploy the model

8 nodes with 8 H100 GPUs each, 64 Llama 8B model replicas
matrix deploy_applications --action replace --applications "[{'model_name': '/datasets/pretrained-llms/Llama-3.1-8B-Instruct', 'use_grpc': 'true', 'min_replica': 64, 'model_size': '8B', 'name': '8B'}]"

Test coral fault tolerance (baseline run, no actors killed)

python -m matrix.agents.p2p_agents --config-name=coral_mmlu_pro.yaml max_concurrent_tasks=3000 \
    dataset.cut_off=10000 resources.student_llm.matrix_service=8B resources.teacher_llm.matrix_service=8B \
    resources.extractor_llm.matrix_service=8B output.path=/home/dongwang/checkpoint/agent_exp/rebuttal/coral/coral_200k.jsonl.zst debug=False num_trial=20 \
    dataset.name=json dataset.data_files=/home/dongwang/workspace/ts-interaction/data/mmlu_pro/resplit/train.jsonl \
    dataset.split=train agents.answer_extractor.num_instances=1 agents.teacher.num_instances=1 \
    agents.student.num_instances=1 agents.answer_matcher.num_instances=1 \
    resources.teacher_llm.exec_params.timeout_secs=7200 resources.student_llm.exec_params.timeout_secs=7200 \
    resources.extractor_llm.exec_params.timeout_secs=3600 

conv_err: 0.0000
[2026-03-17 05:22:32,078][__main__][INFO] - agreement: 0.9359
[2026-03-17 05:22:32,078][__main__][INFO] - agreement_correctness: 0.4781
[2026-03-17 05:22:32,079][__main__][INFO] - total_turns: 3.7608
[2026-03-17 05:22:32,079][__main__][INFO] - t_len: 424.5190
[2026-03-17 05:22:32,079][__main__][INFO] - s_len: 251.9076
[2026-03-17 05:22:32,080][__main__][INFO] - conv_len: 1956.0046
2026-03-17T03-56-07_0: 100%|██████████| 200000/200000 [1:26:16<00:00, 38.64task/s]
student+teacher completion tokens 168342691+222858225

Same run, with actors killed at random

python ~/workspace/github/matrix/matrix/scripts/kill_ray_actor.py random --actor_names='["teacher_0", "student_0", "answer_extractor_0", "answer_matcher_0"]' --namespace 2026-01-29T19-32-13_0 --max_kills 10 --interval 720

 OVERALL SUMMARY
================================================================================
                    num_periods  total_duration  mean_duration  min_duration  max_duration  total_messages
role
answer_extractor_0            2           0.645          0.322         0.074         0.571             424
answer_matcher_0              2           0.000          0.000         0.000         0.000               2
student_0                     1           2.304          2.304         2.304         2.304            1474
teacher_0                     2           4.138          2.069         1.284         2.854            2180

Total dead periods across all roles: 7
Total death messages: 4080
Total dead time: 7.087s

So actors were killed 7 times, with kills 12 minutes apart (`--interval 720`).

conv_err: 0.0000
[2026-03-17 19:01:22,300][__main__][INFO] - agreement: 0.9470
[2026-03-17 19:01:22,301][__main__][INFO] - agreement_correctness: 0.4856
[2026-03-17 19:01:22,301][__main__][INFO] - total_turns: 3.5505
[2026-03-17 19:01:22,302][__main__][INFO] - t_len: 416.8791
[2026-03-17 19:01:22,302][__main__][INFO] - s_len: 242.2364
[2026-03-17 19:01:22,303][__main__][INFO] - conv_len: 1737.1324
[2026-03-17 19:01:22,305][__main__][WARNING] - Dead/lost tasks: 4080
2026-03-17T17-42-32_0: 100%|██████████| 200000/200000 [1:18:43<00:00, 42.35task/s]
student+teacher completion tokens 143449781+196889205
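As a sanity check on the numbers above, the per-role rows of the summary table can be summed and compared with the reported totals and with the `Dead/lost tasks` warning. The snippet below only re-adds values copied from the logs in this description; the variable names are illustrative.

```python
# (num_periods, total_duration in seconds, total_messages) per role,
# copied from the OVERALL SUMMARY table
rows = {
    "answer_extractor_0": (2, 0.645, 424),
    "answer_matcher_0": (2, 0.000, 2),
    "student_0": (1, 2.304, 1474),
    "teacher_0": (2, 4.138, 2180),
}

total_periods = sum(r[0] for r in rows.values())
total_dead_time = round(sum(r[1] for r in rows.values()), 3)
total_messages = sum(r[2] for r in rows.values())

# Share of the 200000 tasks reported as dead/lost in the run with kills
lost_fraction = total_messages / 200000

print(total_periods, total_dead_time, total_messages)  # 7 7.087 4080
print(f"{lost_fraction:.2%}")                          # 2.04%
```

The totals match the summary (7 dead periods, 7.087s, 4080 death messages), and the 4080 death messages equal the `Dead/lost tasks` count in the final log, which is about 2% of the tasks.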

dongwang218 added 7 commits March 13, 2026 00:04

Add orchestrator dispatcher to simplify fault tolerance

f119d26

retry send to dispatcher and sink in case of network issue

58c3de4

checkout can't be blocking

052d325

change to dispatch push

985c715

fix bug that agents got gc

c7e7f3d

sleep when actor is down

3e0a824

ignore a non-existent submit

264201c

dongwang218 requested review from swdanielli and yangli5t as code owners March 18, 2026 04:19

meta-cla bot added the CLA Signed label Mar 18, 2026

dongwang218 wants to merge 7 commits into main from agent_with_router