@TangSiyang2001
Collaborator

What problem does this PR solve?

Problem Summary:

When a BE goes down, its corresponding tasks never finish until they time out. This PR fixes the problem by adding a daemon thread that cleans them up.
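
The cleanup idea can be illustrated with a minimal sketch. This is not the code from this PR: the class `StuckTaskCleaner`, its methods, and the in-memory maps are hypothetical names chosen for illustration, and only the `MAX_FAILURE_TIMES` value is taken from the review discussion. The sketch assumes a periodic health check per backend and cancels a backend's pending tasks once the check has failed `MAX_FAILURE_TIMES` times in a row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of failure-count based cleanup; not the Doris implementation. */
class StuckTaskCleaner {
    // Same value as MAX_FAILURE_TIMES in the PR; everything else here is assumed.
    static final int MAX_FAILURE_TIMES = 3;

    // backendId -> number of consecutive failed health checks
    private final Map<Long, Integer> failures = new HashMap<>();
    // backendId -> ids of tasks still waiting on that backend
    private final Map<Long, List<Long>> pendingTasks = new HashMap<>();

    void addTask(long backendId, long taskId) {
        pendingTasks.computeIfAbsent(backendId, k -> new ArrayList<>()).add(taskId);
    }

    /** Runs one health-check round for one backend and returns the tasks it cancelled. */
    List<Long> onHealthCheck(long backendId, boolean alive) {
        if (alive) {
            // A successful check resets the consecutive-failure counter.
            failures.remove(backendId);
            return new ArrayList<>();
        }
        int count = failures.merge(backendId, 1, Integer::sum);
        if (count < MAX_FAILURE_TIMES) {
            return new ArrayList<>();
        }
        // The backend is considered unavailable: drop its pending tasks
        // instead of letting them wait for their own timeout.
        List<Long> cancelled = pendingTasks.remove(backendId);
        return cancelled == null ? new ArrayList<>() : cancelled;
    }
}
```

In this sketch, two failed rounds leave the tasks untouched, and the third consecutive failure returns every task that was registered for that backend.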

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@TangSiyang2001 TangSiyang2001 marked this pull request as draft October 31, 2025 11:36
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@TangSiyang2001 TangSiyang2001 force-pushed the fix-agent-batch-task-stuck branch 2 times, most recently from f49b444 to 0fc79e3 Compare October 31, 2025 11:45
@TangSiyang2001 TangSiyang2001 force-pushed the fix-agent-batch-task-stuck branch 5 times, most recently from f95032d to 84a8250 Compare November 4, 2025 09:15
@TangSiyang2001
Collaborator Author

run buildall

@TangSiyang2001 TangSiyang2001 marked this pull request as ready for review November 4, 2025 09:24
@TangSiyang2001 TangSiyang2001 force-pushed the fix-agent-batch-task-stuck branch from 84a8250 to 9a11c54 Compare November 4, 2025 09:38
@TangSiyang2001
Collaborator Author

run buildall

@TangSiyang2001 TangSiyang2001 force-pushed the fix-agent-batch-task-stuck branch from 9a11c54 to 905e5dc Compare November 4, 2025 09:53
@TangSiyang2001
Collaborator Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 189681 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 905e5dcd39499dad0466283c7bce50a44d6ec482, data reload: false

query1	1056	404	388	388
query2	6594	1690	1706	1690
query3	6750	225	228	225
query4	26553	23877	23511	23511
query5	5357	630	463	463
query6	333	236	209	209
query7	4654	488	301	301
query8	300	272	240	240
query9	8703	2561	2582	2561
query10	563	332	285	285
query11	15713	15030	14864	14864
query12	181	119	113	113
query13	1670	551	441	441
query14	11229	9304	9183	9183
query15	198	190	176	176
query16	7684	674	522	522
query17	1317	751	623	623
query18	2065	497	425	425
query19	229	219	196	196
query20	148	151	140	140
query21	228	148	121	121
query22	4709	4768	4677	4677
query23	34717	33560	33762	33560
query24	8664	2505	2521	2505
query25	632	561	493	493
query26	1270	309	176	176
query27	2857	514	392	392
query28	4442	2221	2198	2198
query29	830	633	514	514
query30	300	241	205	205
query31	921	850	778	778
query32	82	73	70	70
query33	602	398	330	330
query34	813	892	535	535
query35	860	863	881	863
query36	1181	1017	928	928
query37	127	113	89	89
query38	3637	3680	3503	3503
query39	1469	1408	1421	1408
query40	220	125	112	112
query41	59	59	55	55
query42	119	109	108	108
query43	483	483	467	467
query44	1217	732	725	725
query45	181	179	171	171
query46	899	996	630	630
query47	1747	1778	1723	1723
query48	402	420	315	315
query49	801	520	431	431
query50	636	696	406	406
query51	3855	3995	3817	3817
query52	110	111	96	96
query53	237	266	202	202
query54	325	306	284	284
query55	86	85	86	85
query56	363	315	316	315
query57	1185	1206	1100	1100
query58	297	292	284	284
query59	2555	2662	2572	2572
query60	371	355	347	347
query61	192	183	185	183
query62	793	742	649	649
query63	235	189	198	189
query64	4634	1297	993	993
query65	4023	4213	4015	4015
query66	1074	487	355	355
query67	15236	14909	15190	14909
query68	8394	880	596	596
query69	503	341	309	309
query70	1342	1321	1277	1277
query71	515	352	317	317
query72	6167	4868	4830	4830
query73	694	570	363	363
query74	9071	8946	8582	8582
query75	4012	3439	2797	2797
query76	3857	1166	741	741
query77	815	406	311	311
query78	9557	9895	9143	9143
query79	2010	830	596	596
query80	640	564	490	490
query81	472	266	232	232
query82	419	163	135	135
query83	267	266	245	245
query84	254	117	89	89
query85	875	488	448	448
query86	339	301	280	280
query87	3700	3700	3623	3623
query88	3410	2200	2216	2200
query89	382	324	297	297
query90	2047	213	215	213
query91	164	165	139	139
query92	87	66	60	60
query93	1139	965	637	637
query94	692	429	340	340
query95	404	320	295	295
query96	486	579	274	274
query97	2920	2996	2883	2883
query98	235	210	234	210
query99	1516	1400	1290	1290
Total cold run time: 279358 ms
Total hot run time: 189681 ms

@doris-robot

ClickBench: Total hot run time: 27.43 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 905e5dcd39499dad0466283c7bce50a44d6ec482, data reload: false

query1	0.05	0.04	0.04
query2	0.09	0.04	0.04
query3	0.25	0.08	0.08
query4	1.61	0.12	0.12
query5	0.27	0.26	0.25
query6	1.17	0.66	0.64
query7	0.03	0.03	0.03
query8	0.05	0.04	0.05
query9	0.59	0.53	0.50
query10	0.57	0.58	0.57
query11	0.16	0.11	0.11
query12	0.16	0.12	0.12
query13	0.61	0.60	0.61
query14	0.99	0.99	1.00
query15	0.84	0.83	0.85
query16	0.39	0.41	0.39
query17	1.04	1.02	1.04
query18	0.22	0.22	0.20
query19	1.91	1.85	1.87
query20	0.01	0.02	0.01
query21	15.44	0.20	0.13
query22	5.08	0.07	0.04
query23	15.70	0.26	0.10
query24	3.07	1.27	0.38
query25	0.07	0.07	0.05
query26	0.14	0.13	0.13
query27	0.06	0.05	0.05
query28	4.38	1.13	0.94
query29	12.59	3.94	3.25
query30	0.27	0.13	0.12
query31	2.82	0.58	0.38
query32	3.23	0.56	0.47
query33	2.99	3.05	3.03
query34	15.83	5.23	4.52
query35	4.49	4.57	4.58
query36	0.68	0.50	0.50
query37	0.09	0.06	0.06
query38	0.07	0.04	0.04
query39	0.03	0.03	0.03
query40	0.17	0.16	0.13
query41	0.09	0.04	0.02
query42	0.04	0.03	0.03
query43	0.04	0.03	0.03
Total cold run time: 98.38 s
Total hot run time: 27.43 s

@TangSiyang2001 TangSiyang2001 force-pushed the fix-agent-batch-task-stuck branch from 905e5dc to 191e48a Compare November 4, 2025 10:57
@TangSiyang2001
Collaborator Author

run buildall

@doris-robot

TPC-DS: Total hot run time: 188815 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 191e48a5b7e97e6062a3d5d32c80538718d8d7f5, data reload: false

query1	1064	411	380	380
query2	6544	1668	1727	1668
query3	6770	223	228	223
query4	26180	23932	23242	23242
query5	5863	631	476	476
query6	329	237	228	228
query7	4643	489	295	295
query8	301	252	239	239
query9	8720	2557	2573	2557
query10	539	339	293	293
query11	15363	15060	14796	14796
query12	190	113	113	113
query13	1673	554	423	423
query14	11983	9064	9114	9064
query15	243	187	168	168
query16	7768	662	485	485
query17	1605	768	622	622
query18	2072	464	365	365
query19	257	250	212	212
query20	157	148	131	131
query21	275	142	126	126
query22	4842	4615	4555	4555
query23	34787	33849	33709	33709
query24	8497	2526	2493	2493
query25	574	573	487	487
query26	1263	280	174	174
query27	2750	511	378	378
query28	4494	2228	2259	2228
query29	785	631	502	502
query30	331	246	207	207
query31	952	833	777	777
query32	105	74	75	74
query33	604	374	327	327
query34	836	863	541	541
query35	843	855	827	827
query36	995	1029	935	935
query37	134	113	100	100
query38	3693	3691	3481	3481
query39	1464	1411	1399	1399
query40	216	126	116	116
query41	59	59	58	58
query42	127	109	114	109
query43	479	488	464	464
query44	1203	736	732	732
query45	180	179	171	171
query46	889	990	628	628
query47	1794	1823	1745	1745
query48	398	414	329	329
query49	760	490	418	418
query50	637	671	408	408
query51	3900	3885	3767	3767
query52	106	104	97	97
query53	236	263	191	191
query54	311	286	267	267
query55	82	82	81	81
query56	318	309	297	297
query57	1198	1195	1124	1124
query58	287	275	267	267
query59	2612	2602	2478	2478
query60	334	339	316	316
query61	162	157	154	154
query62	799	737	684	684
query63	222	192	191	191
query64	4478	1165	852	852
query65	4009	3950	3921	3921
query66	1046	425	326	326
query67	15632	15061	15040	15040
query68	8368	859	583	583
query69	526	333	289	289
query70	1292	1274	1216	1216
query71	498	341	307	307
query72	6060	4847	4875	4847
query73	641	580	355	355
query74	8847	9219	8912	8912
query75	3798	3532	2805	2805
query76	3567	1144	724	724
query77	816	401	318	318
query78	9499	9602	8874	8874
query79	2273	848	591	591
query80	637	566	525	525
query81	508	258	228	228
query82	468	156	128	128
query83	264	260	246	246
query84	249	120	99	99
query85	872	475	426	426
query86	333	321	314	314
query87	3652	3725	3622	3622
query88	3698	2241	2220	2220
query89	388	326	300	300
query90	2003	211	213	211
query91	164	164	134	134
query92	80	69	63	63
query93	1970	963	631	631
query94	714	441	338	338
query95	391	322	314	314
query96	473	569	283	283
query97	2959	2977	2844	2844
query98	245	210	213	210
query99	1588	1419	1327	1327
Total cold run time: 280803 ms
Total hot run time: 188815 ms

@doris-robot

ClickBench: Total hot run time: 27.41 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 191e48a5b7e97e6062a3d5d32c80538718d8d7f5, data reload: false

query1	0.06	0.05	0.06
query2	0.09	0.05	0.05
query3	0.26	0.08	0.09
query4	1.60	0.12	0.11
query5	0.27	0.27	0.25
query6	1.19	0.64	0.63
query7	0.03	0.03	0.03
query8	0.04	0.04	0.04
query9	0.60	0.52	0.52
query10	0.58	0.57	0.58
query11	0.17	0.11	0.12
query12	0.16	0.12	0.12
query13	0.62	0.60	0.61
query14	1.00	1.00	1.01
query15	0.87	0.82	0.82
query16	0.41	0.40	0.39
query17	1.03	1.05	1.00
query18	0.21	0.19	0.20
query19	1.87	1.81	1.77
query20	0.01	0.02	0.02
query21	15.45	0.18	0.14
query22	5.19	0.06	0.05
query23	15.65	0.26	0.10
query24	3.25	0.80	0.46
query25	0.07	0.06	0.06
query26	0.15	0.13	0.12
query27	0.07	0.06	0.05
query28	4.19	1.13	0.92
query29	12.62	3.99	3.26
query30	0.27	0.14	0.11
query31	2.81	0.59	0.38
query32	3.23	0.55	0.48
query33	3.08	3.04	3.09
query34	15.85	5.14	4.51
query35	4.61	4.62	4.55
query36	0.69	0.52	0.49
query37	0.09	0.06	0.06
query38	0.07	0.04	0.04
query39	0.04	0.03	0.02
query40	0.18	0.15	0.13
query41	0.08	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.03	0.03
Total cold run time: 98.79 s
Total hot run time: 27.41 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 47.73% (21/44) 🎉
Increment coverage report
Complete coverage report

gavinchou previously approved these changes Nov 5, 2025
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 5, 2025
@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Nov 10, 2025
@TangSiyang2001
Collaborator Author

run buildall

@TangSiyang2001 TangSiyang2001 force-pushed the fix-agent-batch-task-stuck branch from 3e248c3 to 65794f3 Compare November 10, 2025 05:12
@TangSiyang2001
Collaborator Author

run buildall

@doris-robot

TPC-H: Total hot run time: 34515 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 65794f3d772772de0b6d5230eb80cd0bfd451b0a, data reload: false

------ Round 1 ----------------------------------
q1	17656	5220	5122	5122
q2	2065	325	207	207
q3	10209	1290	727	727
q4	10221	917	373	373
q5	7489	2331	2347	2331
q6	184	170	137	137
q7	897	767	613	613
q8	9349	1400	1139	1139
q9	6860	5161	5186	5161
q10	6824	2242	1841	1841
q11	487	305	277	277
q12	339	363	224	224
q13	17748	3673	3053	3053
q14	235	251	213	213
q15	564	513	510	510
q16	1060	1038	968	968
q17	613	860	383	383
q18	7593	7110	7096	7096
q19	1099	971	579	579
q20	348	352	234	234
q21	3941	3248	2331	2331
q22	1051	1037	996	996
Total cold run time: 106832 ms
Total hot run time: 34515 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5259	5107	5197	5107
q2	252	328	240	240
q3	2121	2748	2268	2268
q4	1373	1804	1378	1378
q5	4198	4610	4624	4610
q6	226	179	135	135
q7	2033	2031	1798	1798
q8	2609	2565	2598	2565
q9	7310	7262	7263	7262
q10	3082	3324	2835	2835
q11	571	547	505	505
q12	681	811	663	663
q13	3591	3949	3499	3499
q14	316	307	300	300
q15	554	522	512	512
q16	1118	1088	1079	1079
q17	1192	1527	1417	1417
q18	8257	7765	7507	7507
q19	847	824	928	824
q20	2015	2149	1910	1910
q21	4997	4507	4390	4390
q22	1075	1048	1003	1003
Total cold run time: 53677 ms
Total hot run time: 51807 ms

@doris-robot

TPC-DS: Total hot run time: 188569 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 65794f3d772772de0b6d5230eb80cd0bfd451b0a, data reload: false

query1	1038	402	400	400
query2	6564	1702	1710	1702
query3	6758	234	224	224
query4	26990	23768	23387	23387
query5	4443	629	478	478
query6	331	250	236	236
query7	4641	512	298	298
query8	306	274	256	256
query9	8720	2598	2614	2598
query10	512	368	299	299
query11	15338	15165	14799	14799
query12	176	124	117	117
query13	1701	569	451	451
query14	10903	9380	9319	9319
query15	204	197	175	175
query16	7760	679	546	546
query17	1274	792	638	638
query18	2035	429	333	333
query19	217	221	188	188
query20	137	125	123	123
query21	221	137	118	118
query22	3990	4350	3897	3897
query23	34105	33040	33180	33040
query24	8404	2453	2492	2453
query25	632	542	527	527
query26	1237	280	163	163
query27	2711	503	352	352
query28	4301	2235	2203	2203
query29	784	616	515	515
query30	305	223	203	203
query31	922	799	728	728
query32	87	73	78	73
query33	591	390	337	337
query34	818	869	550	550
query35	844	857	736	736
query36	957	999	908	908
query37	120	107	90	90
query38	3590	3484	3460	3460
query39	1449	1422	1421	1421
query40	222	129	122	122
query41	64	59	62	59
query42	132	117	110	110
query43	504	493	456	456
query44	1280	751	745	745
query45	189	179	172	172
query46	928	1012	666	666
query47	1761	1793	1702	1702
query48	390	425	320	320
query49	793	528	464	464
query50	680	690	410	410
query51	3909	4049	3912	3912
query52	112	109	103	103
query53	252	285	199	199
query54	312	309	276	276
query55	95	88	84	84
query56	336	334	310	310
query57	1198	1160	1122	1122
query58	289	278	278	278
query59	2534	2670	2654	2654
query60	356	367	330	330
query61	162	159	158	158
query62	855	754	658	658
query63	237	196	199	196
query64	4422	1165	837	837
query65	4041	3918	3966	3918
query66	1100	443	341	341
query67	15258	15187	14930	14930
query68	8531	954	603	603
query69	499	340	294	294
query70	1406	1259	1287	1259
query71	517	344	329	329
query72	6020	5003	4961	4961
query73	680	604	364	364
query74	9117	9146	8923	8923
query75	4070	3346	2808	2808
query76	3829	1163	762	762
query77	808	410	320	320
query78	9621	9638	8936	8936
query79	2929	824	598	598
query80	704	563	487	487
query81	483	271	236	236
query82	448	160	136	136
query83	300	259	247	247
query84	303	121	92	92
query85	912	482	437	437
query86	336	296	303	296
query87	3675	3799	3605	3605
query88	3170	2227	2265	2227
query89	403	339	299	299
query90	2052	217	215	215
query91	168	181	134	134
query92	82	68	63	63
query93	1540	983	646	646
query94	699	457	343	343
query95	399	315	316	315
query96	485	571	275	275
query97	2922	2969	2937	2937
query98	243	211	208	208
query99	1480	1437	1341	1341
Total cold run time: 277358 ms
Total hot run time: 188569 ms

@doris-robot

ClickBench: Total hot run time: 27.38 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 65794f3d772772de0b6d5230eb80cd0bfd451b0a, data reload: false

query1	0.05	0.05	0.05
query2	0.09	0.05	0.05
query3	0.25	0.08	0.08
query4	1.60	0.13	0.12
query5	0.27	0.25	0.24
query6	1.17	0.66	0.64
query7	0.03	0.02	0.02
query8	0.06	0.05	0.04
query9	0.60	0.52	0.51
query10	0.58	0.58	0.58
query11	0.17	0.11	0.11
query12	0.15	0.11	0.12
query13	0.62	0.60	0.60
query14	0.98	1.01	0.99
query15	0.84	0.84	0.83
query16	0.39	0.40	0.40
query17	1.01	1.05	1.04
query18	0.22	0.20	0.20
query19	1.91	1.85	1.86
query20	0.02	0.01	0.01
query21	15.45	0.18	0.13
query22	5.15	0.07	0.05
query23	15.65	0.26	0.10
query24	2.98	0.63	0.26
query25	0.07	0.06	0.05
query26	0.14	0.12	0.14
query27	0.07	0.05	0.06
query28	4.31	1.12	0.94
query29	12.62	3.80	3.22
query30	0.29	0.14	0.12
query31	2.81	0.58	0.38
query32	3.23	0.55	0.47
query33	2.97	3.01	3.06
query34	15.92	5.20	4.59
query35	4.52	4.62	4.56
query36	0.69	0.51	0.49
query37	0.09	0.07	0.07
query38	0.06	0.04	0.04
query39	0.03	0.04	0.03
query40	0.17	0.15	0.14
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 98.41 s
Total hot run time: 27.38 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 43.48% (20/46) 🎉
Increment coverage report
Complete coverage report

@TangSiyang2001
Collaborator Author

run p0

@TangSiyang2001
Collaborator Author

run cloud_p0

@TangSiyang2001
Collaborator Author

run nonConcurrent

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 45.65% (21/46) 🎉
Increment coverage report
Complete coverage report


public class AgentTaskCleanupDaemon extends MasterDaemon {
    private static final Logger LOG = LogManager.getLogger(AgentTaskCleanupDaemon.class);

    public static final Integer MAX_FAILURE_TIMES = 3;
Contributor

3 is too small; the heartbeat interval is sometimes 5 seconds.

Contributor

Actually, a config expressed in seconds would be more readable and easier for users to understand.

Collaborator Author

> 3 is too small; the heartbeat interval is sometimes 5 seconds.

But Config.agent_task_health_check_intervals_ms is actually 5 minutes by default, so reaching >= MAX_FAILURE_TIMES means the BE has been unavailable for more than 10 minutes.
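
The arithmetic behind that reply can be written out as a small sketch. It assumes failures are counted once per health-check round and that rounds are evenly spaced, so N consecutive failures span N - 1 intervals; the class `DowntimeBound` and its method are hypothetical and do not appear in the PR.

```java
import java.time.Duration;

/** Hypothetical helper that only illustrates the arithmetic from the review thread. */
class DowntimeBound {
    /**
     * Lower bound on how long a backend has been failing, given the number of
     * consecutive failed checks and the spacing between checks.
     */
    static Duration minDowntime(int failures, Duration interval) {
        // N consecutive checks span N - 1 intervals; clamp so 0 failures gives zero.
        return interval.multipliedBy(Math.max(0, failures - 1));
    }
}
```

With 3 failures and a 5-minute interval this gives a lower bound of 10 minutes, whereas the same count with a 5-second interval would give only 10 seconds, which is the concern raised in the first comment.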

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 12, 2025
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring merged commit fe602e6 into apache:master Nov 12, 2025
31 of 35 checks passed
wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Nov 13, 2025
… BEs (apache#57591)

When a BE is down, its corresponding tasks never finish until they time out. Fix
this problem by adding a daemon thread to clean them up.
wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Nov 18, 2025
… BEs (apache#57591)

When BE down, corresponding tasks will never finish until timeout. Fix
this problem by adding a daemon thrad to do clean up.
TangSiyang2001 added a commit to TangSiyang2001/doris that referenced this pull request Nov 21, 2025
… BEs (apache#57591)

When BE down, corresponding tasks will never finish until timeout. Fix
this problem by adding a daemon thrad to do clean up.
TangSiyang2001 added a commit to TangSiyang2001/doris that referenced this pull request Dec 8, 2025
… BEs (apache#57591)

When BE down, corresponding tasks will never finish until timeout. Fix
this problem by adding a daemon thrad to do clean up.