Document timeout hang points in Elasticsearch download workflow#463
Co-authored-by: timrobertson100 <237221+timrobertson100@users.noreply.github.com>
More information. It logs this kind of thing, and then hangs. Note that these are examples from 2 different runs, so ignore the timestamps. @copilot Please explore whether there is a problem with how Akka handles exceptions in the threads.
…tor processing
Added synchronization to SearchHitConverter.mapTerm() to prevent concurrent HashMap modifications when multiple Akka actors process downloads simultaneously. Also added threading documentation to DownloadMaster.createDownloadActor() explaining the issue.
Co-authored-by: timrobertson100 <237221+timrobertson100@users.noreply.github.com>
... Fixed in commit d5b3d53. The fix adds a …
That may be the cause @copilot, but why did it then just hang when it hit that state?
The hang occurs because of a race condition in the error handling.
The synchronization fix prevents the exception from occurring in the first place. Without it, there is a window in which the exception notification can be lost in Akka's message queue, causing the observed hang.
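The idea behind the synchronization fix can be illustrated with a minimal sketch that uses only the Java standard library. It is not the project's code: `SharedTermCache` is a hypothetical stand-in for a shared, non-thread-safe lookup such as the one behind `mapTerm()`, and plain threads stand in for the Akka actors. It only shows that serializing access to a shared `HashMap` keeps concurrent workers from corrupting it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a shared term lookup backed by a plain HashMap. */
public class SharedTermCache {
    private final Map<String, Integer> terms = new HashMap<>();
    private final Object lock = new Object();

    /** Looks up a term, registering it on first use. The lock serializes all map access. */
    public int mapTerm(String name) {
        synchronized (lock) {
            Integer id = terms.get(name);
            if (id == null) {
                id = terms.size();
                terms.put(name, id);
            }
            return id;
        }
    }

    /** Number of distinct terms registered so far. */
    public int size() {
        synchronized (lock) {
            return terms.size();
        }
    }

    /** Starts several workers that all map the same term names through one shared cache. */
    public static int runConcurrently(int workers, int distinctTerms) {
        SharedTermCache cache = new SharedTermCache();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                for (int i = 0; i < distinctTerms; i++) {
                    cache.mapTerm("term" + i);
                }
            });
        }
        pool.shutdown();
        try {
            // Bounded wait, so the sketch itself cannot hang forever.
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return cache.size();
    }

    public static void main(String[] args) {
        System.out.println(runConcurrently(8, 1000));
    }
}
```

A `ConcurrentHashMap` with `computeIfAbsent` would be another way to get the same guarantee in this sketch. Whether that fits the real TermFactory depends on code not shown in this conversation.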
Investigation Complete: Small Download Hang Issue
Problem Analysis ✅
Summary of Findings
Root Cause 1: Missing Query Timeouts
Small occurrence downloads hang because ES queries lack explicit query-level timeouts. They rely only on HTTP socketTimeout (100 seconds), which may be insufficient when Elasticsearch is slow or overloaded.
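One way to keep a slow call from blocking its caller indefinitely is a caller-side deadline. The sketch below uses only the Java standard library and is not the query-level Elasticsearch timeout discussed above, which would be set on the search request itself. `BoundedQuery`, `countWithDeadline`, and the fallback value are hypothetical names chosen for illustration, and the `Callable` stands in for any blocking count or search call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper that puts an upper bound on how long a blocking query may take. */
public class BoundedQuery {

    /**
     * Runs the query on a worker thread and waits at most timeoutMillis for it.
     * Returns the fallback value if the deadline passes or the wait is interrupted.
     */
    public static long countWithDeadline(Callable<Long> query, long timeoutMillis, long fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Long> future = executor.submit(query);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker that is still blocked
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // Surface query failures instead of silently waiting on a result that never arrives.
            throw new IllegalStateException("query failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        long fast = countWithDeadline(() -> 42L, 1000, -1L);
        long slow = countWithDeadline(() -> {
            Thread.sleep(5000); // simulates a slow or overloaded backend
            return 42L;
        }, 100, -1L);
        System.out.println(fast + " " + slow);
    }
}
```

A deadline like this only protects the caller. The server may keep working on the request, which is why a timeout on the query itself is still worth having.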
Root Cause 2: Thread Safety Issue (NEW)
ConcurrentModificationException occurs in TermFactory.addTerm() when multiple Akka actors process downloads concurrently. The RoundRobinPool creates multiple actors that all call SearchHitConverter.mapTerm(), which uses a shared TermFactory instance without synchronization.
Three Critical Hang Points Documented
1. ElasticDownloadWorkflow.recordCount() - Line 117
2. SearchQueryProcessor.processQuery() - Line 81
3. DownloadMaster.getSearchCount() - Line 203
Fixes Applied
Thread Safety Fix ✅
File: SearchHitConverter.java
Added synchronized block to mapTerm() method
Prevents ConcurrentModificationException when multiple actors access TermFactory
File: DownloadMaster.java
Added threading documentation to createDownloadActor() method
Notes that RoundRobinPool creates concurrent actors
Timeout Configuration
Current Settings (EsConfig):
connectTimeout: 6 seconds (connection establishment)
socketTimeout: 100 seconds (HTTP response wait time)
Documentation Added
Comprehensive timeout and hang point comments added to 6 files:
Recommendations for Further Fixing Timeout Issue
Immediate Fix (Low Risk):
Configuration:
Additional Improvements:
Files Modified
Next Steps