Scaling EsExecutors with core size 0 might starve work due to missing workers #124667
Labels
blocker
>bug
:Core/Infra/Core
Core issues without another label
Team:Core/Infra
Meta label for core/infra team
This was discovered in cases where `masterService#updateTask` had work enqueued, but no worker to process it.

The root cause of the issue is a bug in `EsExecutors`. When the pool core size is set to 0 and the max pool size is 1 (it is also possible with a higher max pool size, but less likely), `EsThreadPoolExecutor` sometimes fails to add another worker to execute the task because the pool is already at the max pool size (expected). However, in rare cases, the single worker thread (or threads) can time out (based on their `keepAliveTime`) at about the same time as the new task is queued via `ForceQueuePolicy` (triggered by the initial rejection, as we failed to add a worker). Unless more tasks are submitted later (which is not the case for `masterService#updateTask`), this task will starve in the queue without any worker to process it.

The respective code in `EsExecutors` is old and unchanged. We were able to reproduce the bug on `main` using Java 21, 22, and 23, as well as on `8.0` using Java 17. The same is likely possible for older versions of ES.

It looks as if the bug is triggered more frequently with more recent versions of the JDK, but this might just be observation bias, as we weren't aware of this bug earlier.
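To make the end state of the race concrete, here is a minimal sketch using only the plain JDK `ThreadPoolExecutor`. It is not the `EsExecutors` code: `RefusingQueue` and `ForceQueue` are hypothetical, simplified stand-ins for the scaling queue and `ForceQueuePolicy`, and the race itself is not reproduced. Instead, the sketch places a task directly on the queue while the pool has no workers, which is the state the race is assumed to leave behind, and then shows that the task only runs once a later submission creates a worker.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StarvedTaskSketch {

    /** Hypothetical stand-in for a scaling queue: refusing offer() makes execute() try to add a worker first. */
    static final class RefusingQueue extends LinkedBlockingQueue<Runnable> {
        @Override
        public boolean offer(Runnable r) {
            return false;
        }
    }

    /** Hypothetical stand-in for a force-queue policy: a rejected task is put on the queue without checking for workers. */
    static final class ForceQueue implements RejectedExecutionHandler {
        @Override
        public void rejectedExecution(Runnable r, ThreadPoolExecutor executor) {
            try {
                executor.getQueue().put(r);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RejectedExecutionException(e);
            }
        }
    }

    /** Core size 0, max size 1, short keep-alive (50 ms is an arbitrary value chosen for the sketch). */
    static ThreadPoolExecutor newPool() {
        return new ThreadPoolExecutor(0, 1, 50, TimeUnit.MILLISECONDS, new RefusingQueue(), new ForceQueue());
    }

    /** Returns { task ran while the pool had no workers, task ran after a second submission }. */
    static boolean[] demo() {
        ThreadPoolExecutor pool = newPool();
        CountDownLatch ran = new CountDownLatch(1);
        try {
            // Simulate the assumed end state of the race: a task sits in the queue
            // while the pool has zero workers. Nothing will ever start one for it.
            pool.getQueue().put(ran::countDown);
            boolean ranWhileStranded = ran.await(300, TimeUnit.MILLISECONDS);

            // A later submission creates a worker, which then also drains the queue.
            pool.execute(() -> {});
            boolean ranAfterSecondSubmit = ran.await(5, TimeUnit.SECONDS);
            return new boolean[] { ranWhileStranded, ranAfterSecondSubmit };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("ran while stranded: " + result[0] + ", ran after second submit: " + result[1]);
    }
}
```

In the stock `ThreadPoolExecutor#execute`, a task accepted through `offer` is followed by a recheck that starts a worker if the pool is empty. A task that reaches the queue through the rejection handler skips that recheck, which is why a worker timing out at the wrong moment can leave it stranded.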