Slackbot
06/01/2023, 5:52 PM
Didip Kerabat
06/01/2023, 6:07 PM
Sergio Ferragut
06/01/2023, 6:35 PM
httpRemote for druid.indexer.runner.type, which has proven better than using remote, which uses ZK to assign tasks. ZK has been known to cause some issues when assigning tasks. I mention it in case you've upgraded and are maybe still using remote.
Abhishek Agarwal
06/02/2023, 3:31 AM
Luiz Augusto
06/02/2023, 10:16 AM
druid.indexer.runner.taskAssignmentTimeout, default is PT5M) then things are ok again.
2023-06-01T17:32:30,139 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
2023-06-01T17:32:29,914 DEBUG [qtp472702055-190] org.apache.druid.jetty.RequestLog - 10.6.125.179 GET //10.6.114.212:8081/druid/indexer/v1/pendingTasks?datasource=${datasource} HTTP/1.1 200
2023-06-01T17:32:06,349 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
2023-06-01T17:31:39,862 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
Also, I can see this log message starts on May 25, so it’s right after the Druid 26 deployment (we upgraded quickly as we needed window functions).
Luiz Augusto
06/02/2023, 10:39 AM
2023-06-02T03:58:39,708 WARN [HttpClient-Netty-Boss-0] org.jboss.netty.channel.SimpleChannelUpstreamHandler - EXCEPTION, please implement org.jboss.netty.handler.codec.http.HttpContentDecompressor.exceptionCaught() for proper handling.
2023-06-02T03:58:39,708 INFO [ServiceClientFactory-2] org.apache.druid.rpc.ServiceClientImpl - Service [index_kafka_${datasource}_ade43dbfaf8b849_kcmanoee] request [GET <http://10.6.149.241:8102/druid/worker/v1/chat/index_kafka_${datasource}_ade43dbfaf8b849_kcmanoee/time/start>] encountered exception on attempt #2; retrying in 4,000 ms (org.jboss.netty.channel.ChannelException: Faulty channel in resource pool)
Luiz Augusto
06/02/2023, 11:04 AM
2023-06-02T10:49:26,131 INFO [HttpRemoteTaskRunner-worker-sync-2] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] location changed on worker[10.6.141.151:8091]. new location[TaskLocation{host='10.6.141.151', port=8100, tlsPort=-1}].
2023-06-02T10:49:26,125 INFO [HttpRemoteTaskRunner-worker-sync-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] started RUNNING on worker[10.6.141.151:8091].
2023-06-02T10:49:26,120 INFO [hrtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigning task [index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] to worker [10.6.141.151:8091]
2023-06-02T10:49:26,113 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
(...)
2023-06-02T10:45:26,112 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
2023-06-02T10:44:53,035 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.UnloadUnusedSegments - Dropping uneeded segment [${datasource}_2023-06-01T00:00:00.000Z_2023-06-02T00:00:00.000Z_2023-06-02T10:42:12.971Z] from server [10.6.141.66:8083] in tier [_default_tier]
2023-06-02T10:44:53,020 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.UnloadUnusedSegments - Dropping uneeded segment [${datasource}_2023-06-01T00:00:00.000Z_2023-06-02T00:00:00.000Z_2023-06-02T10:42:12.971Z] from server [10.6.123.81:8083] in tier [_default_tier]
2023-06-02T10:44:48,170 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:43,209 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:38,143 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:33,272 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
2023-06-02T10:44:33,186 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:32,778 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Adding pending task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z].
2023-06-02T10:44:32,778 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z
2023-06-02T10:44:32,778 INFO [qtp429393578-142] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] to activeTasks
2023-06-02T10:44:32,774 INFO [qtp429393578-142] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z with status: TaskStatus{id=index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z, status=RUNNING, duration=-1, errorMsg=null}
Abhishek Agarwal
06/02/2023, 11:18 AM
druid.indexer.runner.pendingTasksRunnerNumThreads=40
druid.indexer.runner.workerSyncNumThreads=20
Abhishek Agarwal
06/02/2023, 11:18 AM
Abhishek Agarwal
06/02/2023, 11:19 AM
Luiz Augusto
06/02/2023, 11:25 AM
Luiz Augusto
06/02/2023, 11:27 AM
Abhishek Agarwal
06/02/2023, 11:29 AM
Luiz Augusto
06/02/2023, 11:50 AM
Abhishek Agarwal
06/02/2023, 12:01 PM
Luiz Augusto
06/02/2023, 12:06 PM
"how many druid can use. a druid config property."
I can’t find this property. What is it?
"How many active tasks and supervisors do you have btw?"
It should be like ~25, mostly Kafka ingestions, but also a few scheduled batch ones running often. But we’re never reaching this 25-ish as they get stuck in pending.
Abhishek Agarwal
06/02/2023, 12:13 PM
Luiz Augusto
06/02/2023, 12:15 PM
druid.indexer.runner.type, so I believe we were using remote up to 25 and httpRemote for the last week.
Abhishek Agarwal
06/02/2023, 12:49 PM
Luiz Augusto
06/02/2023, 1:01 PM
Abhishek Agarwal
06/02/2023, 1:55 PM
Luiz Augusto
06/05/2023, 10:25 AM
remote solved the issue as soon as the overlords restarted. No more tasks get stuck.
When I get a chance, I’ll try httpRemote again using the configs you suggested.
Gian Merlino
06/07/2023, 11:59 AM
Luiz Augusto
06/21/2023, 5:35 PM
PENDING state for ~5 minutes, while my MMs are at ~40-50% capacity.
Curiously, moving back from the httpRemote runner to the remote runner worked for a while (~10 days?), but we’re back to the same issue.
Last time @Abhishek Agarwal suggested I should try
druid.indexer.runner.pendingTasksRunnerNumThreads=40
druid.indexer.runner.workerSyncNumThreads=20
Checking the code, it looks like remote uses only pendingTasksRunnerNumThreads (default = 1), while httpRemote also uses workerSyncNumThreads (default = 5).
He also suggested increasing the max simultaneous connections to the metadata store, but I’m unsure what this property is. Is it druid.sql.avatica.maxConnections?
Is there anything else I can try here to unstick the Overlord task assignment?
Luiz Augusto
06/21/2023, 6:48 PM
Abhishek Agarwal
06/22/2023, 3:55 AM
druid.indexer.runner.pendingTasksRunnerNumThreads=20
druid.indexer.runner.workerSyncNumThreads=40
It's a bit different from what I suggested earlier. You should also increase the maximum number of DB connections that Druid allows. This can be done through
druid.metadata.storage.connector.dbcp.maxIdle=64
druid.metadata.storage.connector.dbcp.maxTotal=64
64 is just an example.
Luiz Augusto
06/22/2023, 8:49 AM
httpRemote and started slow with the parameters (this one is not a big Druid cluster)
pendingTasksRunnerNumThreads: 1 (default) -> 5
workerSyncNumThreads: 5 (default) -> 10
The results so far are good; tasks were assigned and started immediately. Probably, with pendingTasksRunnerNumThreads=1, if that one thread got stuck somehow, nothing else was scheduled, right?
Luiz Augusto
06/22/2023, 8:59 AM
remote for ~10 days, I’ll keep an eye on this to check if this change actually fixes it.
Sergio Ferragut
06/27/2023, 4:14 PM
Luiz Augusto
06/27/2023, 5:08 PM
pendingTasksRunnerNumThreads: 1 (default) -> 5
workerSyncNumThreads: 5 (default) -> 10
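Put together, the settings discussed in this thread would look roughly like the fragment below. This is only a sketch: it assumes the lines go in the Overlord's runtime.properties and that the short names above map to the full druid.indexer.runner.* properties quoted earlier in the thread. The connection pool lines are the optional suggestion from 06/22, where 64 was given only as an example value.

```properties
# Sketch only: assumed to live in the Overlord's runtime.properties.
druid.indexer.runner.type=httpRemote
# Values reported as working on a small cluster (defaults per the thread: 1 and 5).
druid.indexer.runner.pendingTasksRunnerNumThreads=5
druid.indexer.runner.workerSyncNumThreads=10
# Optional suggestion from the thread; 64 is an example, not a recommendation.
druid.metadata.storage.connector.dbcp.maxIdle=64
druid.metadata.storage.connector.dbcp.maxTotal=64
```

The Overlord presumably needs a restart to pick these up, as the runner switch to remote earlier in the thread also took effect when the overlords restarted.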
Luiz Augusto
06/27/2023, 5:14 PM
Sergio Ferragut
06/27/2023, 6:24 PM
Abhishek Agarwal
06/28/2023, 3:16 AM