# troubleshooting
d
Not enough replication threads in the coordinator dynamic config?
s
Does the overlord log show any errors or warnings? The overlord is what assigns tasks to workers. 26.0 defaults to `httpRemote` for `druid.indexer.runner.type`, which has proven better than using `remote`, which uses ZK to assign tasks. ZK has been known to cause some issues when assigning tasks. I mention it in case you've upgraded and are maybe still using `remote`.
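For reference, a minimal sketch of pinning the runner type explicitly in the Overlord's runtime.properties (illustrative only; `httpRemote` is the 26.0 default mentioned above, `remote` is the older ZK-based runner):
```
# Overlord runtime.properties (illustrative sketch)
# httpRemote is the 26.0 default; remote is the older ZooKeeper-based runner
druid.indexer.runner.type=httpRemote
```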
a
How many tasks do you usually have in the system?
l
I see logs like this over and over. They stick around for up to 5 minutes (which I'm assuming is `druid.indexer.runner.taskAssignmentTimeout`, default PT5M), then things are OK again.
```
2023-06-01T17:32:30,139 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
2023-06-01T17:32:29,914 DEBUG [qtp472702055-190] org.apache.druid.jetty.RequestLog - 10.6.125.179 GET //10.6.114.212:8081/druid/indexer/v1/pendingTasks?datasource=${datasource} HTTP/1.1 200
2023-06-01T17:32:06,349 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
2023-06-01T17:31:39,862 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
```
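The timeout referred to above is an Overlord runtime property; a minimal sketch, in case you want to confirm or adjust it (PT5M is the documented default, and raising it only hides slow assignment rather than fixing it):
```
# Overlord runtime.properties (illustrative sketch)
# how long the Overlord waits for an assigned task to show up on a worker before erroring
druid.indexer.runner.taskAssignmentTimeout=PT5M
```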
Also, I can see this log message starts on May 25, right after the Druid 26 deployment (we upgraded quickly because we needed window functions).
Also, there's a good chunk of exceptions like:
```
2023-06-02T03:58:39,708 WARN [HttpClient-Netty-Boss-0] org.jboss.netty.channel.SimpleChannelUpstreamHandler - EXCEPTION, please implement org.jboss.netty.handler.codec.http.HttpContentDecompressor.exceptionCaught() for proper handling.
2023-06-02T03:58:39,708 INFO [ServiceClientFactory-2] org.apache.druid.rpc.ServiceClientImpl - Service [index_kafka_${datasource}_ade43dbfaf8b849_kcmanoee] request [GET http://10.6.149.241:8102/druid/worker/v1/chat/index_kafka_${datasource}_ade43dbfaf8b849_kcmanoee/time/start] encountered exception on attempt #2; retrying in 4,000 ms (org.jboss.netty.channel.ChannelException: Faulty channel in resource pool)
```
Here's the flow of a task pending for 5 minutes:
```
2023-06-02T10:49:26,131 INFO [HttpRemoteTaskRunner-worker-sync-2] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] location changed on worker[10.6.141.151:8091]. new location[TaskLocation{host='10.6.141.151', port=8100, tlsPort=-1}].
2023-06-02T10:49:26,125 INFO [HttpRemoteTaskRunner-worker-sync-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] started RUNNING on worker[10.6.141.151:8091].
2023-06-02T10:49:26,120 INFO [hrtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigning task [index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] to worker [10.6.141.151:8091]
2023-06-02T10:49:26,113 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
(...)
2023-06-02T10:45:26,112 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
2023-06-02T10:44:53,035 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.UnloadUnusedSegments - Dropping uneeded segment [${datasource}_2023-06-01T00:00:00.000Z_2023-06-02T00:00:00.000Z_2023-06-02T10:42:12.971Z] from server [10.6.141.66:8083] in tier [_default_tier]
2023-06-02T10:44:53,020 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.UnloadUnusedSegments - Dropping uneeded segment [${datasource}_2023-06-01T00:00:00.000Z_2023-06-02T00:00:00.000Z_2023-06-02T10:42:12.971Z] from server [10.6.123.81:8083] in tier [_default_tier]
2023-06-02T10:44:48,170 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:43,209 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:38,143 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:33,272 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
2023-06-02T10:44:33,186 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:32,778 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Adding pending task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z].
2023-06-02T10:44:32,778 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z
2023-06-02T10:44:32,778 INFO [qtp429393578-142] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] to activeTasks
2023-06-02T10:44:32,774 INFO [qtp429393578-142] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z with status: TaskStatus{id=index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z, status=RUNNING, duration=-1, errorMsg=null}
```
a
I debugged a cluster like this before, and the following setting changes on the Overlord helped:
```
druid.indexer.runner.pendingTasksRunnerNumThreads=40
druid.indexer.runner.workerSyncNumThreads=20
```
You may also want to increase the max number of DB connections Druid is allowed to use.
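The connection limit being referred to is the metadata-store connector's DBCP pool size (the exact properties come up again later in this thread); a sketch with illustrative values:
```
# common.runtime.properties (illustrative values only)
druid.metadata.storage.connector.dbcp.maxTotal=64
druid.metadata.storage.connector.dbcp.maxIdle=64
```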
what version were you on before 26? 25 or some older version?
l
25
I'll try to bump workerSyncNumThreads and pendingTasksRunnerNumThreads
a
And the metadata connection limit. If you take a thread dump on the leader Overlord, you can see where the bottleneck is.
l
You mean increase the max number of connections on the metadata server, or how many Druid can use (i.e. a Druid config property)?
a
how many druid can use. a druid config property. How many active tasks and supervisors do you have btw?
l
> how many druid can use. a druid config property.
I can't find this property. What is it?
> How many active tasks and supervisors do you have btw?
It should be ~25, mostly Kafka ingestions, but also a few scheduled batch ones that run often. But we never reach that ~25-ish because they get stuck in pending.
a
were you using httpRemoteTaskRunner or remoteTaskRunner?
l
We haven't defined `druid.indexer.runner.type`, so I believe we were using `remote` up to 25 and `httpRemote` for the last week.
a
If my suggestions don't work, you can also consider switching back to `remote`.
l
Yeah, we had an internal discussion about 15 minutes ago, and since we're under high load right now, we decided to go back to `remote` first and try the changes later in a less critical situation. But thanks @Abhishek Agarwal!
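For completeness, the rollback described here amounts to pinning the older ZK-based runner on the Overlord; a minimal sketch:
```
# Overlord runtime.properties (illustrative sketch): revert to the ZooKeeper-based runner
druid.indexer.runner.type=remote
```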
a
Cool. I am curious to know how it went.
l
Changing to `remote` solved the issue as soon as the Overlords restarted. No more tasks getting stuck. When I get a chance, I'll try `httpRemote` again using the configs you suggested.
g
Thanks for the update; this kind of feedback is helpful!
👍 1
l
I'm back again with the same problem. For the past week, I've again been seeing tasks in PENDING state for ~5 minutes, while my MMs are at ~40-50% capacity. Curiously, moving back from the `httpRemote` to the `remote` runner worked for a while (~10 days?), but we're back to the same issue. Last time @Abhishek Agarwal suggested I should try:
```
druid.indexer.runner.pendingTasksRunnerNumThreads=40
druid.indexer.runner.workerSyncNumThreads=20
```
Checking the code, it looks like `remote` uses only `pendingTasksRunnerNumThreads` (default = 1), while `httpRemote` also uses `workerSyncNumThreads` (default = 5). He also suggested increasing the max simultaneous connections to the metadata store, but I'm unsure what that property is. Is it `druid.sql.avatica.maxConnections`? Anything else I can try here to unstick Overlord task assignment?
Last time I don't remember seeing Kafka ingestions affected by it, but this time they are too. Also, last time PENDING never lasted longer than 5 minutes, but now it can go a bit over 10 minutes!
a
If you are switching to `httpRemote`, you can try:
```
druid.indexer.runner.pendingTasksRunnerNumThreads=20
druid.indexer.runner.workerSyncNumThreads=40
```
it's a bit different from what I suggested earlier. You should also increase the number of max db connections that druid allows. This can be done through
```
druid.metadata.storage.connector.dbcp.maxIdle=64
druid.metadata.storage.connector.dbcp.maxTotal=64
```
64 is just an example
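Putting the two suggestions together, an Overlord-side sketch would look roughly like this (values are the examples from this thread, not tuned recommendations; the DBCP settings are often kept in common.runtime.properties):
```
# httpRemote runner tuning (example values from this thread)
druid.indexer.runner.type=httpRemote
druid.indexer.runner.pendingTasksRunnerNumThreads=20
druid.indexer.runner.workerSyncNumThreads=40
# metadata-store connection pool (example sizing only)
druid.metadata.storage.connector.dbcp.maxTotal=64
druid.metadata.storage.connector.dbcp.maxIdle=64
```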
l
I went back to `httpRemote` and started slow with these parameters (this is not a big Druid cluster):
```
pendingTasksRunnerNumThreads: 1 (default) -> 5
workerSyncNumThreads: 5 (default) -> 10
```
The results so far are good; tasks were assigned and started immediately. Presumably with `pendingTasksRunnerNumThreads=1`, if that single thread gets stuck somehow, nothing else gets scheduled, right?
Since last time moving to `remote` only fixed it for ~10 days, I'll keep an eye on this to check whether the change actually fixes it.
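In runtime.properties form, the change described above would be (a sketch, assuming the properties are set on the Overlord):
```
# Overlord runtime.properties (sketch of the values above)
druid.indexer.runner.pendingTasksRunnerNumThreads=5
druid.indexer.runner.workerSyncNumThreads=10
```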
s
Any update on how httpRemote is working for you?
l
Looks good since I’ve set
```
pendingTasksRunnerNumThreads: 1 (default) -> 5
workerSyncNumThreads: 5 (default) -> 10
```
Maybe at least change the `pendingTasksRunnerNumThreads` default to 2?
s
That’s good to hear. @Abhishek Agarwal thoughts on changing the default?
a
There is a change we are thinking of that will reduce the load on these threads. We will change the defaults after those changes are done.
👍 2