# troubleshooting
d
Not enough replication threads in the coordinator dynamic config?
s
Does the overlord log show any errors or warnings? The overlord is what assigns tasks to workers. 26.0 defaults to `httpRemote` for `druid.indexer.runner.type`, which has proven better than using `remote`, which uses ZK to assign tasks. ZK has been known to cause some issues when assigning tasks. I mention it in case you've upgraded and are maybe still using `remote`.
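For reference, a minimal sketch of pinning the runner type explicitly in the Overlord's runtime.properties (illustrative only; `httpRemote` is the 26.0 default mentioned above, `remote` is the older ZK-based runner):
```
# Overlord runtime.properties (illustrative sketch)
# httpRemote is the 26.0 default; remote is the older ZooKeeper-based runner
druid.indexer.runner.type=httpRemote
```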
a
How many tasks do you usually have in the system?
l
I see logs like this over and over. They stick around for up to 5 minutes (which I'm assuming is `druid.indexer.runner.taskAssignmentTimeout`, default PT5M), then things are OK again.
```
2023-06-01T17:32:30,139 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
2023-06-01T17:32:29,914 DEBUG [qtp472702055-190] org.apache.druid.jetty.RequestLog - 10.6.125.179 GET //10.6.114.212:8081/druid/indexer/v1/pendingTasks?datasource=${datasource} HTTP/1.1 200
2023-06-01T17:32:06,349 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
2023-06-01T17:31:39,862 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_okkgapeo_2023-06-01T17:28:32.721Z] that is known already. Ignored.
```
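The timeout referred to above is an Overlord runtime property; a minimal sketch, in case you want to confirm or adjust it (PT5M is the documented default, and raising it only hides slow assignment rather than fixing it):
```
# Overlord runtime.properties (illustrative sketch)
# how long the Overlord waits for an assigned task to show up on a worker before erroring
druid.indexer.runner.taskAssignmentTimeout=PT5M
```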
Also, I can see this log message starts on May 25, right after the Druid 26 deployment (we upgraded quickly because we needed window functions).
Also, there's a good chunk of exceptions like:
```
2023-06-02T03:58:39,708 WARN [HttpClient-Netty-Boss-0] org.jboss.netty.channel.SimpleChannelUpstreamHandler - EXCEPTION, please implement org.jboss.netty.handler.codec.http.HttpContentDecompressor.exceptionCaught() for proper handling.
2023-06-02T03:58:39,708 INFO [ServiceClientFactory-2] org.apache.druid.rpc.ServiceClientImpl - Service [index_kafka_${datasource}_ade43dbfaf8b849_kcmanoee] request [GET http://10.6.149.241:8102/druid/worker/v1/chat/index_kafka_${datasource}_ade43dbfaf8b849_kcmanoee/time/start] encountered exception on attempt #2; retrying in 4,000 ms (org.jboss.netty.channel.ChannelException: Faulty channel in resource pool)
```
Here's the flow of a task pending for 5 minutes:
```
2023-06-02T10:49:26,131 INFO [HttpRemoteTaskRunner-worker-sync-2] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] location changed on worker[10.6.141.151:8091]. new location[TaskLocation{host='10.6.141.151', port=8100, tlsPort=-1}].
2023-06-02T10:49:26,125 INFO [HttpRemoteTaskRunner-worker-sync-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] started RUNNING on worker[10.6.141.151:8091].
2023-06-02T10:49:26,120 INFO [hrtr-pending-tasks-runner-0] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigning task [index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] to worker [10.6.141.151:8091]
2023-06-02T10:49:26,113 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
(...)
2023-06-02T10:45:26,112 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
2023-06-02T10:44:53,035 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.UnloadUnusedSegments - Dropping uneeded segment [${datasource}_2023-06-01T00:00:00.000Z_2023-06-02T00:00:00.000Z_2023-06-02T10:42:12.971Z] from server [10.6.141.66:8083] in tier [_default_tier]
2023-06-02T10:44:53,020 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.duty.UnloadUnusedSegments - Dropping uneeded segment [${datasource}_2023-06-01T00:00:00.000Z_2023-06-02T00:00:00.000Z_2023-06-02T10:42:12.971Z] from server [10.6.123.81:8083] in tier [_default_tier]
2023-06-02T10:44:48,170 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:43,209 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:38,143 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:33,272 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Assigned a task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] that is known already. Ignored.
2023-06-02T10:44:33,186 INFO [Coordinator-Exec--0] org.apache.druid.server.coordinator.DruidCoordinator - Successfully marked [1] segments of datasource [${datasource}] as unused
2023-06-02T10:44:32,778 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.hrtr.HttpRemoteTaskRunner - Adding pending task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z].
2023-06-02T10:44:32,778 INFO [TaskQueue-Manager] org.apache.druid.indexing.overlord.TaskQueue - Asking taskRunner to run: index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z
2023-06-02T10:44:32,778 INFO [qtp429393578-142] org.apache.druid.indexing.overlord.TaskLockbox - Adding task[index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z] to activeTasks
2023-06-02T10:44:32,774 INFO [qtp429393578-142] org.apache.druid.indexing.overlord.MetadataTaskStorage - Inserting task index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z with status: TaskStatus{id=index_parallel_${datasource}_eonnahel_2023-06-02T10:44:32.773Z, status=RUNNING, duration=-1, errorMsg=null}
```
a
I debugged a cluster like this before, and the following setting changes on the Overlord helped:
```
druid.indexer.runner.pendingTasksRunnerNumThreads=40
druid.indexer.runner.workerSyncNumThreads=20
```
You may also want to increase the max number of DB connections Druid is allowed to use.
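The connection limit being referred to is the metadata-store connector's DBCP pool size (the exact properties come up again later in this thread); a sketch with illustrative values:
```
# common.runtime.properties (illustrative values only)
druid.metadata.storage.connector.dbcp.maxTotal=64
druid.metadata.storage.connector.dbcp.maxIdle=64
```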
what version were you on before 26? 25 or some older version?
l
25
I'll try to bump workerSyncNumThreads and pendingTasksRunnerNumThreads
a
And the metadata connection limit. If you take a thread dump on the leader Overlord, you can see where the bottleneck is.
l
You mean increase the max number of connections on the metadata server, or how many Druid can use (i.e. a Druid config property)?
a
how many druid can use. a druid config property. How many active tasks and supervisors do you have btw?
l
> how many druid can use. a druid config property.
I can't find this property. What is it?
> How many active tasks and supervisors do you have btw?
It should be ~25, mostly Kafka ingestions, but also a few scheduled batch ones that run often. But we never reach that ~25-ish because they get stuck in pending.
a
were you using httpRemoteTaskRunner or remoteTaskRunner?
l
We haven't defined `druid.indexer.runner.type`, so I believe we were using `remote` up to 25 and `httpRemote` for the last week.
a
If my suggestions don't work, you can also consider switching back to `remote`.
l
Yeah, we had an internal discussion about 15 minutes ago, and since we're under high load right now, we decided to go back to `remote` first and try the changes later in a less critical situation. But thanks @Abhishek Agarwal!
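For completeness, the rollback described here amounts to pinning the older ZK-based runner on the Overlord; a minimal sketch:
```
# Overlord runtime.properties (illustrative sketch): revert to the ZooKeeper-based runner
druid.indexer.runner.type=remote
```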
a
Cool. I am curious to know how it went.
l
Changing to `remote` solved the issue as soon as the Overlords restarted. No more tasks getting stuck. When I get a chance, I'll try `httpRemote` again using the configs you suggested.
g
Thanks for the update; this kind of feedback is helpful!
👍 1
l
I'm back again with the same problem. For the past week, I've again been seeing tasks in PENDING state for ~5 minutes, while my MMs are at ~40-50% capacity. Curiously, moving back from the `httpRemote` to the `remote` runner worked for a while (~10 days?), but we're back to the same issue. Last time @Abhishek Agarwal suggested I should try:
```
druid.indexer.runner.pendingTasksRunnerNumThreads=40
druid.indexer.runner.workerSyncNumThreads=20
```
Checking the code, it looks like `remote` uses only `pendingTasksRunnerNumThreads` (default = 1), while `httpRemote` also uses `workerSyncNumThreads` (default = 5). He also suggested increasing the max simultaneous connections to the metadata store, but I'm unsure what that property is. Is it `druid.sql.avatica.maxConnections`? Anything else I can try here to unstick Overlord task assignment?
Last time I don't remember seeing Kafka ingestions affected by it, but this time they are too. Also, last time PENDING never lasted longer than 5 minutes, but now it can go a bit over 10 minutes!
a
If you are switching to `httpRemote`, you can try:
```
druid.indexer.runner.pendingTasksRunnerNumThreads=20
druid.indexer.runner.workerSyncNumThreads=40
```
it's a bit different from what I suggested earlier. You should also increase the number of max db connections that druid allows. This can be done through
```
druid.metadata.storage.connector.dbcp.maxIdle=64
druid.metadata.storage.connector.dbcp.maxTotal=64
```
64 is just an example
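Putting the two suggestions together, an Overlord-side sketch would look roughly like this (values are the examples from this thread, not tuned recommendations; the DBCP settings are often kept in common.runtime.properties):
```
# httpRemote runner tuning (example values from this thread)
druid.indexer.runner.type=httpRemote
druid.indexer.runner.pendingTasksRunnerNumThreads=20
druid.indexer.runner.workerSyncNumThreads=40
# metadata-store connection pool (example sizing only)
druid.metadata.storage.connector.dbcp.maxTotal=64
druid.metadata.storage.connector.dbcp.maxIdle=64
```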
l
I went back to `httpRemote` and started slow with these parameters (this is not a big Druid cluster):
```
pendingTasksRunnerNumThreads: 1 (default) -> 5
workerSyncNumThreads: 5 (default) -> 10
```
The results so far are good; tasks were assigned and started immediately. Presumably with `pendingTasksRunnerNumThreads=1`, if that single thread gets stuck somehow, nothing else gets scheduled, right?
Since last time moving to `remote` only fixed it for ~10 days, I'll keep an eye on this to check whether the change actually fixes it.
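In runtime.properties form, the change described above would be (a sketch, assuming the properties are set on the Overlord):
```
# Overlord runtime.properties (sketch of the values above)
druid.indexer.runner.pendingTasksRunnerNumThreads=5
druid.indexer.runner.workerSyncNumThreads=10
```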
s
Any update on how httpRemote is working for you?
l
Looks good since I’ve set
```
pendingTasksRunnerNumThreads: 1 (default) -> 5
workerSyncNumThreads: 5 (default) -> 10
```
Maybe at least change the `pendingTasksRunnerNumThreads` default to 2?
s
That’s good to hear. @Abhishek Agarwal thoughts on changing the default?
a
There is a change we are thinking of that will reduce the load on these threads. We will change the defaults after those changes are done.
👍 2