# troubleshooting
s
a
I would suggest moving to the HTTP-based task assignment strategy, which is used by default in 26.0.0. Task assignment can be slower over ZooKeeper, and we usually see HTTP behaving better than ZooKeeper. If you still see issues, please report a GitHub bug with thread dumps of the Overlord process when it gets into such a state.
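For reference, the switch being suggested is controlled by the Overlord runtime property below; `remote` is the ZooKeeper-based task runner and `httpRemote` is the HTTP-based one. Verify the exact property against the configuration docs for your version:

```properties
# Overlord runtime.properties: use HTTP instead of ZooKeeper for task management.
# "remote" is the ZooKeeper-based runner; "httpRemote" is the HTTP-based runner
# (the default from Druid 26.0.0 onward, per the message above).
druid.indexer.runner.type=httpRemote
```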
d
Did you turn on auto pruning of tasks? We usually never keep tasks older than 24 hours.
s
Hi @Didip Kerabat, I have not made any changes related to auto pruning. Is there a setting for this?
Hi @Abhishek Agarwal, I have turned on HTTP instead of ZK now. Thanks for the info. It handles more tasks than ZK, for sure. However, every ~2 hours I can see the Overlord's CPU spike higher than usual and task utilisation going down. Our coordinator indexing period is PT1H; can that cause this? Do you know what can cause this?
a
How long does this slow utilization last? Also, what version are you on?
s
We are on Druid 25.0.0. This lasts for ~4 minutes.
a
What's the metadata store DB connection limit on the Overlord? I would suggest increasing that. Otherwise, if the timing is predictable, you can take thread dumps on the Overlord leader machine.
s
Did you turn on auto pruning of tasks?
Are you referring to the recentlyFinishedThreshold parameter?
d
These three settings in overlord.properties. For example:
druid.indexer.logs.kill.enabled={{ env("DRUIDOVERLORD_INDEXER_LOGS_KILL_ENABLED", "true") }}

# This is in milliseconds
# 7200000 = 2 hours
# Number of milliseconds of delay between successive executions of auto kill run.
druid.indexer.logs.kill.delay={{ env("DRUIDOVERLORD_INDEXER_LOGS_KILL_DELAY", "7200000") }}

# This is in milliseconds
# 86400000 = 24 hours
druid.indexer.logs.kill.durationToRetain={{ env("DRUIDOVERLORD_INDEXER_LOGS_KILL_DURATIONTORETAIN", "86400000") }}
s
Hi @Abhishek Agarwal @Didip Kerabat, I found the root cause of the Overlord's high CPU usage. We have 2 datasources for which coordinator-issued kill tasks fail every 2 hours, causing high Overlord CPU usage for 5 minutes until the kill task times out. This is the exception seen in the logs:
java.nio.file.FileSystemException: /mnt/data/task/workerTaskManagerTmp/<temp_task_folder_name>: File name too long
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
If I submit a kill task manually it passes, because the prefix is kill_ instead of coordinator-issued_kill_, which is 18 characters shorter. It looks like the Linux filesystem (ext4 in our case for the MiddleManager instances, as I checked) has a 255-character limit for file names. Is there a way in Druid to change this name? Can we use a UUID for creating folders instead of the task id, since the task id is only needed in metadata? Thanks in advance.
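The UUID-for-folders idea could look roughly like the sketch below. This is illustrative only, not Druid's actual code: a name-based (type 3) UUID maps any task id, however long, to a deterministic 36-character directory name, while the full id stays in the metadata store.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustrative sketch only (not Druid's code): derive a fixed-length,
// filesystem-safe directory name from an arbitrarily long task id.
public class TaskDirNames {
    static String tempDirNameFor(String taskId) {
        // A name-based (type 3) UUID is deterministic: the same task id always
        // maps to the same 36-char directory name, regardless of id length.
        return UUID.nameUUIDFromBytes(taskId.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String longId = "coordinator-issued_kill_" + "x".repeat(230)
                + "_2023-05-21T00:00:00.000Z_2023-06-19T00:00:00.000Z";
        String dir = tempDirNameFor(longId);
        System.out.println(dir + " (" + dir.length() + " chars)");
    }
}
```

The trade-off is that the directory name is no longer human-readable, so operators debugging on a worker box would need the metadata store to map a folder back to its task.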
a
Can't you increase that char limit? Seems low to me. FWIW, this is the first time I am seeing something like this.
s
@Abhishek Agarwal the task id is 255 chars max. The datasource is part of the task id (in addition to other fields), and the datasource name is also 255 max. The task_id PK should ideally be a UUID, with the time, interval, etc. as separate columns. Thoughts? We are happy to contribute this change, but I am not sure whether it would break compatibility, e.g. old scripts some people have. You could generally hide this behind a config for compatibility, but since this touches the schema, that gets complicated too!
a
It shouldn't impact compatibility in any way. What's the full task id, btw?
s
@Abhishek Agarwal Druid joins the fields here:
static String newTaskId(
      @Nullable String idPrefix,
      String idSuffix,
      DateTime now,
      String typeName,
      String dataSource,
      @Nullable Interval interval
  )
Then there is also "coordinator-issued" and a UUID that get added somewhere else. Here is an example (datasource name removed):
coordinator-issued_kill_<DATASOURCE_NAME>_kbhnnlld_2023-05-21T00:00:00.000Z_2023-06-19T00:00:00.000Z_2023-06-21T05:46:17.991Z.599f0ca8-254a-42a7-a702-8f47ecfcbe1b
Then the trouble is that not all of the joined fields are columns in the task table, so we would perhaps have to add columns to ensure all the info remains with the task. Hopefully we don't parse the id to get the interval, task issue time, etc.
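As a rough illustration of why the id grows, an underscore-join of the non-null parts (my assumption of the shape, not Druid's exact implementation) could look like:

```java
import java.util.ArrayList;
import java.util.List;

// Assumed shape for illustration, not Druid's exact implementation: each extra
// component (prefix, datasource, interval, timestamps, UUID) lengthens the id,
// which is how the "coordinator-issued_" prefix can push the folder name past
// the filesystem's 255-character limit.
public class TaskIdSketch {
    static String joinId(String... parts) {
        List<String> nonEmpty = new ArrayList<>();
        for (String p : parts) {
            if (p != null && !p.isEmpty()) {
                nonEmpty.add(p);
            }
        }
        return String.join("_", nonEmpty);
    }

    public static void main(String[] args) {
        String id = joinId("coordinator-issued", "kill", "<DATASOURCE_NAME>",
                "kbhnnlld", "2023-05-21T00:00:00.000Z", "2023-06-19T00:00:00.000Z",
                "2023-06-21T05:46:17.991Z");
        System.out.println(id + " -> " + id.length() + " chars");
    }
}
```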
a
What's the length of the datasource name, in case you can't share the actual name?
No, we don't parse anything. Though it makes searching easier in the console.
s
@Abhishek Agarwal our datasource name is around 105 chars.
a
That information should be present in the spec.
I still think you should increase the limit on your box, because someone could create a datasource with a bigger name.
Or, if it's a standard limit in Linux, then you can add code to truncate the task id.
s
255 is the default filename-length limit in Linux.
a
I guess you could add code to truncate the id if the length exceeds 255.
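A minimal truncation sketch (again illustrative, not Druid's code): cap the name at 255 characters while keeping it (near) unique by folding a hash of the full id into the tail. Note that ext4's limit is actually 255 bytes, so multi-byte characters would need byte-level accounting:

```java
// Sketch (not Druid's actual code): cap a task-derived directory name at the
// common 255-character ext4 limit, keeping near-uniqueness via a hash suffix.
public class TaskIdTruncator {
    static String truncate(String taskId, int maxLen) {
        if (taskId.length() <= maxLen) {
            return taskId;
        }
        // Hex of hashCode is at most 8 chars; the result is exactly maxLen long.
        String digest = Integer.toHexString(taskId.hashCode());
        return taskId.substring(0, maxLen - digest.length() - 1) + "_" + digest;
    }

    public static void main(String[] args) {
        String longId = "coordinator-issued_kill_" + "x".repeat(300);
        System.out.println(truncate(longId, 255).length()); // prints 255
    }
}
```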