# troubleshooting
s
a
I would suggest moving to the HTTP-based task assignment strategy, which is used by default in 26.0.0. Task assignment can be slower over ZooKeeper, and we usually see HTTP behaving better than ZooKeeper. If you still see issues, please report a GitHub bug with thread dumps of the Overlord process when it gets into such a state.
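For reference, the switch being suggested is controlled by the Overlord runtime property below; `remote` is the ZooKeeper-based task runner and `httpRemote` is the HTTP-based one. Verify the exact property against the configuration docs for your version:

```properties
# Overlord runtime.properties: use HTTP instead of ZooKeeper for task management.
# "remote" is the ZooKeeper-based runner; "httpRemote" is the HTTP-based runner
# (the default from Druid 26.0.0 onward, per the message above).
druid.indexer.runner.type=httpRemote
```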
d
Did you turn on auto pruning of tasks? We usually never keep tasks older than 24 hours.
s
Hi @Didip Kerabat, I have not made any changes related to auto pruning. Is there a setting for this?
Hi @Abhishek Agarwal, I have turned on HTTP instead of ZK now. Thanks for the info. It handles more tasks than ZK, for sure. However, every ~2 hours I can see the Overlord's CPU spike higher than usual and task utilisation going down. Our coordinator indexing period is PT1H; can that cause this? Do you know what can cause this?
a
How long does this slow utilization last? Also, what version are you on?
s
We are on Druid 25.0.0. This lasts for ~4 minutes.
a
What's the metadata store DB connection limit on the Overlord? I would suggest increasing that. Otherwise, if the timing is predictable, you can take thread dumps on the Overlord leader machine.
s
Did you turn on auto pruning of tasks?
Are you referring to the recentlyFinishedThreshold parameter?
d
These three settings in overlord.properties. For example:
druid.indexer.logs.kill.enabled={{ env("DRUIDOVERLORD_INDEXER_LOGS_KILL_ENABLED", "true") }}

# This is in milliseconds
# 7200000 = 2 hours
# Number of milliseconds of delay between successive executions of auto kill run.
druid.indexer.logs.kill.delay={{ env("DRUIDOVERLORD_INDEXER_LOGS_KILL_DELAY", "7200000") }}

# This is in milliseconds
# 86400000 = 24 hours
druid.indexer.logs.kill.durationToRetain={{ env("DRUIDOVERLORD_INDEXER_LOGS_KILL_DURATIONTORETAIN", "86400000") }}
s
Hi @Abhishek Agarwal @Didip Kerabat, I found the root cause of the Overlord's high CPU usage. We have 2 datasources for which coordinator-issued kill tasks fail every 2 hours, causing high Overlord CPU usage for 5 minutes until the kill task times out. This is the exception seen in the logs:
java.nio.file.FileSystemException: /mnt/data/task/workerTaskManagerTmp/<temp_task_folder_name>: File name too long
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
If I submit a kill task manually it passes, because the prefix is kill_ instead of coordinator-issued_kill_, which is 18 characters shorter. It looks like the Linux filesystem (ext4 in our case for the MiddleManager instances, as I checked) has a 255-character limit for file names. Is there a way in Druid to change this name? Can we use a UUID for creating folders instead of the task id, since the task id is only needed in metadata? Thanks in advance.
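The UUID-for-folders idea could look roughly like the sketch below. This is illustrative only, not Druid's actual code: a name-based (type 3) UUID maps any task id, however long, to a deterministic 36-character directory name, while the full id stays in the metadata store.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustrative sketch only (not Druid's code): derive a fixed-length,
// filesystem-safe directory name from an arbitrarily long task id.
public class TaskDirNames {
    static String tempDirNameFor(String taskId) {
        // A name-based (type 3) UUID is deterministic: the same task id always
        // maps to the same 36-char directory name, regardless of id length.
        return UUID.nameUUIDFromBytes(taskId.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        String longId = "coordinator-issued_kill_" + "x".repeat(230)
                + "_2023-05-21T00:00:00.000Z_2023-06-19T00:00:00.000Z";
        String dir = tempDirNameFor(longId);
        System.out.println(dir + " (" + dir.length() + " chars)");
    }
}
```

The trade-off is that the directory name is no longer human-readable, so operators debugging on a worker box would need the metadata store to map a folder back to its task.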
a
Can't you increase that char limit? Seems low to me. FWIW, this is the first time I am seeing something like this.
s
@Abhishek Agarwal the task id is 255 chars max. The datasource is part of the task id (in addition to other fields), and the datasource name is also 255 max. The task_id PK should ideally be a UUID, with the time, interval, etc. as separate columns. Thoughts? We are happy to contribute this change, but I am not sure whether it would break compatibility, e.g. old scripts some people have. You could generally hide this behind a config for compatibility, but since this touches the schema, that gets complicated too!
a
It shouldn't impact compatibility in any way. What's the full task id, btw?
s
@Abhishek Agarwal Druid joins the fields here:
static String newTaskId(
      @Nullable String idPrefix,
      String idSuffix,
      DateTime now,
      String typeName,
      String dataSource,
      @Nullable Interval interval
  )
Then there is also "coordinator-issued" and a UUID that get added somewhere else. Here is an example (datasource name removed):
coordinator-issued_kill_<DATASOURCE_NAME>_kbhnnlld_2023-05-21T00:00:00.000Z_2023-06-19T00:00:00.000Z_2023-06-21T05:46:17.991Z.599f0ca8-254a-42a7-a702-8f47ecfcbe1b
Then the trouble is that not all of the joined fields are columns in the task table, so we would perhaps have to add columns to ensure all the info remains with the task. Hopefully we don't parse the id to get the interval, task issue time, etc.
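As a rough illustration of why the id grows, an underscore-join of the non-null parts (my assumption of the shape, not Druid's exact implementation) could look like:

```java
import java.util.ArrayList;
import java.util.List;

// Assumed shape for illustration, not Druid's exact implementation: each extra
// component (prefix, datasource, interval, timestamps, UUID) lengthens the id,
// which is how the "coordinator-issued_" prefix can push the folder name past
// the filesystem's 255-character limit.
public class TaskIdSketch {
    static String joinId(String... parts) {
        List<String> nonEmpty = new ArrayList<>();
        for (String p : parts) {
            if (p != null && !p.isEmpty()) {
                nonEmpty.add(p);
            }
        }
        return String.join("_", nonEmpty);
    }

    public static void main(String[] args) {
        String id = joinId("coordinator-issued", "kill", "<DATASOURCE_NAME>",
                "kbhnnlld", "2023-05-21T00:00:00.000Z", "2023-06-19T00:00:00.000Z",
                "2023-06-21T05:46:17.991Z");
        System.out.println(id + " -> " + id.length() + " chars");
    }
}
```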
a
What's the length of the datasource name, in case you can't share the actual name?
No, we don't parse anything. Though it makes searching easier in the console.
s
@Abhishek Agarwal our datasource name is around 105 chars.
a
That information should be present in the spec.
I still think you should increase the limit on your box, because someone could create a datasource with a bigger name.
Or, if it's a standard limit in Linux, then you can add code to truncate the task id.
s
255 is the default filename-length limit in Linux.
a
I guess you could add code to truncate the id if the length exceeds 255.
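A minimal truncation sketch (again illustrative, not Druid's code): cap the name at 255 characters while keeping it (near) unique by folding a hash of the full id into the tail. Note that ext4's limit is actually 255 bytes, so multi-byte characters would need byte-level accounting:

```java
// Sketch (not Druid's actual code): cap a task-derived directory name at the
// common 255-character ext4 limit, keeping near-uniqueness via a hash suffix.
public class TaskIdTruncator {
    static String truncate(String taskId, int maxLen) {
        if (taskId.length() <= maxLen) {
            return taskId;
        }
        // Hex of hashCode is at most 8 chars; the result is exactly maxLen long.
        String digest = Integer.toHexString(taskId.hashCode());
        return taskId.substring(0, maxLen - digest.length() - 1) + "_" + digest;
    }

    public static void main(String[] args) {
        String longId = "coordinator-issued_kill_" + "x".repeat(300);
        System.out.println(truncate(longId, 255).length()); // prints 255
    }
}
```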