# troubleshooting
  • j

    Jvalant Patel

    05/14/2025, 10:42 PM
We are moving from the legacy way of handling null to the latest Druid version, where legacy mode is not supported. I just wanted to get some help here on the best strategy to upgrade Druid when we have null and "" (empty string) values in our datasources and our queries rely on the legacy behavior. If we want to rewrite queries to handle three-valued logic for null comparisons, what should the strategy be? Is there a generalized way to modify the queries? We are still using the native Druid query language.
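A minimal sketch of the kind of rewrite involved, shown in Druid SQL for brevity (the same idea maps to the native query language, e.g. an or filter combining a null check with the original selector); the datasource and column names here are hypothetical:
Copy code
-- Under legacy null handling, "dim" = '' matched both empty strings and nulls.
-- Under SQL-compatible null handling, nulls must be matched explicitly:
SELECT COUNT(*)
FROM "my_datasource"
WHERE ("dim" = '' OR "dim" IS NULL)

-- Negative comparisons also change under three-valued logic:
-- "dim" <> 'x' no longer matches rows where "dim" is null, so use
-- ("dim" <> 'x' OR "dim" IS NULL) to keep the old behavior.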
  • r

    Rohen

    05/19/2025, 1:14 PM
What could be the reason for missing segments? Sometimes the datasource is fully available and sometimes some percentage of segments is missing.
  • r

    Rohen

    05/19/2025, 1:14 PM
We're using AWS Kafka with Druid.
  • u

    Udit Sharma

    05/19/2025, 1:43 PM
Hi, I am facing a weird issue where I have a table with two columns, customer and custId, both containing the same values. But for some reason this query returns customers which are not present in the IN filter.
    Copy code
    select distinct customer from events where __time BETWEEN TIMESTAMP '2025-03-20 12:30:00' 
        AND TIMESTAMP '2025-05-19 13:00:00' AND 
    customer IN (
              '2140', '1060', '2207', '1809', '2985', 
              '3026', '2947', '2955', '2367', '2464', 
              '899', '355', '3284', '3302', '1034', 
              '3015', '2127', '2123', '2731', '2109', 
              '2832', '2479', '2702', '2387', '1804', 
              '1018', '1364', '3467', '1028', '850'
            )
While this one seems to return the right results:
    Copy code
    select distinct custId from events where __time BETWEEN TIMESTAMP '2025-03-20 12:30:00' 
        AND TIMESTAMP '2025-05-19 13:00:00' AND 
    custId IN (
              '2140', '1060', '2207', '1809', '2985', 
              '3026', '2947', '2955', '2367', '2464', 
              '899', '355', '3284', '3302', '1034', 
              '3015', '2127', '2123', '2731', '2109', 
              '2832', '2479', '2702', '2387', '1804', 
              '1018', '1364', '3467', '1028', '850'
            )
    Druid Version : 26.0.0
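One hedged way to narrow this down is to check whether the two columns are actually declared with the same type (string vs. numeric coercion in filters can behave unexpectedly) and to inspect the rows where the two supposedly identical columns disagree; events is the table from the question, run each query separately:
Copy code
-- Declared types of the two columns
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'events' AND COLUMN_NAME IN ('customer', 'custId')

-- Rows in the same interval where the two columns differ
SELECT customer, custId, COUNT(*) AS cnt
FROM events
WHERE __time BETWEEN TIMESTAMP '2025-03-20 12:30:00' AND TIMESTAMP '2025-05-19 13:00:00'
  AND customer <> custId
GROUP BY 1, 2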
  • j

    JRob

    05/22/2025, 5:53 PM
I'm following the documentation to set up Protobuf parsing here, but I get the following error:
    Copy code
    Cannot construct instance of `org.apache.druid.data.input.protobuf.FileBasedProtobufBytesDecoder`, problem: Cannot read descriptor file: file:/tmp/metrics.desc at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 1090] (through reference chain: org.apache.druid.indexing.kafka.KafkaSamplerSpec["spec"]->org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorSpec["ioConfig"]->org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorIOConfig["inputFormat"]->org.apache.druid.data.input.protobuf.ProtobufInputFormat["protoBytesDecoder"])
I suspect that Druid is trying to download the file over HTTP, but we would never expose /tmp to the internet. Why doesn't it just grab the file locally? For example, this works:
    Copy code
    {
      "type": "index_parallel",
      "spec": {
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "local",
            "baseDir": "/tmp/",
            "filter": "metrics.desc"
          }
        },
        "tuningConfig": {
          "type": "index_parallel"
        }
      }
    }
However, I can't get this working with inputFormat.
  • u

    Utkarsh Chaturvedi

    05/23/2025, 10:13 AM
Hi folks. Our team is routinely facing 504s when submitting ingestion tasks. Our cluster is set up on k8s using Helm. What we're observing is that the task actually gets registered in Druid, but the response is delayed beyond the nginx/Cloudflare timeout. So when we re-trigger the ingestion, it fails due to overlapping locked segments. Is there any way to resolve the main issue of the task not responding with the registered task ID in time? We can increase the timeouts but would prefer tackling the root problem.
  • b

    Brindha Ramasamy

    05/23/2025, 6:30 PM
Hi, we are not configuring connection pool details explicitly in common.runtime.properties (Druid 30.0). What are the default values, and where can I find that config?
  • r

    Rohen

    05/26/2025, 1:44 PM
Hi, while setting up Druid on EKS using Helm, we want to use authentication, using the druid-basic-security extension for this case. As per the documentation, the following is what is given, but it is not accepted by the pods:
druid.auth.authenticatorChain: '["MyBasicMetadataAuthenticator"]'
druid.auth.authenticator.MyBasicMetadataAuthenticator.type: "basic"
druid.auth.authorizers: '["MyBasicMetadataAuthorizer"]'
druid.auth.authorizer.MyBasicMetadataAuthorizer.type: "basic"
druid.escalator.type: "basic"
druid.escalator.internalClientUsername: "druid_system"
druid.escalator.internalClientPassword: "your_internal_password"
druid.escalator.authorizerName: "MyBasicMetadataAuthorizer"
Is there any specific format we need to maintain? Ref - https://github.com/asdf2014/druid-helm
  • j

    JRob

    05/28/2025, 9:14 PM
    Has anyone else been able to enable the TaskCountStatsMonitor? I get errors on Middle Manager startup:
    Copy code
    1) No implementation for org.apache.druid.server.metrics.TaskCountStatsProvider was bound.
      while locating org.apache.druid.server.metrics.TaskCountStatsProvider
        for the 1st parameter of org.apache.druid.server.metrics.TaskCountStatsMonitor.<init>(TaskCountStatsMonitor.java:40)
  • h

    Hardik Bajaj

    05/29/2025, 7:12 PM
Hey Team! I noticed the Druid query-failed count metric only shows 500 errors and not 401s when the request comes from an unauthorized user (basic security). I also couldn't find any metric that tells which username a request came from. Does anyone know if there is any metric, or any observability in the logs, available for this? It makes it difficult to know for sure that no one is using a user's credentials when we want to delete that user. TIA!
  • s

    Seki Inoue

    06/02/2025, 4:47 PM
Hello, I have a datasource with a very long name and it causes the following error when spawning the kill task. Once it happens, the entire cluster gets unstable for around 30 minutes and no new tasks are allocated even though the MiddleManagers have free slots. Indeed, the file being opened, .coordinator-issued_kil..., has a 265-byte name, which exceeds the XFS limit of 255 bytes. Do you know any workaround to forcibly kill those segments?
    Copy code
    2025-05-30T22:10:42,465 ERROR [qtp214761486-125] org.apache.druid.indexing.worker.WorkerTaskManager - Error while trying to persist assigned task[coordinator-issued_kill_<deducted_long_datasource_name_119_bytes>]
    java.nio.file.FileSystemException: var/tmp/persistent/task/workerTaskManagerTmp/.coordinator-issued_kill_<deducted_long_datasource_name_119_bytes>_dfhlgdae_2024-07-10T23:00:00.000Z_2024-07-18T00:00:00.000Z_2025-05-30T22:10:42.417Z.2aababbd-02a6-4002-9b9f-cba30bbea8a7: File name too long
    	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
    	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
    	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
    	at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181) ~[?:?]
    	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]
    	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]
    	at org.apache.druid.java.util.common.FileUtils.writeAtomically(FileUtils.java:271) ~[druid-processing-33.0.0.jar:33.0.0]
    ...
  • a

    Asit

    06/03/2025, 4:25 AM
Hello, I am not able to query the segments table from the metadata store, and the Segments tab in the Druid console is also timing out. Is there any way I can get the segment count, or increase memory so I can get the segment information?
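If sys.segments itself is timing out, one hedged workaround is to count segments directly in the metadata store (Postgres/MySQL), assuming the default druid_segments table name:
Copy code
-- Run against the metadata database, not Druid SQL
SELECT dataSource, COUNT(*) AS segment_count
FROM druid_segments
WHERE used = true
GROUP BY dataSource
ORDER BY segment_count DESC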
  • j

    Jon Laberge

    06/04/2025, 5:35 AM
👋, I am trying to get a cluster up and running on GKE. My requirements are to use cloud storage for deep storage/logs and Postgres/Cloud SQL for metadata. I'm using the druid-operator to deploy my cluster, with a ZK-less deployment. This partially works, but I face the problem described in this issue. The most recent suggestion is to use kubernetes-overlord-extensions, however I see this error when the overlord is trying to start:
    Copy code
    Caused by: org.apache.commons.lang3.NotImplementedException: this druid.indexer.logs.type [class org.apache.druid.storage.google.GoogleTaskLogs] does not support managing task payloads yet. You will have to switch to using environment variables
    Is there something I should be changing in my task template?
  • j

    Jimbo Slice

    06/06/2025, 9:55 PM
Got a really strange issue, guys. The following query returns the error: "Query results were truncated midstream. This may indicate a server-side error or a client-side issue. Try re-running your query using a lower limit."
    SELECT
    COUNT(*) As Entries,
    SUM(packets) as Packets,
    SUM(bytes) as Bytes,
    (SUM(bytes) / SUM(packets)) as AvgPacketSizeBytes,
    MIN(__time) as FirstSeen,
    MAX(__time) as LastSeen,
    TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time)) as DurationSeconds,
    (SUM(bytes) * 8 / TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time))) as AvgMbps,
    "pkt-srcaddr", "pkt-dstaddr", "protocol"
    FROM "AWSLogsVPC"
    WHERE "log-status"!='NODATA' AND "pkt-srcaddr"!='-' AND "action"='ACCEPT'
    GROUP BY "pkt-srcaddr", "pkt-dstaddr", "protocol"
But when I remove the TIMESTAMPDIFF division from AvgMbps, this does not happen:
    SELECT
    COUNT(*) As Entries,
    SUM(packets) as Packets,
    SUM(bytes) as Bytes,
    (SUM(bytes) / SUM(packets)) as AvgPacketSizeBytes,
    MIN(__time) as FirstSeen,
    MAX(__time) as LastSeen,
    TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time)) as DurationSeconds,
    (SUM(bytes) * 8) as AvgMbps,
    "pkt-srcaddr", "pkt-dstaddr", "protocol"
    FROM "AWSLogsVPC"
    WHERE "log-status"!='NODATA' AND "pkt-srcaddr"!='-' AND "action"='ACCEPT'
    GROUP BY "pkt-srcaddr", "pkt-dstaddr", "protocol"
I've tried removing the WHERE clause because != is bad practice; no difference. I believe there is an issue here with subquerying (druid.server.http.maxSubqueryRows), however this is not a subquery; it is a simple calculation in a simple query. The query runs perfectly without TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time)) in AvgMbps. Any ideas on what could be wrong?
  • b

    Ben Krug

    06/06/2025, 10:07 PM
I don't know whether it has to do with data sizes or timings, but I wonder whether the division is a problem somehow. Is DurationSeconds ever 0 or null? Just curious...
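If the division does turn out to be the culprit, a minimal sketch of a guard is to wrap the duration in NULLIF so that zero-second groups produce NULL instead of dividing by zero:
Copy code
-- Replacement for the AvgMbps expression; groups where MIN(__time) = MAX(__time)
-- now yield NULL rather than a division by zero
(SUM(bytes) * 8 / NULLIF(TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time)), 0)) as AvgMbps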
  • v

    venkat

    06/07/2025, 8:50 AM
👋 Hello, team!
  • v

    venkat

    06/07/2025, 8:54 AM
I have a small doubt: we actually have 6 historical services, but one of them is not visible in the Druid console. When I check with ps -ef | grep historical, it shows as running, so why am I unable to see it in the Druid console? Any idea?
  • s

    sandy k

    06/09/2025, 4:43 AM
Using the coordinator API http://x.x.x.x:8081/druid/coordinator/v1/servers?full, I get segments per data node. All servers show their segments except one, which shows a very low, single-digit segment count. But when I restart this specific data node, it starts up showing a much higher loading count. For example, server1: 2025-06-09T02:52:35,172 INFO [main] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment cache file [1/33036][/data01/druid/segment, versus server2: 2025-06-08T18:22:52,245 INFO [main] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment cache file [1/14294]. The node shows up in the UI with the service running. This problematic data node disconnects and also has ZooKeeper connectivity issues compared to other nodes. Is this node not part of the cluster?
  • r

    Rohen

    06/09/2025, 5:53 AM
Has anyone implemented druid-security with Druid deployed using Helm on EKS?
  • r

    Rushikesh Bankar

    06/09/2025, 10:32 AM
Hi Team 👋 We recently discovered an issue with the Kubernetes service discovery and created this issue with more details: https://github.com/apache/druid/issues/18090. To summarize:
• The current implementation of the Kubernetes service discovery leaves the responsibility to announce or unannounce the node/pod on the node/pod itself.
• But this causes issues when the pod is shut down abruptly due to a node-not-ready condition or node failures from the CSP.
• We observed 3-4 instances where this happened: the Druid master nodes and the broker continued to detect the faulty pod, which resulted in a monotonically increasing load queue size and choking of Jetty threads on the broker due to all the queries timing out on the faulty historical node. This had a much more extended impact than what we see with any ZK-based Druid cluster.
I am proposing this fix: https://github.com/apache/druid/pull/18089. It has been tested on a Druid cluster by reproducing the abrupt termination. Could you please help me with the review? Thanks! cc: @kfaraz
  • j

    JRob

    06/10/2025, 3:19 PM
Our calls to sys.segments are taking upwards of 60 seconds on average. Likewise, the Datasources tab in the console takes an agonizingly long time to load. But I can't understand why it's so slow; our DB stats don't show any issues. The druid_segments table is only 1108 MB in size. From pg_stat_statements:
    Copy code
    query            | SELECT payload FROM druid_segments WHERE used=$1
    calls            | 734969
    total_exec_time  | 1318567198.0990858
    min_exec_time    | 733.308662
    max_exec_time    | 13879.650989
    mean_exec_time   | 1794.0446441947086
    stddev_exec_time | 581.4299142612549
    ----------------------------------------------
    query            | SELECT payload FROM druid_segments WHERE used = $1 AND dataSource = $2 AND ((start < $3 AND "end" > $4) OR (start = $7 AND "end" != $8 AND "end" > $5) OR (start != $9 AND "end" = $10 AND start < $6) OR (start = $11 AND "end" = $12))
    calls            | 4888478
    total_exec_time  | 31912869.00381691
    min_exec_time    | 0.007730999999999999
    max_exec_time    | 2166.647028
    mean_exec_time   | 6.528180960171064
    stddev_exec_time | 25.333075336970094
    ----------------------------------------------
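A couple of hedged diagnostics against the metadata database (Postgres assumed, default table name) can show whether that used=$1 scan is reading far more rows and payload bytes than expected, for example because unused segments were never killed, or because the table is bloated:
Copy code
-- How many rows and how much payload the used=true scan has to read
SELECT used, COUNT(*) AS row_count, pg_size_pretty(SUM(length(payload))) AS payload_size
FROM druid_segments
GROUP BY used;

-- Total on-disk size of the table including indexes and dead tuples
SELECT pg_size_pretty(pg_total_relation_size('druid_segments'));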
  • d

    Dinesh

    06/12/2025, 4:55 AM
Hello. There is a problem we have been facing these days: when batch ingestion tasks (index_parallel) are about to complete, the task status first changes to 'None' and eventually the task fails with an error, even though everything looks good during task execution: Unknown exception / org.apache.druid.rpc.ServiceNotAvailableException: Service [overlord] issued redirect to unknown URL [http://10.XX.XX.18:8081/druid/indexer/v1/tasks] / java.lang.RuntimeException
  • d

    Dinesh

    06/12/2025, 5:26 AM
It has become a big bottleneck for us. Can someone please guide us on this?
  • r

    Riccardo Sale

    06/16/2025, 10:11 AM
Hello! Our use case for Druid is unusual in that we have thousands of datasources. We recently experienced RDS CPU spikes during metric creation, which were mitigated by modifying the following value: druid.audit.manager.maxPayloadSizeBytes. Looking at the coordinator.compaction.config field, we have seen that this JSON payload has grown to over 30 MB and is still causing slowdowns when queried. As an example, the following query: SELECT payload FROM druid_segments WHERE used=? takes up to three seconds. Any suggestions to solve the above issue? How can we reduce the overall size of the payload in coordinator.compaction.config? Would it be possible to write a custom extension for this specific use case? Thanks in advance!
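As a hedged first step, the sizes of the stored config payloads can be inspected directly in the metadata database (Postgres assumed, default druid_config table) to confirm which entries have ballooned and to track them over time:
Copy code
-- List stored config entries by payload size;
-- coordinator.compaction.config should show up near the top
SELECT name, octet_length(payload) AS payload_bytes
FROM druid_config
ORDER BY payload_bytes DESC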
  • r

    Rajesh Gottapu

    06/17/2025, 5:19 AM
Hi all, Druid is crashing with the below exceptions in the supervisor logs. Any help would be appreciated. Thanks.
{ "timestamp": "2025-06-17T04:52:07.071Z", "exceptionClass": "org.apache.druid.rpc.ServiceClosedException", "message": "org.apache.druid.rpc.ServiceClosedException: Service [index_kafka_zpn_pse_0ed475b16f43ae1_gedegecj] is closed", "streamException": false },
{ "timestamp": "2025-06-17T04:56:18.294Z", "exceptionClass": "org.apache.druid.rpc.ServiceClosedException", "message": "org.apache.druid.rpc.ServiceClosedException: Service [index_kafka_zpn_pse_datasource_ed342e7ec84bbb9_hjdhclha] is closed", "streamException": false },
{ "timestamp": "2025-06-17T04:56:35.179Z", "exceptionClass": "org.apache.druid.rpc.ServiceClosedException", "message": "org.apache.druid.rpc.ServiceClosedException: Service [index_kafka_zpn_pse_datasource_ed342e7ec84bbb9_klpmnmkc] is closed", "streamException": false }
  • n

    Nir Bar On

    06/17/2025, 11:06 AM
Hey all, working with Druid 25.0.0. The auto-compaction task configuration shows this, but in the documentation this field has a default value. Question: what is the meaning of the legacy setting? Is inputSegmentSizeBytes deprecated / used / not used under the hood for compaction?
  • n

    Nir Bar On

    06/17/2025, 12:30 PM
Question regarding the MiddleManager / ingestion tasks: can a Druid task be configured to first validate that there is enough free disk space on "/var/druid/task" (the directory used for task/segment creation) before starting task execution? I have had some cases where tasks hit out-of-space exceptions at the disk level. Can we have some validation of the disk space before a task starts?
  • n

    Nir Bar On

    06/17/2025, 12:48 PM
On compaction tasks I discovered that at some point in time the capacity used in the "/var/druid/task" directory is 1.7 GB on disk. Can I reduce the maximum disk space a compaction task uses by changing some configuration on the compaction task?
  • c

    Cristi Aldulea

    06/18/2025, 7:46 AM
Hi all, I'm working with Apache Druid and have introduced a second timestamp column called ingestionTimestamp to support a deduplication job. Additionally, I have a column named tags, which is a multi-value VARCHAR column. The deduplication is performed using an MSQ (multi-stage query) like the following:
    Copy code
    REPLACE INTO "target-datasource" 
    OVERWRITE 
    WHERE "__time" >= TIMESTAMP'__MIN_TIME' 
      AND "__time" < TIMESTAMP'__MAX_TIME'
    
    SELECT 
        __time,
        LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
        LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
        LATEST_BY("tagSetA", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetA",
        LATEST_BY("tagSetB", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetB",
        MAX("ingestionTimestamp") AS ingestionTimestamp
    FROM "target-datasource"
    WHERE "__time" >= TIMESTAMP'__MIN_TIME' 
      AND "__time" < TIMESTAMP'__MAX_TIME'
    GROUP BY 
        __time, 
        "entityUID"
    PARTITIONED BY 'P1M';
Problem: After running this query, the tags-like columns (tagSetA, tagSetB) are no longer in a multi-value format. This breaks downstream queries that rely on the multi-value nature of these columns.
My understanding: MSQ might not support preserving multi-value columns directly, especially when using functions like LATEST_BY.
Question: How can I run this kind of deduplication query while preserving the multi-value format of these columns? Is there a recommended approach or workaround in Druid to handle this scenario? Can someone help us with this problem, please?
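One hedged workaround (a sketch, not verified against this exact setup): since string aggregators tend to flatten multi-value values, carry the tags through the aggregation as a single delimited string and split it back into a multi-value column with STRING_TO_MV. This assumes the tag values never contain the chosen delimiter and that the LATEST aggregator's default max string size is large enough for the tag sets:
Copy code
-- Sketch for one tags-like column; repeat for "tagSetB".
-- ',' is an assumed delimiter that must not appear inside individual tag values.
STRING_TO_MV(
  LATEST_BY(MV_TO_STRING("tagSetA", ','), MILLIS_TO_TIMESTAMP("ingestionTimestamp")),
  ','
) AS "tagSetA"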
  • v

    Vaibhav

    06/18/2025, 7:16 PM
Hi all, I'm running into an issue with range-based partitioning (Druid 27.0) during compaction on one of our heaviest datasources and would appreciate input from the community.
Context:
• Datasource is ingested via Kafka indexing (stream ingestion).
• Daily volume: ~4 billion rows / ~110 GB uncompressed data.
• Ingested with HOUR granularity, resulting in ~5,000 segments per day.
• We run daily compaction with range partitioning on 5 dimensions.
• The compaction task uses 8 parallel subtasks with 4 GB heap each.
Issue:
• Compaction fails during the final segment merge phase.
• The first failure was a heap OOM, which was resolved by increasing the task heap from 3 GB to 4 GB.
• Now we are getting the following error:
    Copy code
    org.apache.druid.java.util.common.IAE: Asked to add buffers[2,454,942,764] larger than configured max[2,147,483,647]
    at org.apache.druid.java.util.common.io.smoosh.FileSmoosher.addWithSmooshedWriter(FileSmoosher.java:168)
• On investigation: compaction produces 430 partitions, but the 430th partition (with end=null) gets an unusually high number of rows (~800M+).
What I found:
• A GROUP BY on the 5 range dimensions for a sample day gives ~11.5k unique combinations:
    Copy code
e.g.:
SELECT range_dim1, range_dim2, range_dim3, range_dim4, range_dim5, COUNT(*) AS row_count
FROM <datasource>
WHERE __time <within a 1-day interval>
GROUP BY 1, 2, 3, 4, 5
ORDER BY 1, 2, 3, 4, 5
• However, partition 430 gets all combinations from ~9.5k to ~11.5k in one partition.
• This violates the targetRowsPerSegment: 5M and maxRowsPerSegment: 7.5M config.
Questions:
• Are there better strategies to ensure partitioning respects the row-count limits?
• Is this behavior a bug or expected?
Any advice or insights appreciated.
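A hedged way to check whether the oversized trailing partition comes from key skew: range partition boundaries fall between distinct values of the partition key, so rows that share an identical 5-dimension key all land in the same partition, and a heavily loaded key will blow past targetRowsPerSegment regardless of configuration. The datasource name and interval below are placeholders, matching the style of the example above:
Copy code
-- Heaviest 5-dimension keys for one day; if the top keys each hold tens or
-- hundreds of millions of rows, range partitioning cannot split them further.
SELECT range_dim1, range_dim2, range_dim3, range_dim4, range_dim5, COUNT(*) AS row_count
FROM <datasource>
WHERE __time >= TIMESTAMP '<day start>' AND __time < TIMESTAMP '<day end>'
GROUP BY 1, 2, 3, 4, 5
ORDER BY row_count DESC
LIMIT 20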