# troubleshooting
  • l

    Luke Foskey

    05/01/2025, 3:21 AM
    Hi team, I can't remember if this is typical behaviour, but we had a MiddleManager/Overlord failover occur during an ingestion. Afterwards the console lost the tasks, and the next leader showed a bunch of tasks as failed and ingestion appeared to stop. Is there a way to circumvent this behaviour?
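    If the Overlord is still on the default local task storage, in-flight task state lives only in memory and is lost when leadership moves; below is a minimal sketch of the commonly recommended setting, assuming default property names (verify against your version's docs):
    Copy code
    # common.runtime.properties (Overlord) -- hedged sketch
    # "local" keeps assigned-task state in memory and loses it on leader failover;
    # "metadata" persists it in the metadata store so a new leader can pick tasks back up.
    druid.indexer.storage.type=metadata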
    g
    • 2
    • 2
  • m

    Mahesha Subrahamanya

    05/04/2025, 5:47 AM
    Hello Team, I have a scenario where I'm joining two datasources with huge datasets: datasource 1 has about 80 million records and datasource 2 about 95 million. If one datasource joins another and the join pulls in almost all records, does it matter whether both datasources use ALL or DAY partitioning? Currently we create them with ALL, since we know that all data is being pulled for the join. Any input on this scenario? Thank you.
    j
    • 2
    • 8
  • m

    Mahesha Subrahamanya

    05/06/2025, 9:20 PM
    Hello Team, Druid version 31. I have a question about NULL column data: I'm trying to use DISTINCT in the SELECT clause. Why do we have this constraint, and how do we overcome it? Error: INVALID_INPUT (ADMIN) Query could not be planned. A possible reason is [SQL requires a group-by on a column with unknown type that is unsupported.] SELECT distinct col1, col2,col3, NULL as col4, NULL as col4 from temp1.
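    A commonly suggested workaround, sketched on the query from the message (the cast gives the planner a concrete type for the constant column; column names are as in the original):
    Copy code
    -- hedged sketch: give the NULL constant an explicit type so DISTINCT/GROUP BY can be planned
    SELECT DISTINCT col1, col2, col3, CAST(NULL AS VARCHAR) AS col4
    FROM temp1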
    g
    • 2
    • 5
  • l

    Lis Shimoni

    05/06/2025, 9:22 PM
    Hey Team, I'm working on an automatic druid-scaling process and need some guidance. Is there an API that allows me to stop requests to historical and broker nodes before removing them? For example, something like:
    Copy code
    curl -X POST "http://localhost:8081/druid/coordinator/v1/servers/10.120.122.122:8083/disable"
    I've tried this approach, but it didn't work. Any help would be appreciated!
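    For historicals, the documented mechanism that comes closest is the coordinator dynamic config's decommissioningNodes list, which tells the coordinator to move segments off the listed servers before you remove them; a hedged sketch (host:port is illustrative, and since this endpoint sets the whole dynamic-config object, the body may need to include your existing settings):
    Copy code
    # hedged sketch: mark a historical for decommissioning via coordinator dynamic config
    curl -X POST "http://localhost:8081/druid/coordinator/v1/config" \
      -H "Content-Type: application/json" \
      -d '{"decommissioningNodes": ["10.120.122.122:8083"]}'
    Brokers are usually drained at the load balancer rather than through a Druid API.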
    g
    • 2
    • 2
  • s

    Stefanos Pliakos

    05/07/2025, 7:01 AM
    Good morning. We are running our Druid clusters (version 29.0.1) on Ubuntu 24.04 on EC2 instances. I have noticed what I'm about to describe in earlier versions of Apache Druid and earlier versions of Ubuntu as well. We run our coordinator and overlord as one process (coordinator-as-overlord). I have noticed quite a few times that leader election goes awry: for example, we can end up with the b01 node as coordinator and c01 as overlord, which is not normal or intended, since the coordinator and overlord processes should be on the same node. As a result, ingestions fail, since we have Druid behind an ALB (Application Load Balancer). Why is this happening? I haven't noticed anything indicative in the logs. Is it a bug? It looks like it.
    g
    • 2
    • 12
  • u

    Udit Sharma

    05/07/2025, 9:35 AM
    Facing a weird issue where a few of the historical threads just get stuck and run forever at the place below. There are timeouts set on the queries, but this query has been running for 24 hours. Any idea what the potential issue could be, and what can be done to kill/cancel this query?
    Copy code
    "groupBy_JoinDataSource{left=event-trst-36745eaf-8215-4dd2-baed-a6bc123e3b02, right=InlineDataSource{signature={d0:LONG}}, rightPrefix='__j0.', condition=("long42" == "__j0.d0"), joinType=INNER, leftFilter=null}_[2025-04-04T00:00:00.000Z/2025-04-05T00:00:00.000Z]_54f468cf-e44b-4835-895a-e3f48cec0db4" #205 daemon prio=5 os_prio=0 cpu=67305835.68ms elapsed=763356.58s tid=0x00007b2f00492800 nid=0x10f runnable  [0x00007b2c06af4000]
       java.lang.Thread.State: RUNNABLE
    	at org.apache.druid.segment.join.HashJoinEngine$1JoinCursor.matchCurrentPosition(HashJoinEngine.java:193)
    	at org.apache.druid.segment.join.HashJoinEngine$1JoinCursor.advanceUninterruptibly(HashJoinEngine.java:223)
    	at org.apache.druid.segment.join.HashJoinEngine$1JoinCursor.advance(HashJoinEngine.java:180)
    	at org.apache.druid.segment.join.HashJoinEngine$1JoinCursor.initialize(HashJoinEngine.java:157)
    	at org.apache.druid.segment.join.HashJoinEngine.makeJoinCursor(HashJoinEngine.java:254)
    	at org.apache.druid.segment.join.HashJoinSegmentStorageAdapter.lambda$makeCursors$1(HashJoinSegmentStorageAdapter.java:339)
    	at org.apache.druid.segment.join.HashJoinSegmentStorageAdapter$$Lambda$1184/0x00000008009bf440.apply(Unknown Source)
    	at org.apache.druid.java.util.common.guava.Sequences$$Lambda$653/0x0000000800aa7040.apply(Unknown Source)
    	at org.apache.druid.java.util.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:40)
    	at org.apache.druid.java.util.common.guava.FilteringAccumulator.accumulate(FilteringAccumulator.java:41)
    	at org.apache.druid.java.util.common.guava.MappingAccumulator.accumulate(MappingAccumulator.java:40)
    	at org.apache.druid.java.util.common.guava.BaseSequence.accumulate(BaseSequence.java:44)
    	at org.apache.druid.java.util.common.guava.MappedSequence.accumulate(MappedSequence.java:43)
    	at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
    	at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
    	at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
    	at org.apache.druid.java.util.common.guava.FilteredSequence.accumulate(FilteredSequence.java:45)
    	at org.apache.druid.java.util.common.guava.MappedSequence.accumulate(MappedSequence.java:43)
    	at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
    	at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
    	at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
    	at org.apache.druid.java.util.common.guava.MappedSequence.accumulate(MappedSequence.java:43)
    	at org.apache.druid.java.util.common.guava.ConcatSequence.accumulate(ConcatSequence.java:42)
    	at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
    	at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
    	at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
    	at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
    	at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
    	at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
    	at org.apache.druid.java.util.common.guava.LazySequence.accumulate(LazySequence.java:40)
    	at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
    	at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
    	at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
    	at org.apache.druid.java.util.common.guava.LazySequence.accumulate(LazySequence.java:40)
    	at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
    	at org.apache.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55)
    	at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
    	at org.apache.druid.query.spec.SpecificSegmentQueryRunner$1.accumulate(SpecificSegmentQueryRunner.java:103)
    	at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
    	at org.apache.druid.query.spec.SpecificSegmentQueryRunner.doNamed(SpecificSegmentQueryRunner.java:190)
    	at org.apache.druid.query.spec.SpecificSegmentQueryRunner.access$100(SpecificSegmentQueryRunner.java:45)
    	at org.apache.druid.query.spec.SpecificSegmentQueryRunner$2.wrap(SpecificSegmentQueryRunner.java:170)
    	at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
    	at org.apache.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50)
    	at org.apache.druid.query.CPUTimeMetricQueryRunner$1.wrap(CPUTimeMetricQueryRunner.java:110)
    	at org.apache.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45)
    	at org.apache.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1$1$1.call(GroupByMergingQueryRunnerV2.java:252)
    	at org.apache.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1$1$1.call(GroupByMergingQueryRunnerV2.java:239)
    	at java.util.concurrent.FutureTask.run(java.base@11.0.7/FutureTask.java:264)
    	at org.apache.druid.query.PrioritizedListenableFutureTask.run(PrioritizedExecutorService.java:251)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.7/ThreadPoolExecutor.java:1128)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.7/ThreadPoolExecutor.java:628)
    	at java.lang.Thread.run(java.base@11.0.7/Thread.java:834)
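    If the query ID is known (the trailing UUID in the processing-thread name is typically the query ID), the documented cancellation endpoint can at least be tried; a hedged sketch, though whether the historical thread actually unwinds depends on where it is stuck, and the trace above is inside an uninterruptible join advance:
    Copy code
    # hedged sketch: cancel a native query by its queryId via the broker (default port 8082)
    curl -X DELETE "http://<broker-host>:8082/druid/v2/<queryId>"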
    g
    • 2
    • 4
  • m

    Mahesha Subrahamanya

    05/08/2025, 2:25 AM
    Hello Team, in Druid 31 the query below is not supported; is there any plan to support it in a later Druid version? SELECT col1, col2 FROM temp LEFT JOIN temp2 ON (col1 = col2 OR col2 = col3); Error: INVALID_INPUT (ADMIN) Query could not be planned. A possible reason is [SQL requires a join with 'OR' condition that is not supported.]
    g
    • 2
    • 2
  • s

    suhas panchbhai

    05/08/2025, 8:16 AM
    Hi Team, on Druid 27.0 my compactions are running really slow. It is expected to be slow since this is the first run for a 17 TB datasource, and I have already increased the task slots for compaction. Is there any way to expedite this process?
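    A hedged sketch of the knobs that usually matter for first-time auto-compaction throughput (values are illustrative, the datasource name is an assumption, and posting the dynamic config replaces the whole object, so include your existing settings):
    Copy code
    # 1) let auto-compaction use more of the cluster's task slots (coordinator dynamic config)
    curl -X POST "http://<coordinator>:8081/druid/coordinator/v1/config" \
      -H "Content-Type: application/json" \
      -d '{"compactionTaskSlotRatio": 0.3, "maxCompactionTaskSlots": 20}'

    # 2) let each compaction task parallelize its subtasks (per-datasource compaction config)
    curl -X POST "http://<coordinator>:8081/druid/coordinator/v1/config/compaction" \
      -H "Content-Type: application/json" \
      -d '{
            "dataSource": "my_17tb_datasource",
            "tuningConfig": {
              "maxNumConcurrentSubTasks": 4,
              "partitionsSpec": {"type": "dynamic"}
            }
          }'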
    k
    • 2
    • 6
  • m

    mehrdadbn9

    05/10/2025, 7:33 PM
    Hi everyone, I have a problem with kill tasks: data still exists in deep storage. I have changed the retention rules, but nothing changed (besides the default rule, we have a drop rule for data older than 3 months). What should I do? Our version is 31.0.1.
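    Worth noting as context: drop/retention rules only unload segments and mark them unused; removing data from deep storage requires a kill task (or the coordinator's auto-kill). A hedged sketch, with the datasource name and interval as placeholders, and assuming the segments in that interval are already unused:
    Copy code
    # hedged sketch: kill unused segments (removes them from metadata and deep storage)
    curl -X POST "http://<overlord>:8081/druid/indexer/v1/task" \
      -H "Content-Type: application/json" \
      -d '{
            "type": "kill",
            "dataSource": "my_datasource",
            "interval": "2000-01-01/2025-02-01"
          }'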
    👍 1
    j
    • 2
    • 7
  • m

    Mahesha Subrahamanya

    05/12/2025, 9:50 PM
    Hello Team, we have a production environment where multi-cluster deployments use Druid services. Currently we have a single global user; secrets are created in the production environments, secured by our TechOps team, and accessed by that global user. Is there any strategy we can implement with multiple users, like read, write, and admin users, etc.? We use EKS for the Druid services. Has anybody implemented this, or can you suggest a strategy? Please let me know. Thank you.
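    If the druid-basic-security extension is in play, per-user roles (read/write/admin) can be managed through its coordinator APIs; a hedged sketch in which the authenticator/authorizer names (MyBasicMetadataAuthenticator, MyBasicMetadataAuthorizer) and the user/role names are assumptions:
    Copy code
    # create a user and set its password (authenticator side)
    curl -u admin:<password> -X POST "http://<coordinator>:8081/druid-ext/basic-security/authentication/db/MyBasicMetadataAuthenticator/users/readonly_user"
    curl -u admin:<password> -X POST -H "Content-Type: application/json" \
      -d '{"password": "changeme"}' \
      "http://<coordinator>:8081/druid-ext/basic-security/authentication/db/MyBasicMetadataAuthenticator/users/readonly_user/credentials"

    # mirror the user on the authorizer side, create a role, grant permissions, assign the role
    curl -u admin:<password> -X POST "http://<coordinator>:8081/druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/users/readonly_user"
    curl -u admin:<password> -X POST "http://<coordinator>:8081/druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/roles/read_only"
    curl -u admin:<password> -X POST -H "Content-Type: application/json" \
      -d '[{"resource": {"name": ".*", "type": "DATASOURCE"}, "action": "READ"}]' \
      "http://<coordinator>:8081/druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/roles/read_only/permissions"
    curl -u admin:<password> -X POST "http://<coordinator>:8081/druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/users/readonly_user/roles/read_only"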
    g
    • 2
    • 6
  • s

    Soman Ullah

    05/13/2025, 11:23 PM
    How can queries that get blocked at the broker layer be fixed? In my cluster, during high-QPS scenarios, queries take longer at the broker while the historicals respond quickly: query/node/time accounts for the majority of the time, while query/node/ttfb is fast.
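    A hedged sketch of the broker-side settings that commonly govern this situation (values are illustrative; the right numbers depend on historical count and QPS):
    Copy code
    # broker runtime.properties -- hedged sketch, values illustrative
    druid.broker.http.numConnections=50        # connections per historical/task for fan-out
    druid.server.http.numThreads=60            # HTTP threads accepting client queries
    druid.processing.numMergeBuffers=8         # concurrent groupBy merges at the broker
    druid.processing.buffer.sizeBytes=500000000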
    j
    • 2
    • 2
  • k

    Konstantin Minevskiy

    05/14/2025, 5:15 PM
    Hello! We’re having issues adding a loading lookup. A global cached lookup (JDBC, specifically) works fine, but it’s becoming a problem due to the size of the data being retrieved from the DB. We’re on the latest Druid (33.0.0). More info in the 🧵
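    For reference, a hedged sketch of roughly what a loadingLookup spec (druid-lookups-cached-single extension) looks like; the connection details, table/column names, and cache sizes are all assumptions, so check them against the extension docs for 33.0.0:
    Copy code
    {
      "type": "loadingLookup",
      "dataFetcher": {
        "type": "jdbcDataFetcher",
        "connectorConfig": {
          "connectURI": "jdbc:postgresql://db-host:5432/lookups",
          "user": "druid",
          "password": "changeme"
        },
        "table": "lookup_table",
        "keyColumn": "the_key",
        "valueColumn": "the_value"
      },
      "loadingCacheSpec": {"type": "guava", "maximumSize": 100000, "expireAfterAccess": 3600000},
      "reverseLoadingCacheSpec": {"type": "guava", "maximumSize": 100000, "expireAfterAccess": 3600000}
    }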
    g
    a
    • 3
    • 25
  • j

    Jvalant Patel

    05/14/2025, 10:42 PM
    We are moving from the legacy way of handling null to the latest Druid version, where legacy mode is not supported. I just wanted to get some help here on the best strategy to upgrade Druid when we have both null and "" strings in the datasources and our queries rely on the legacy behavior. If we want to rewrite queries to handle three-valued logic for null comparisons, what should the strategy be? Is there any generalized way to modify the queries? We are still using the native Druid query language.
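    A hedged SQL sketch of the usual three-valued-logic rewrites (the datasource and column names are illustrative; the same reasoning applies when the filters are expressed natively):
    Copy code
    -- Under SQL-compliant null handling, NULL never matches an equality or inequality,
    -- so filters that relied on legacy ''/NULL equivalence need to be explicit.

    -- legacy behavior: col != 'x' also matched rows where col was NULL/''; now it does not
    SELECT COUNT(*) FROM my_datasource WHERE col <> 'x' OR col IS NULL

    -- treat empty string and NULL as equivalent at query time
    SELECT COUNT(*) FROM my_datasource WHERE COALESCE(col, '') = ''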
    g
    • 2
    • 1
  • r

    Rohen

    05/19/2025, 1:14 PM
    What's the reason for missing segments? Sometimes the datasource is fully available and sometimes some percentage of segments is missing.
    b
    • 2
    • 1
  • r

    Rohen

    05/19/2025, 1:14 PM
    We're using AWS Kafka with DRUID.
  • u

    Udit Sharma

    05/19/2025, 1:43 PM
    Hi, I am facing a weird issue where I have a table with two columns, customer and custId, that both contain the same values. But for some reason this query returns customers that are not present in the IN filter.
    Copy code
    select distinct customer from events where __time BETWEEN TIMESTAMP '2025-03-20 12:30:00' 
        AND TIMESTAMP '2025-05-19 13:00:00' AND 
    customer IN (
              '2140', '1060', '2207', '1809', '2985', 
              '3026', '2947', '2955', '2367', '2464', 
              '899', '355', '3284', '3302', '1034', 
              '3015', '2127', '2123', '2731', '2109', 
              '2832', '2479', '2702', '2387', '1804', 
              '1018', '1364', '3467', '1028', '850'
            )
    While this seems to return the right results.
    Copy code
    select distinct custId from events where __time BETWEEN TIMESTAMP '2025-03-20 12:30:00' 
        AND TIMESTAMP '2025-05-19 13:00:00' AND 
    custId IN (
              '2140', '1060', '2207', '1809', '2985', 
              '3026', '2947', '2955', '2367', '2464', 
              '899', '355', '3284', '3302', '1034', 
              '3015', '2127', '2123', '2731', '2109', 
              '2832', '2479', '2702', '2387', '1804', 
              '1018', '1364', '3467', '1028', '850'
            )
    Druid Version : 26.0.0
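    One thing worth checking, sketched below: whether the two columns were ingested with different types (a numeric customer column compared against string literals can behave differently from a string custId column). This is only a diagnostic query, not a confirmed diagnosis:
    Copy code
    SELECT COLUMN_NAME, DATA_TYPE
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'events' AND COLUMN_NAME IN ('customer', 'custId')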
    j
    b
    • 3
    • 5
  • j

    JRob

    05/22/2025, 5:53 PM
    I'm following the documentation to set up Protobuf parsing here, but I get the following error:
    Copy code
    Cannot construct instance of `org.apache.druid.data.input.protobuf.FileBasedProtobufBytesDecoder`, problem: Cannot read descriptor file: file:/tmp/metrics.desc at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 1090] (through reference chain: org.apache.druid.indexing.kafka.KafkaSamplerSpec["spec"]->org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorSpec["ioConfig"]->org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorIOConfig["inputFormat"]->org.apache.druid.data.input.protobuf.ProtobufInputFormat["protoBytesDecoder"])
    I suspect that Druid is trying to download the file over HTTP, but we would never expose /tmp to the internet. Why doesn't it just grab the file locally? For example, this works:
    Copy code
    {
      "type": "index_parallel",
      "spec": {
        "ioConfig": {
          "type": "index_parallel",
          "inputSource": {
            "type": "local",
            "baseDir": "/tmp/",
            "filter": "metrics.desc"
          }
        },
        "tuningConfig": {
          "type": "index_parallel"
        }
      }
    }
    However, I can't get this working with
    inputFormat
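    For reference, a hedged sketch of the inputFormat shape from the Protobuf extension docs, under the assumption that the descriptor file exists at that path on every service that parses the data (including the peon and whatever runs the sampler); the protoMessageType value is an assumption:
    Copy code
    "inputFormat": {
      "type": "protobuf",
      "protoBytesDecoder": {
        "type": "file",
        "descriptor": "file:///tmp/metrics.desc",
        "protoMessageType": "Metrics"
      }
    }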
    g
    • 2
    • 2
  • u

    Utkarsh Chaturvedi

    05/23/2025, 10:13 AM
    Hi folks. Our team is routinely facing 504s when submitting ingestion tasks. Our cluster is set up on k8s using Helm. What we're observing is that the task actually gets registered in Druid, but the response is delayed beyond the nginx/Cloudflare timeout. So when we re-trigger the ingestion, it fails because of overlapping segment locks. Is there any way to resolve the main issue of the task not responding with the registered task ID in time? We can increase the timeouts, but we'd prefer to tackle the root problem.
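    One hedged idea while the timeout itself is being tackled: batch task specs accept a client-chosen top-level "id", so a retried POST of the same spec is rejected as a duplicate rather than creating a second, overlapping task, and the outcome can be polled separately instead of relying on the original HTTP response. The task id and host below are illustrative:
    Copy code
    # poll the task outcome by the id you supplied in the task spec
    curl "http://<overlord>:8081/druid/indexer/v1/task/my_ingest_2025_05_23/status"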
  • b

    Brindha Ramasamy

    05/23/2025, 6:30 PM
    Hi, we are not configuring connection pool details explicitly in common.runtime.properties (Druid 30.0). What are the default values, and where can I find that config?
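    For the explicit form (as opposed to the defaults), the metadata connector exposes Apache Commons DBCP2 pool properties under a prefix; a hedged sketch with illustrative values:
    Copy code
    # common.runtime.properties -- hedged sketch, values are illustrative
    druid.metadata.storage.connector.dbcp.maxConnLifetimeMillis=1200000
    druid.metadata.storage.connector.dbcp.defaultQueryTimeout=30000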
    b
    j
    • 3
    • 3
  • r

    Rohen

    05/26/2025, 1:44 PM
    Hi, while setting up Druid on EKS using Helm, we want to use authentication, using the druid-basic-security extension. As per the documentation, the following is what's given, but it is not accepted by the pods:
    druid.auth.authenticatorChain: '["MyBasicMetadataAuthenticator"]'
    druid.auth.authenticator.MyBasicMetadataAuthenticator.type: "basic"
    druid.auth.authorizers: '["MyBasicMetadataAuthorizer"]'
    druid.auth.authorizer.MyBasicMetadataAuthorizer.type: "basic"
    druid.escalator.type: "basic"
    druid.escalator.internalClientUsername: "druid_system"
    druid.escalator.internalClientPassword: "your_internal_password"
    druid.escalator.authorizerName: "MyBasicMetadataAuthorizer"
    Is there any specific format we need to maintain? Ref - https://github.com/asdf2014/druid-helm
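    One thing that sometimes trips this up with Docker-image-based charts: the apache/druid image can read configuration from druid_-prefixed environment variables and rewrite them into runtime properties, so the settings may need to be expressed in that form; a hedged sketch (how this chart injects env vars is an assumption, not taken from it):
    Copy code
    # hedged sketch: druid_-prefixed env vars, as consumed by the apache/druid image
    druid_auth_authenticatorChain: '["MyBasicMetadataAuthenticator"]'
    druid_auth_authenticator_MyBasicMetadataAuthenticator_type: basic
    druid_auth_authorizers: '["MyBasicMetadataAuthorizer"]'
    druid_auth_authorizer_MyBasicMetadataAuthorizer_type: basic
    druid_escalator_type: basic
    druid_escalator_internalClientUsername: druid_system
    druid_escalator_internalClientPassword: your_internal_password
    druid_escalator_authorizerName: MyBasicMetadataAuthorizer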
    b
    s
    • 3
    • 6
  • j

    JRob

    05/28/2025, 9:14 PM
    Has anyone else been able to enable the TaskCountStatsMonitor? I get errors on Middle Manager startup:
    Copy code
    1) No implementation for org.apache.druid.server.metrics.TaskCountStatsProvider was bound.
      while locating org.apache.druid.server.metrics.TaskCountStatsProvider
        for the 1st parameter of org.apache.druid.server.metrics.TaskCountStatsMonitor.<init>(TaskCountStatsMonitor.java:40)
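    For context, TaskCountStatsProvider is bound on the Overlord, so this monitor generally only starts cleanly there; a hedged sketch of scoping it per service instead of putting it in common.runtime.properties:
    Copy code
    # overlord/runtime.properties -- hedged sketch
    druid.monitoring.monitors=["org.apache.druid.server.metrics.TaskCountStatsMonitor"]

    # middleManager/runtime.properties -- a set without TaskCountStatsMonitor
    druid.monitoring.monitors=["org.apache.druid.java.util.metrics.JvmMonitor"]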
    m
    i
    • 3
    • 9
  • h

    Hardik Bajaj

    05/29/2025, 7:12 PM
    Hey Team! I noticed the Druid query-failed-count metric only shows 500 errors and not 401s when the request comes from an unauthorized user (basic security). I also couldn't find any metric that tells which username a request came from. Does anyone know of any metric, or observability in the logs, available for this? It makes it difficult to know for sure that no one is using a user's creds when we want to delete that user. TIA!
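    A hedged avenue for attribution (not a confirmed answer): enabling request logging, which records each query served, and then checking whether those log entries carry enough on your version to tie requests back to a basic-auth user; property names below are the documented ones, the directory is illustrative:
    Copy code
    # common.runtime.properties -- hedged sketch for file-based request logging
    druid.request.logging.type=file
    druid.request.logging.dir=/var/log/druid/requests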
    g
    • 2
    • 1
  • s

    Seki Inoue

    06/02/2025, 4:47 PM
    Hello, I have a datasource with a very long name, and it causes the following error when spawning the kill task. Once it happens, the entire cluster gets unstable for around 30 minutes and no new tasks are allocated, even though the MiddleManagers have free slots. Indeed, the file being opened,
    .coordinator-issued_kil...
    was 265 bytes long, which exceeds the XFS filename limit of 255 bytes. Do you know of any workaround to forcibly kill those segments?
    Copy code
    2025-05-30T22:10:42,465 ERROR [qtp214761486-125] org.apache.druid.indexing.worker.WorkerTaskManager - Error while trying to persist assigned task[coordinator-issued_kill_<deducted_long_datasource_name_119_bytes>]
    java.nio.file.FileSystemException: var/tmp/persistent/task/workerTaskManagerTmp/.coordinator-issued_kill_<deducted_long_datasource_name_119_bytes>_dfhlgdae_2024-07-10T23:00:00.000Z_2024-07-18T00:00:00.000Z_2025-05-30T22:10:42.417Z.2aababbd-02a6-4002-9b9f-cba30bbea8a7: File name too long
    	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100) ~[?:?]
    	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
    	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
    	at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181) ~[?:?]
    	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]
    	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]
    	at org.apache.druid.java.util.common.FileUtils.writeAtomically(FileUtils.java:271) ~[druid-processing-33.0.0.jar:33.0.0]
    ...
    a
    j
    • 3
    • 5
  • a

    Asit

    06/03/2025, 4:25 AM
    Hello, I am not able to query the segments table from the metadata store, and the Segments tab in the Druid console is timing out. Is there any way I can get the segment count, or increase memory to retrieve the segment information?
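    If the console's Segments view times out, the SQL system tables can sometimes still answer the count question; a hedged sketch:
    Copy code
    -- per-datasource segment counts from the broker's system tables
    -- (a plain SELECT COUNT(*) FROM sys.segments gives the overall total)
    SELECT "datasource", COUNT(*) AS segment_count
    FROM sys.segments
    GROUP BY "datasource"
    ORDER BY segment_count DESC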
    g
    • 2
    • 2
  • j

    Jon Laberge

    06/04/2025, 5:35 AM
    👋, I am trying to get a cluster up and running on GKE; my requirements are to use Cloud Storage for deep storage/logs and Postgres/Cloud SQL for metadata. I'm using the druid-operator to deploy my cluster with a ZK-less deployment. This partially works, but I face the problem described in this issue. The most recent suggestion is to use kubernetes-overlord-extensions; however, I see this error when the overlord is trying to start:
    Copy code
    Caused by: org.apache.commons.lang3.NotImplementedException: this druid.indexer.logs.type [class org.apache.druid.storage.google.GoogleTaskLogs] does not support managing task payloads yet. You will have to switch to using environment variables
    Is there something I should be changing in my task template?
    k
    • 2
    • 7
  • j

    Jimbo Slice

    06/06/2025, 9:55 PM
    Got a really strange issue, guys; the following query returns an error: "Query results were truncated midstream. This may indicate a server-side error or a client-side issue. Try re-running your query using a lower limit."
    SELECT
    COUNT(*) As Entries,
    SUM(packets) as Packets,
    SUM(bytes) as Bytes,
    (SUM(bytes) / SUM(packets)) as AvgPacketSizeBytes,
    MIN(__time) as FirstSeen,
    MAX(__time) as LastSeen,
    TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time)) as DurationSeconds,
    (SUM(bytes) * 8 / TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time))) as AvgMbps,
    "pkt-srcaddr", "pkt-dstaddr", "protocol"
    FROM "AWSLogsVPC"
    WHERE "log-status"!='NODATA' AND "pkt-srcaddr"!='-' AND "action"='ACCEPT'
    GROUP BY "pkt-srcaddr", "pkt-dstaddr", "protocol"
    But when I remove the TIMESTAMPDIFF section from AvgMbps, this does not happen:
    SELECT
    COUNT(*) As Entries,
    SUM(packets) as Packets,
    SUM(bytes) as Bytes,
    (SUM(bytes) / SUM(packets)) as AvgPacketSizeBytes,
    MIN(__time) as FirstSeen,
    MAX(__time) as LastSeen,
    TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time)) as DurationSeconds,
    (SUM(bytes) * 8) as AvgMbps,
    "pkt-srcaddr", "pkt-dstaddr", "protocol"
    FROM "AWSLogsVPC"
    WHERE "log-status"!='NODATA' AND "pkt-srcaddr"!='-' AND "action"='ACCEPT'
    GROUP BY "pkt-srcaddr", "pkt-dstaddr", "protocol"
    I've tried removing the "WHERE" because != is bad practice; no difference. I believe there is an issue here with subquerying (druid.server.http.maxSubqueryRows), however this is not a subquery; it is a simple calculation in a simple query. This query runs perfectly without TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time)) being called in AvgMbps. Any ideas on what could be wrong?
  • b

    Ben Krug

    06/06/2025, 10:07 PM
    I don't know whether it has to do with datasizes or timings, but I wonder whether the division is a problem somehow? Is DurationSeconds ever 0 or null? Just curious...
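    If the division is indeed the culprit, a hedged rewrite of the AvgMbps expression that avoids dividing by a zero duration (only that expression changes; everything else is as in the original query):
    Copy code
    SELECT
      "pkt-srcaddr", "pkt-dstaddr", "protocol",
      (SUM(bytes) * 8.0 / NULLIF(TIMESTAMPDIFF(SECOND, MIN(__time), MAX(__time)), 0)) AS AvgMbps
    FROM "AWSLogsVPC"
    WHERE "log-status" != 'NODATA' AND "pkt-srcaddr" != '-' AND "action" = 'ACCEPT'
    GROUP BY "pkt-srcaddr", "pkt-dstaddr", "protocol"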
    j
    • 2
    • 8
  • v

    venkat

    06/07/2025, 8:50 AM
    👋 Hello, team!
  • v

    venkat

    06/07/2025, 8:54 AM
    I have a small doubt: we have 6 historical services, but one of them is not visible in the Druid console. When I check with ps -ef | grep historical, it shows as running, so why am I unable to see it in the Druid console? Any idea?
  • s

    sandy k

    06/09/2025, 4:43 AM
    Using the coordinator API http://x.x.x.x:8081/druid/coordinator/v1/servers?full, I get segments per data node; all servers show segments except one, which shows a very low, single-digit segment count. But when I restart that specific data node, it starts up showing a much higher loading count. For example, server1: 2025-06-09T02:52:35,172 INFO [main] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment cache file [1/33036][/data01/druid/segment vs. server2: 2025-06-08T18:22:52,245 INFO [main] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment cache file [1/14294]. It shows up in the UI with the service running. This problematic data node disconnects and also has ZooKeeper connectivity issues compared to the other nodes. Is this node not part of the cluster?