# troubleshooting
  • s

    Sachin G

    09/25/2025, 4:31 AM
    #!/bin/bash
    # First find the Imply init script 
    RUNDRUID=$(find /opt/grove -name run-druid | grep -v dist)
    # and add the desired environment variable before starting the Imply processes
    sed -i '/^exec.*/i export\ KAFKA_JAAS_CONFIG="org.apache.kafka.common.security.plain.PlainLoginModule  required username='\'123434\'' password='\'123\+abcdee\'';"' ${RUNDRUID}
  • s

    Sachin G

    09/25/2025, 4:32 AM
    When I use these hard-coded credentials in the user-init script, I am able to use KAFKA_JAAS_CONFIG as a variable in my Kafka ingestion job.
  • s

    Sachin G

    09/25/2025, 4:33 AM
    But instead of hard-coded credentials I want to use env variables, something like below (I tried various scripts but no luck so far; this is just one example).
  • s

    Sachin G

    09/25/2025, 4:33 AM
    #!/bin/bash
    
    RUNDRUID=$(find /opt/grove -name run-druid | grep -v dist | head -n 1)
    if [ -z "$RUNDRUID" ]; then
      echo "run-druid script not found."
      exit 1
    fi
    
    
    sed -i "/^exec.*/i export KAFKA_JAAS_CONFIG=\"org.apache.kafka.common.security.plain.PlainLoginModule required username='${USERNAME}' password='${PASSWORD}';\"" "$RUNDRUID"
  • s

    Sachin G

    09/25/2025, 4:33 AM
    Note: Druid is running on Kubernetes (EKS)
  • s

    Sachin G

    09/25/2025, 4:33 AM
    A sample snippet of the Kafka job with this variable is below:
  • s

    Sachin G

    09/25/2025, 4:35 AM
    image.png
  • s

    Sachin G

    09/25/2025, 4:36 AM
    I have defined these variables (USERNAME and PASSWORD) in the Imply Manager pod and also in the Druid pods, and restarted the cluster.
  • s

    Sachin G

    09/25/2025, 4:36 AM
    But I get an empty value:
  • s

    Sachin G

    09/25/2025, 4:36 AM
    sudo -u grove cat /proc/71938/environ | tr '\0' '\n' | grep KAFKA_JAAS_CONFIG
    KAFKA_JAAS_CONFIG=org.apache.kafka.common.security.plain.PlainLoginModule required username='' password='';
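    A possible cause, judging only from the snippets above, is that ${USERNAME} and ${PASSWORD} are expanded by the user-init script itself (where they may be unset), rather than by run-druid when the Druid processes start. A minimal sketch of the alternative, assuming USERNAME and PASSWORD are exported in the Druid pods' environment, is to inject the literal text ${USERNAME}/${PASSWORD} and let run-druid expand it:
    #!/bin/bash
    # Sketch only: keep the $ references literal in the injected line so they are
    # resolved by run-druid at process start, not by this init script.
    RUNDRUID=$(find /opt/grove -name run-druid | grep -v dist | head -n 1)
    if [ -z "$RUNDRUID" ]; then
      echo "run-druid script not found." >&2
      exit 1
    fi
    # Build the line in a shell variable to avoid nested-quote escaping inside sed.
    JAAS_LINE='export KAFKA_JAAS_CONFIG="org.apache.kafka.common.security.plain.PlainLoginModule required username='\''${USERNAME}'\'' password='\''${PASSWORD}'\'';"'
    sed -i "/^exec.*/i ${JAAS_LINE}" "$RUNDRUID"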
  • d

    Danny Wilkins

    09/25/2025, 5:27 PM
    Hey y'all, potentially silly question. I've been trying to play with the task autoscaler, but when I look at the supervisor payload in the Druid console I don't see it anywhere. Am I supposed to be seeing it in there? I'm also not seeing it scale tasks based on lag, but if I can just verify that the config exists in the console, that'll be an easy first step to know I might've fixed it.
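    (One way to check, offered only as a sketch with a placeholder host and supervisor ID: the Overlord API returns the running supervisor spec, and the autoscaler settings, if they were accepted, should show up under the ioConfig.)
    # Fetch the active supervisor spec and print the autoscaler section, if present;
    # the exact JSON path can vary slightly between Druid versions.
    curl -s "http://OVERLORD_HOST:8090/druid/indexer/v1/supervisor/MY_SUPERVISOR_ID" | jq '.spec.ioConfig.autoScalerConfig'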
  • d

    Danny Wilkins

    09/25/2025, 5:27 PM
    Rather than artificially inducing lag.
  • t

    Taoufiq Bahalla

    09/26/2025, 11:40 AM
    Hello all, I’m new to Druid and have a question about Kafka ingestion. We’re trying to set up Kafka ingestion in Druid so that all fields from both the Kafka key and value are included in the datasource. Right now, only the first field of the key is being ingested. I found this note in the documentation:
    “The input format to parse the Kafka key only processes the first entry of the inputFormat field. If your key values are simple strings, you can use the tsv format to parse them. Note that for tsv, csv, and regex formats, you need to provide a columns array to make a valid input format. Only the first one is used, and its name will be ignored in favor of keyColumnName.”
    Did I miss something? Is there a way to ingest *all Kafka key fields as columns in the Druid datasource*—without copying them into the value? Thanks in advance!
  • r

    Richard Vernon

    09/30/2025, 1:25 PM
    Hello guys, we've had our Druid cluster up and running without downtime for years. It's long overdue an upgrade and reconfiguration in terms of data storage/query efficiency, however. As it stands there are just over 2.3B rows of event data, and I must say it's been handling the interactive analytics demands exceptionally well. But I would like to improve the storage/querying efficiency by segmenting with secondary partitioning on our tenant ID/account_number column. We have data flowing in via a Kinesis Data Stream, and I remember that in older versions of Druid (0.16) the approach would be to run an index_parallel task regularly with secondary partitioning. With Druid 34.0.0, however, I believe this can be done using auto-compaction? I tried setting up hashed partitioning on the account_number column, but it seems to be skipping every segment for some reason. Just for debugging I set skipOffsetFromLatest to 10M, and even with new segments being written out every 1000 rows, it still seems to skip them. The logs don't seem to indicate why exactly the segments aren't compactible:
    ./coordinator-overlord.log:6527:2025-09-30T13:01:14,507 WARN [Coordinator-Exec-IndexingServiceDuties-0] org.apache.druid.server.compaction.DataSourceCompactibleSegmentIterator - Skipping compaction for datasource[r3-event-stream] as it has no compactible segments.
    curl -s http://localhost:8081/druid/coordinator/v1/compaction/status?dataSource=r3-event-stream | jq .
    {
      "latestStatus": [
        {
          "dataSource": "r3-event-stream",
          "scheduleStatus": "RUNNING",
          "message": null,
          "bytesAwaitingCompaction": 0,
          "bytesCompacted": 0,
          "bytesSkipped": 5587645,
          "segmentCountAwaitingCompaction": 0,
          "segmentCountCompacted": 0,
          "segmentCountSkipped": 46,
          "intervalCountAwaitingCompaction": 0,
          "intervalCountCompacted": 0,
          "intervalCountSkipped": 1
        }
      ]
    }
    Hoping it's something simple I'm missing, thanks!
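    (A debugging sketch only, using the same host and datasource as above: it can help to compare the status output with the auto-compaction config the Coordinator actually has stored, to confirm the partitionsSpec and skipOffsetFromLatest in effect.)
    # Dump the stored auto-compaction config for the datasource.
    curl -s "http://localhost:8081/druid/coordinator/v1/config/compaction/r3-event-stream" | jq .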
  • j

    JRob

    10/01/2025, 1:39 PM
    Besides HiLo query laning, is there any other way to prevent really expensive queries from impacting the cluster?
  • s

    Satya Kuppam

    10/02/2025, 1:39 PM
    Hello folks, I am having trouble optimising Dart queries on the latest 34.0.0 version:
    • I have a query with a single JOIN and I keep running into "Not enough memory" issues (see 🧵 for the query, datasource and task run detail).
    • The query fails in the sortMergeJoin phase. We have two Historical pods with 64 vCPU and 512 GiB of memory with -Xmx=107g.
    • From the Dart documentation it's not clear how I can capacity plan for this query, or whether it's possible to run this query successfully at all.
      ◦ Does Dart spill to disk in the join phase? Would that potentially be the problem here?
  • j

    Jvalant Patel

    10/03/2025, 12:58 AM
    Hi, is there an easy way to add the org.apache.druid.server.metrics.QueryCountStatsMonitor monitor for Peon ingestion processes running on MiddleManager nodes?
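    (One approach, as a hedged sketch rather than verified advice for this setup: Peons inherit properties passed through the MiddleManager's druid.indexer.fork.property.* prefix, so the monitor can be added to the MiddleManager runtime.properties. The file path below is the default distribution layout and may differ in your deployment.)
    # Append a fork property so spawned Peon processes load the monitor,
    # then restart the MiddleManagers.
    echo 'druid.indexer.fork.property.druid.monitoring.monitors=["org.apache.druid.server.metrics.QueryCountStatsMonitor"]' >> /opt/druid/conf/druid/cluster/data/middleManager/runtime.properties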
  • s

    Soman Ullah

    10/03/2025, 5:23 PM
    Is there a way to overwrite the last day of data using MSQ semantics? I tried this:
    REPLACE INTO "test-ds" OVERWRITE WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
    but it gave the following error:
    Invalid OVERWRITE WHERE clause [`__time` >= CURRENT_TIMESTAMP - INTERVAL '1' DAY]: Cannot get a timestamp from sql expression [CURRENT_TIMESTAMP - INTERVAL '1' DAY]
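    (In case it's useful, and only as a sketch with placeholder dates: since the error says the bound can't be reduced to a timestamp, one workaround is to compute the day boundary outside the query and pass explicit TIMESTAMP literals, which OVERWRITE WHERE can evaluate.)
    -- Placeholder dates; substitute the actual day boundaries when issuing the query.
    REPLACE INTO "test-ds"
      OVERWRITE WHERE "__time" >= TIMESTAMP '2025-10-02 00:00:00'
                  AND "__time" <  TIMESTAMP '2025-10-03 00:00:00'
    SELECT *
    FROM "test-ds"
    WHERE "__time" >= TIMESTAMP '2025-10-02 00:00:00'
      AND "__time" <  TIMESTAMP '2025-10-03 00:00:00'
    PARTITIONED BY DAY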
  • d

    Danny Wilkins

    10/06/2025, 3:26 PM
    Hey y'all, I'm tearing my head out over this issue. I'm running on Druid 28 and I tried enabling autoscaling; however, after enabling autoscaling I have an ingestion topic which refuses to scale beyond 1 task. I've removed the autoscaling config, tried to manually set the task count, restarted the instances, replaced the instances, knocked out all of the instances and brought them back up, and nothing's letting this get past 1 task. Should I be looking in ZooKeeper or something?
  • j

    JRob

    10/07/2025, 7:55 PM
    Has anyone gotten Protobuf to work with Schema Registry? I'm struggling to get past the following error:
    Cannot construct instance of `org.apache.druid.data.input.protobuf.SchemaRegistryBasedProtobufBytesDecoder`, problem: io/confluent/kafka/schemaregistry/protobuf/ProtobufSchemaProvider
    The instructions here seem to be wrong: https://druid.apache.org/docs/latest/development/extensions-core/protobuf/#when-using-schema-registry (there is no extensions-core folder, and creating it didn't fix the issue). I also tried placing the jars in extensions/protobuf-extensions and extensions/druid-protobuf-extensions, but still no luck...
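    (A rough, unverified sketch: that missing class normally ships in Confluent's kafka-protobuf-provider artifact, so the error usually means that jar and its dependencies aren't on the extension's classpath. The version, URL and paths below are placeholders to adapt, not confirmed values.)
    # Sketch only: place the Confluent protobuf provider jar in the directory of the
    # extension that is actually on druid.extensions.loadList, then restart Druid.
    cd /opt/druid/extensions/druid-protobuf-extensions
    curl -O "https://packages.confluent.io/maven/io/confluent/kafka-protobuf-provider/7.5.1/kafka-protobuf-provider-7.5.1.jar"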
  • s

    Sanjay Dowerah

    10/08/2025, 8:06 AM
    Hello Druid Community, apologies for repeating this, just wanted to keep the loop alive. I am running Druid on an OpenShift cluster and using the Druid Delta Lake extension (https://github.com/apache/druid/tree/master/extensions-contrib/druid-deltalake-extensions) to connect and load Delta tables. However, I am running into the following issues:
    • Error while loading with the Delta connector: only 1024 records of each constituent Parquet file (each partition of the Delta table) are loaded.
    • There is also an error on the UI as soon as the load is over: ERROR: Request failed with status code 404
    For your reference, here is the query I am using to load:
    REPLACE INTO "table" OVERWRITE ALL
    WITH "ext" AS (
      SELECT *
      FROM TABLE(
        EXTERN(
          '{"type":"delta","tablePath":"path"}',
          '{"type":"parquet"}'
        )
      ) EXTEND ("col1" VARCHAR, "col2" VARCHAR, "col3" VARCHAR, "col4" BIGINT, "col5" VARCHAR, "col6" VARCHAR, "col7" BIGINT, "col8" VARCHAR, "col9" VARCHAR)
    )
    SELECT
      MILLIS_TO_TIMESTAMP("dop" * 1000) AS "__time",
      "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"
    FROM "ext"
    PARTITIONED BY DAY
  • m

    Maytas Monsereenusorn

    10/11/2025, 8:42 PM
    Is there a way in MSQE to cluster by all dimensions (without listing them all), similar to how in an ingestionSpec we can leave partitionDimensions set to null for hash-based partitioning?
  • a

    Adithya Shetty

    10/13/2025, 6:30 PM
    Hi, we are trying to set up MSQ queries. We are using Druid version 29.
  • a

    Adithya Shetty

    10/13/2025, 6:37 PM
    When running MSQ queries we are not seeing any issues in the various stages, including exportResults, but we notice that the files exported to S3 have truncated rows. When the same MSQ queries are configured to export results locally, we are able to see all rows in all files. Below are the Druid properties configured for the MSQ setup. Please check and provide pointers to fix the issue. Thanks in advance.
    druid.extensions.loadList:  '["/opt/druid/extensions/maha-druid-lookups","/opt/druid/extensions/druid-datasketches", "/opt/druid/extensions/druid-avro-extensions","/opt/druid/extensions/druid-s3-extensions", "/opt/druid/extensions/mysql-metadata-storage", "/opt/druid/extensions/druid-kafka-indexing-service","/opt/druid/extensions/druid-orc-extensions","/opt/druid/extensions/druid-multi-stage-query","/opt/druid/extensions/statsd-emitter"]'
      "druid.msq.intermediate.storage.enable": "true",
      "druid.msq.intermediate.storage.type": "s3",
      "druid.msq.intermediate.storage.bucket": "{{aws_account_id}}-demand-reporting-druid-prod-{{ aws_region}}",
      "druid.msq.intermediate.storage.prefix": "reports",
      druid.msq.intermediate.storage.tempDir: '/data/msq'
      druid.export.storage.s3.tempLocalDir: '/tmp/msq'
      druid.export.storage.s3.allowedExportPaths: '["<s3://demand-reporting-asyncreports-druid-prod/export/>"]'
      druid.export.storage.s3.chunkSize: 500MiB
  • r

    Richard Vernon

    10/14/2025, 4:17 PM
    Hello guys, would anyone know how best to perform a whole datasource/segment migration from a 0.16.0 cluster (local storage) to 34.0.0 (S3 storage)? I thought of directly loading the segments + metadata from old to new; however, the metadata format and compression encoding have changed significantly since then (or at least as far as I'm aware). Thanks!
  • v

    Victoria

    10/21/2025, 5:51 PM
    Hey folks, I'm looking for some recommendations or advice.
    Given: Druid cluster v33. Batch index_parallel ingestion tasks that insert data with rollup enabled and hashed partitioning on the site_id dimension (high cardinality). Query granularity: DAY; segment granularity: DAY. During ingestion we don't specify the number of shards (the number of segments is calculated automatically).
    Issue: we noticed hot segments after the ingestion completed. For example, we inserted 5 days of data, and each day has one segment with over 40 million rows (over 1 GiB). At first I thought we simply had that many events for one particular site_id, but when I previewed rows in those oversized segments I noticed there is more than one site_id in them.
    What are your strategies to avoid hot segments if you use hashed partitioning on one dimension? We used one dimension because of the expected query filter (the WHERE clause will always have site_id).
  • m

    Maytas Monsereenusorn

    10/25/2025, 12:54 AM
    Why does MSQE SQL-based ingestion limit a job to 25,000 partitions? I.e., if I want to ingest a huge table and keep a good segment size, the above limit fails the job.
  • w

    Wony

    10/28/2025, 3:06 PM
    Hi everyone, I'm new to Druid and have run into a strange data discrepancy that I'm hoping to get some help with. 🥲 It seems that adding an ORDER BY clause to my GROUP BY aggregation query is incorrectly changing the calculated results.
    The scenario: I am running a query to count total, skipped, and responded answers for a set of questions.
    Query without ORDER BY (correct results): when I run the aggregation without sorting, I get the expected results.
    SELECT
      question_id AS QUESTION_ID,
      COUNT(*) count_row,
      SUM(count_responses) AS TOTAL_RESPONSES,
      SUM(
        CASE
          WHEN option_id IS NULL OR option_id = '' THEN count_responses
          ELSE 0
        END
      ) AS SKIPPED,
      SUM(
        CASE
          WHEN option_id IS NOT NULL AND option_id != '' THEN count_responses
          ELSE 0
        END
      ) AS RESPONDED
    FROM daily_ceu_question_response
    WHERE
      account_id = 'dffdc481-a01f-4051-8d3b-971a925bae14'
      AND event_date >= '2025-09-01'
      AND event_date <= '2025-09-30'
    GROUP BY
      question_id
    For a specific question_id, the output is correct:
    • `count_row`: 6
    • `TOTAL_RESPONSES`: 6
    • `SKIPPED`: 3
    • `RESPONDED`: 3
    Query with ORDER BY (incorrect results): however, when I add ORDER BY SKIPPED DESC to the end of the exact same query, the results for that specific question_id become incorrect:
    -- Same query as above, with this line added at the end:
    ORDER BY SKIPPED DESC
    The output for the same question_id (066a8c94-...-bac7d1) changes to:
    • `count_row`: 5
    • `TOTAL_RESPONSES`: 5
    • `SKIPPED`: 3
    • `RESPONDED`: 2
    One of the "responded" rows seems to disappear from the aggregation, causing the counts to be wrong. This behavior seems like a bug, as an ORDER BY clause should only sort the final result set. Is this a known issue, or something I should open a bug report for on GitHub? Any guidance would be much appreciated ❤️. I am using Druid version 28.0.1. I've attached three screenshots showing the raw data and the different query results. Thanks for your help!
  • j

    JRob

    10/30/2025, 4:50 PM
    The only time I've seen inconsistency in the data is when a segment wasn't loaded at the time of the query. Another thing that can provide additional insight is to check the Explain for your queries.
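    To expand on the Explain suggestion, and only as a sketch reusing the query from the post above: prefixing the statement with EXPLAIN PLAN FOR lets you compare how Druid plans it with and without the ORDER BY.
    EXPLAIN PLAN FOR
    SELECT
      question_id AS QUESTION_ID,
      COUNT(*) count_row,
      SUM(count_responses) AS TOTAL_RESPONSES,
      SUM(CASE WHEN option_id IS NULL OR option_id = '' THEN count_responses ELSE 0 END) AS SKIPPED,
      SUM(CASE WHEN option_id IS NOT NULL AND option_id != '' THEN count_responses ELSE 0 END) AS RESPONDED
    FROM daily_ceu_question_response
    WHERE account_id = 'dffdc481-a01f-4051-8d3b-971a925bae14'
      AND event_date >= '2025-09-01'
      AND event_date <= '2025-09-30'
    GROUP BY question_id
    ORDER BY SKIPPED DESC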
  • u

    Utkarsh Chaturvedi

    10/31/2025, 4:26 AM
    Hi everyone, I'm trying to understand the exact behavior when tieredReplicants is set higher than the number of Historicals in a tier.
    Setup example:
    • Tier has 3 Historicals
    • Datasource configured with tieredReplicants: 5
    Question: what actually happens in this case?
    1. Does Druid cap the replicas at 3 (one per Historical)?
    2. Can a single Historical load multiple copies of the same segment to satisfy the replication factor?
    3. Does it fail/warn/queue the additional replicas?
    I couldn't find explicit documentation about this edge case. The architecture seems designed to distribute segments across different Historicals, but I want to confirm the actual behavior when requested replicas exceed available nodes. Has anyone tested this scenario or can point me to the relevant code/docs that clarify this? Thanks!