# troubleshooting
  • a

    Adithya Shetty

    10/13/2025, 6:37 PM
    When running MSQ queries we see no issues in any stage, including `exportResults`, but the files exported to S3 have truncated rows. When the same MSQ queries are configured to export results locally, all rows appear in all files. Below are the Druid properties configured for the MSQ setup. Please check and provide pointers to fix the issue. Thanks in advance.
    druid.extensions.loadList:  '["/opt/druid/extensions/maha-druid-lookups","/opt/druid/extensions/druid-datasketches", "/opt/druid/extensions/druid-avro-extensions","/opt/druid/extensions/druid-s3-extensions", "/opt/druid/extensions/mysql-metadata-storage", "/opt/druid/extensions/druid-kafka-indexing-service","/opt/druid/extensions/druid-orc-extensions","/opt/druid/extensions/druid-multi-stage-query","/opt/druid/extensions/statsd-emitter"]'
      "druid.msq.intermediate.storage.enable": "true",
      "druid.msq.intermediate.storage.type": "s3",
      "druid.msq.intermediate.storage.bucket": "{{aws_account_id}}-demand-reporting-druid-prod-{{ aws_region}}",
      "druid.msq.intermediate.storage.prefix": "reports",
      druid.msq.intermediate.storage.tempDir: '/data/msq'
      druid.export.storage.s3.tempLocalDir: '/tmp/msq'
      druid.export.storage.s3.allowedExportPaths: '["s3://demand-reporting-asyncreports-druid-prod/export/"]'
      druid.export.storage.s3.chunkSize: 500MiB
  • r

    Richard Vernon

    10/14/2025, 4:17 PM
    Hello guys, would anyone know how best to migrate a whole datasource and its segments from a 0.16.0 cluster (local storage) to a 34.0.0 cluster (S3 storage)? I thought of directly loading the segments and metadata from the old cluster into the new one; however, as far as I'm aware, the metadata format and compression encoding have changed significantly since then. Thanks!
  • v

    Victoria

    10/21/2025, 5:51 PM
    Hey folks. I'm looking for some recommendations or advice.
    Given: Druid cluster v33. Batch index_parallel ingestion tasks insert data with rollup enabled and hashed partitioning on the site_id dimension (high cardinality). Query granularity: Day. Segment granularity: Day. During ingestion we don't specify the number of shards (the number of segments is calculated automatically).
    Issue: We noticed hot segments after the ingestion completed. For example, we inserted 5 days of data, and each day has one segment with over 40 million rows (over 1 GiB). At first I thought we simply had that many events for one particular site_id, but when I previewed the rows in those oversized segments I noticed more than one site_id in them.
    What are your strategies for avoiding hot segments when you use hashed partitioning on one dimension? We used one dimension because of the expected query filter (the WHERE clause will always include site_id).
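For a rough sense of scale, the size of the oversized segment can be compared against a row target. This is only a sketch: the 5,000,000-row target is an assumed value that does not come from the message, `estimate_shards` is a hypothetical helper, and the 40 million figure is the size of one hot segment, so it is a lower bound on the rows per day. Hashing on a single dimension also cannot split one very large site_id across buckets, so the result is an estimate, not a fix.

```python
import math

def estimate_shards(rows_per_interval: int, target_rows_per_segment: int) -> int:
    """Estimate how many hash buckets keep each segment near the row target."""
    if rows_per_interval <= 0 or target_rows_per_segment <= 0:
        raise ValueError("row counts must be positive")
    return math.ceil(rows_per_interval / target_rows_per_segment)

# 40 million rows is the figure from the message (a lower bound per day);
# the 5 million target is an assumption for illustration.
shards = estimate_shards(40_000_000, 5_000_000)

# Hashed partitionsSpec fragment for an index_parallel tuningConfig.
partitions_spec = {
    "type": "hashed",
    "numShards": shards,
    "partitionDimensions": ["site_id"],
}
print(partitions_spec)
```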
  • m

    Maytas Monsereenusorn

    10/25/2025, 12:54 AM
    Why does MSQE SQL-based ingestion limit jobs to 25,000 partitions? If I want to ingest a huge table and keep a good segment size, that limit fails the job.
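To put the limit in numbers, here is a back-of-the-envelope sketch. It assumes the 25,000 cap applies to the total number of output segments (how the limit is counted is not stated in the message) and uses 3,000,000 as an assumed rows-per-segment value. The 200-billion-row table and the `min_rows_per_segment` helper are hypothetical.

```python
import math

MAX_PARTITIONS = 25_000  # limit quoted in the message

def min_rows_per_segment(total_rows: int, max_partitions: int = MAX_PARTITIONS) -> int:
    """Smallest rows-per-segment value that keeps total_rows within max_partitions."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return math.ceil(total_rows / max_partitions)

# Rows that fit under the limit at an assumed 3,000,000 rows per segment.
print(MAX_PARTITIONS * 3_000_000)

# Hypothetical 200-billion-row table: rows per segment needed to fit.
print(min_rows_per_segment(200_000_000_000))
```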
  • w

    Wony

    10/28/2025, 3:06 PM
    Hi everyone, I'm new to Druid and have run into a strange data discrepancy that I'm hoping to get some help with. 🥲 It seems that adding an `ORDER BY` clause to my `GROUP BY` aggregation query is incorrectly changing the calculated results.
    The scenario: I am running a query to count total, skipped, and responded answers for a set of questions.
    Query without `ORDER BY` (correct results): when I run the aggregation without sorting, I get the expected results. SQL:
    SELECT
      question_id AS QUESTION_ID,
      COUNT(*) count_row,
      SUM(count_responses) AS TOTAL_RESPONSES,
      SUM(
        CASE
          WHEN option_id IS NULL OR option_id = '' THEN count_responses
          ELSE 0
        END
      ) AS SKIPPED,
      SUM(
        CASE
          WHEN option_id IS NOT NULL AND option_id != '' THEN count_responses
          ELSE 0
        END
      ) AS RESPONDED
    FROM daily_ceu_question_response
    WHERE
      account_id = 'dffdc481-a01f-4051-8d3b-971a925bae14'
      AND event_date >= '2025-09-01'
      AND event_date <= '2025-09-30'
    GROUP BY
      question_id
    For a specific `question_id`, the output is correct:
    • `count_row`: 6
    • `TOTAL_RESPONSES`: 6
    • `SKIPPED`: 3
    • `RESPONDED`: 3
    Query with `ORDER BY` (incorrect results): however, when I add `ORDER BY SKIPPED DESC` to the end of the exact same query, the results for that specific `question_id` become incorrect. SQL:
    -- Same query as above, with this line added at the end:
    ORDER BY SKIPPED DESC
    The output for the same `question_id` (`066a8c94-...-bac7d1`) changes to:
    • `count_row`: 5
    • `TOTAL_RESPONSES`: 5
    • `SKIPPED`: 3
    • `RESPONDED`: 2
    One of the "responded" rows seems to disappear from the aggregation, causing the counts to be wrong. This behavior seems like a bug, as an `ORDER BY` clause should only sort the final result set. Is this a known issue, or something I should open a bug report for on GitHub? Any guidance would be much appreciated ❤️.
    I am using Druid version 28.0.1. I've attached three screenshots showing the raw data and the different query results. Thanks for your help!
  • j

    JRob

    10/30/2025, 4:50 PM
    The only time I've seen inconsistency in the data is when a segment wasn't loaded at the time of the query. Another thing that can provide additional insight is to check the Explain for your queries.
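As a sketch of the "check the Explain" suggestion, the snippet below builds a Druid SQL request that prefixes a query with `EXPLAIN PLAN FOR`. It only constructs the request and does not send it. The base URL `http://localhost:8888` and the helper name `build_explain_request` are assumptions; the table name is taken from the query earlier in the thread.

```python
import json
import urllib.request

def build_explain_request(sql: str, base_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build (but do not send) a Druid SQL request that asks for the query plan."""
    payload = {"query": "EXPLAIN PLAN FOR " + sql.strip().rstrip(";")}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_explain_request(
    "SELECT question_id, COUNT(*) FROM daily_ceu_question_response GROUP BY question_id"
)
print(req.full_url)
print(req.data.decode("utf-8"))
# To run it against a live cluster: urllib.request.urlopen(req).read()
```

Comparing the plan with and without the `ORDER BY` clause may show whether the two queries are planned as different native query types.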
  • u

    Utkarsh Chaturvedi

    10/31/2025, 4:26 AM
    Hi everyone, I'm trying to understand the exact behavior when `tieredReplicants` is set higher than the number of historicals in a tier.
    Setup example:
    • Tier has 3 historicals
    • Datasource configured with `tieredReplicants: 5`
    Question: What actually happens in this case?
    1. Does Druid cap the replicas at 3 (one per historical)?
    2. Can a single historical load multiple copies of the same segment to satisfy the replication factor?
    3. Does it fail/warn/queue the additional replicas?
    I couldn't find explicit documentation about this edge case. The architecture seems designed to distribute segments across different historicals, but I want to confirm the actual behavior when requested replicas exceed available nodes. Has anyone tested this scenario, or can anyone point me to the relevant code/docs that clarify this? Thanks!
  • m

    Mahesha Subrahamanya

    11/06/2025, 5:22 AM
    Hello Team, I'm trying to integrate with Iceberg (AWS Glue catalog). I followed this documentation: https://druid.apache.org/docs/latest/ingestion/input-sources#iceberg-input-source
    I'm trying to run Druid 31 locally, and the following errors are thrown, so please let me know if there is anything I need to configure.
    druid.extensions.loadList=["druid-hdfs-storage", "druid-iceberg-extensions", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query", "druid-histogram", "druid-lookups-cached-global", "druid-s3-extensions", "druid-parquet-extensions"]
    apache-druid-31.0.0 % ./bin/start-druid
    [Wed Nov 5 21:11:11 2025] Starting Apache Druid.
    [Wed Nov 5 21:11:11 2025] Open http://localhost:8888/ in your browser to access the web console.
    [Wed Nov 5 21:11:11 2025] Or, if you have enabled TLS, use https on port 9088.
    [Wed Nov 5 21:11:11 2025] Starting services with log directory [~/Downloads/apache-druid-31.0.0/log].
    [Wed Nov 5 21:11:11 2025] Running command[zk]: bin/run-zk conf
    [Wed Nov 5 21:11:11 2025] Running command[broker]: bin/run-druid broker ~/Downloads/apache-druid-31.0.0/conf/druid/auto '-Xms7621m -Xmx7621m -XX:MaxDirectMemorySize=5081m'
    [Wed Nov 5 21:11:11 2025] Running command[router]: bin/run-druid router ~/Downloads/apache-druid-31.0.0/conf/druid/auto '-Xms552m -Xmx552m -XX:MaxDirectMemorySize=128m'
    [Wed Nov 5 21:11:11 2025] Running command[coordinator-overlord]: bin/run-druid coordinator-overlord ~/Downloads/apache-druid-31.0.0/conf/druid/auto '-Xms8284m -Xmx8284m'
    [Wed Nov 5 21:11:11 2025] Running command[historical]: bin/run-druid historical ~/Downloads/apache-druid-31.0.0/conf/druid/auto '-Xms8836m -Xmx8836m -XX:MaxDirectMemorySize=13255m'
    [Wed Nov 5 21:11:11 2025] Running command[middleManager]: bin/run-druid middleManager ~/Downloads/apache-druid-31.0.0/conf/druid/auto '-Xms512m -Xmx512m' '-Ddruid.worker.capacity=3 -Ddruid.indexer.runner.javaOptsArray=["-server","-Duser.timezone=UTC","-Dfile.encoding=UTF-8","-XX:+ExitOnOutOfMemoryError","-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager","-Xms341m","-Xmx341m","-XX:MaxDirectMemorySize=341m"]'
    [Wed Nov 5 21:11:12 2025] Command[router] exited (pid = 75021, exited = 1)
    [Wed Nov 5 21:11:12 2025] Command[router] failed, see its logfile for more details
    [Wed Nov 5 21:11:12 2025] Command[coordinator-overlord] exited (pid = 75022, exited = 1)
    [Wed Nov 5 21:11:12 2025] Command[coordinator-overlord] failed, see its logfile for more details
    [Wed Nov 5 21:11:12 2025] Command[middleManager] exited (pid = 75024, exited = 1)
    [Wed Nov 5 21:11:12 2025] Command[middleManager] failed, see its logfile for more details
    [Wed Nov 5 21:11:12 2025] Command[historical] exited (pid = 75023, exited = 1)
    [Wed Nov 5 21:11:12 2025] Command[historical] failed, see its logfile for more details
    [Wed Nov 5 21:11:12 2025] Command[broker] exited (pid = 75020, exited = 1)
    [Wed Nov 5 21:11:12 2025] Command[broker] failed, see its logfile for more details
  • a

    Abdullah Velioğlu

    11/07/2025, 6:34 AM
    Hi everyone, I'm facing a recurring issue in our Druid cluster (version 28.0.1). Every few months (around every 3–4 months), the cluster becomes unresponsive — it stops responding to queries and even the web console becomes inaccessible. When this happens, I see errors like the following in the logs:
    (org.jboss.netty.handler.timeout.ReadTimeoutException: [GET http://scba-live-druid-worker-prod-0:8100/druid-internal/v1/segments?counter=2&hash=1762438725728&timeout=240000] Read timed out)
    
    com.fasterxml.jackson.core.JsonParseException: Invalid type marker byte 0x3c for expected value token
     at [Source: (SequenceInputStream); line: -1, column: 1]
    From what I can tell, it seems like one of the segments being fetched is somehow corrupted, which causes the query or internal communication to hang. However, I haven’t been able to identify why or how that segment becomes corrupted. Interestingly, if I reindex the same data into a new datasource, everything works fine — the issue does not reproduce with the same input data. Has anyone encountered a similar issue or investigated something like this before? Any tips on how to debug or identify the root cause (corrupted segment, metadata inconsistency, etc.) would be really appreciated. Thanks in advance for any insights!
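One small check that may help narrow this down: the byte named in the exception, 0x3c, is the ASCII code for `<`. A response that begins with `<` looks more like an HTML or XML body (for example an error page) than the binary payload the parser expected. That reading is an assumption and is not confirmed by the logs above; the sample bodies in the sketch are hypothetical.

```python
marker = 0x3C  # byte reported in the JsonParseException

# Decode the byte as ASCII to see which character the parser hit first.
first_char = bytes([marker]).decode("ascii")
print(repr(first_char))

# Hypothetical response bodies that would start with that byte.
samples = (b"<html>", b"<?xml version='1.0'?>")
starts_with_marker = [body[:1] == bytes([marker]) for body in samples]
print(starts_with_marker)
```

If that is what happened, capturing the raw response from the failing `/druid-internal/v1/segments` call would show what was actually returned.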
  • d

    David Alexander

    11/14/2025, 6:37 PM
    Hey everyone, For context - running Druid 31.0.2 - am looking at moving a dataset to use ARRAY columns as opposed to MVDs. One very common use case we have is doing a GROUP BY on the MVD column in which the elements are of high cardinality - as a simple example:
    SELECT "tags", sum("count") FROM "array_example" GROUP BY 1
    When `tags` is an MVD column, it returns the results well within SLA. Following the docs for MVD-to-ARRAY use, when I attempt a CROSS JOIN UNNEST like this:
    SELECT "tag", sum("count") FROM "array_example" CROSS JOIN UNNEST("tags") as "tag" GROUP BY 1
    It runs significantly slower, and when there's a high volume of unique tags it results in a timeout. One thing interesting is that I notice if I convert the ARRAY to MVD at query time, like:
    SELECT array_to_mv("tags"), sum("count") FROM "array_example" GROUP BY 1
    it completes within SLA, though from test calls not as fast as when the data is already in MVD form (which is expected). Just a couple of questions:
    1. Is anything off with how I am doing the GROUP BY with CROSS JOIN? Also, does the array_to_mv() usage provide a functionally correct alternative?
    2. Is this performance difference expected when doing a GROUP BY on ARRAY elements?
    Thanks in advance!
  • e

    Etisha Jain

    11/18/2025, 7:32 AM
    Hello everyone. Has anyone worked on reading Kafka protobuf data from Druid? We are getting multiple errors. Can someone get on a call to help me fix it? It's a bit urgent.
  • e

    Etisha Jain

    11/18/2025, 7:33 AM
    please msg me
  • u

    吴花露

    11/24/2025, 5:21 PM
    Could someone help take a look at this PR? Our company is very eager to have it merged into master as soon as possible. https://github.com/apache/druid/pull/18750
  • j

    Joa Ebert

    11/25/2025, 1:28 PM
    Commit https://github.com/apache/druid/commit/b517c3339b4f4ce79c5ff62ae1eb045f795e8cb6 excludes netty's native epoll. When using TLS between ZK and Druid we run into
    Caused by: java.lang.ClassNotFoundException: io.netty.channel.epoll.Epoll
            at org.apache.zookeeper.common.NettyUtils.newNioOrEpollEventLoopGroup(NettyUtils.java:74) ~[zookeeper-3.8.4.jar:3.8.4]
            at org.apache.zookeeper.ClientCnxnSocketNetty.<init>(ClientCnxnSocketNetty.java:87) ~[zookeeper-3.8.4.jar:3.8.4]
    Which is not surprising, given that the aforementioned dependency has been excluded. Are we supposed to use a client connection factory other than `org.apache.zookeeper.ClientCnxnSocketNetty`?
  • e

    Etisha Jain

    11/26/2025, 8:13 AM
    Hello everyone. Has anyone worked on reading Kafka protobuf data from Druid? We are getting multiple errors. Can someone get on a call to help me fix it? It's a bit urgent.
  • e

    Etisha Jain

    11/27/2025, 6:40 AM
    This is the MiddleManager config we are passing, but we are hitting memory issues in some of the MM pods. Please help me fix this issue.
    middleManager:
      ## If false, middleManager will not be installed
      ##
      metricsName: metrics
      metricsPort: 9200
      enabled: true
      name: middle-manager
      replicaCount: 4
      port: 8091
    
      config:
        druid_node_type: 'middleManager'
        druid_worker_capacity: 6
        druid_worker_baseTaskDirs: '["/opt/druid/var/druid/worker_task_baseDir"]'
        druid_worker_baseTaskDirSize: '20000000000'
        DRUID_XMX: 8G
        DRUID_XMS: 8G
        DRUID_MAXDIRECTMEMORYSIZE: 16g
        druid_processing_buffer_sizeBytes: '200000000'
        druid_processing_numMergeBuffers: 3
        druid_processing_numThreads: 6
        druid_server_http_numThreads: 250
        druid_indexer_runner_javaOptsArray: '["-server", "-Xms2g", "-Xmx4g", "-XX:MaxDirectMemorySize=8g", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-XX:+ExitOnOutOfMemoryError", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]'
        druid_indexer_fork_property_druid_processing_buffer_sizeBytes: '200000000'
        druid_indexer_fork_property_druid_processing_numMergeBuffers: 3
        druid_indexer_fork_property_druid_processing_numThreads: 6
        druid_indexer_fork_property_druid_server_http_numThreads: 70
        #new changes
        druid_indexer_task_baseDir: "/opt/druid/var/druid/task_baseDir"
        druid_indexer_task_baseTaskDir: "/opt/druid/var/druid/baseTaskDir"
        druid_indexer_task_restoreTasksOnRestart: true
        druid_processing_tmpDir: "/opt/druid/var/druid/tmpDir"
        druid_indexer_storage_type: "metadata"
        druid_realtime_cache_useCache: true
        druid_realtime_cache_populateCache: true
        druid_cache_type: caffeine
        #groupby
        druid_query_groupBy_maxMergingDictionarySize: "500000000"
        druid_query_groupBy_maxOnDiskStorage: "10000000000"
    
        #autoscaling task
        druid_indexer_autoscale_strategy: "ec2"
        druid_indexer_autoscale_doAutoscale: true
    
        druid_monitoring_monitors: '["org.apache.druid.java.util.metrics.JvmMonitor","org.apache.druid.java.util.metrics.JvmCpuMonitor","org.apache.druid.java.util.metrics.JvmThreadsMonitor"]'
        #druid_monitoring_monitors: '["org.apache.druid.client.cache.CacheMonitor", "org.apache.druid.java.util.metrics.JvmMonitor", "org.apache.druid.java.util.metrics.CpuAcctDeltaMonitor", "org.apache.druid.java.util.metrics.JvmThreadsMonitor", "org.apache.druid.server.metrics.EventReceiverFirehoseMonitor"]'
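A quick way to reason about the memory pressure is to add up what the config above allows each JVM to use. This is a sketch under stated assumptions: it uses only the values from the config, treats the heap and direct-memory limits as the worst case, ignores other JVM overhead (metaspace, thread stacks), and does not know the pod's memory limit, which is not shown in the message.

```python
GIB = 1024 ** 3

# Values copied from the config above.
worker_capacity = 6
peon_heap = 4 * GIB            # -Xmx4g
peon_direct_limit = 8 * GIB    # -XX:MaxDirectMemorySize=8g
mm_heap = 8 * GIB              # DRUID_XMX
mm_direct_limit = 16 * GIB     # DRUID_MAXDIRECTMEMORYSIZE
buffer_bytes = 200_000_000     # fork property processing buffer size
num_merge_buffers = 3
num_threads = 6

# Direct memory a peon needs for processing buffers:
# sizeBytes * (numMergeBuffers + numThreads + 1)
peon_direct_needed = buffer_bytes * (num_merge_buffers + num_threads + 1)

# Worst case if every task slot is busy and each JVM reaches its limits.
peons_total = worker_capacity * (peon_heap + peon_direct_limit)
pod_worst_case = peons_total + mm_heap + mm_direct_limit

print(peon_direct_needed / GIB)
print(pod_worst_case / GIB)
```

If the pod's memory limit is below the worst-case figure, lowering `druid_worker_capacity` or the per-peon limits is one direction to explore.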
  • v

    Vineeth

    11/27/2025, 9:20 AM
    Hi everyone, We have around 150 dimensions in our Druid cluster. Only three columns account for roughly 45% of the total storage volume. After analysing the dimension-level storage usage, we found that these three columns use the hyperUnique data type. Is there a way to reduce the storage consumed by columns with the hyperUnique data type?
  • j

    JRob

    11/28/2025, 9:19 PM
    Hello, I have task logging enabled via:
    druid.indexer.logs.type=file
    druid.indexer.logs.directory=/data/indexing-logs
    druid.indexer.logs.kill.enabled=true
    druid.indexer.logs.kill.durationToRetain=2592000000
    However, I'm noticing that I only have 24 hours of task logs. Am I missing something?
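As a sanity check on the configured value, and assuming `durationToRetain` is expressed in milliseconds, the setting above converts as follows:

```python
from datetime import timedelta

duration_to_retain_ms = 2_592_000_000  # druid.indexer.logs.kill.durationToRetain

# Convert the millisecond value into a duration to read it in days.
retention = timedelta(milliseconds=duration_to_retain_ms)
print(retention.days)
```

Under that assumption the value corresponds to 30 days, so the 24-hour window would have to come from somewhere other than this property.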
  • s

    Shivam Jain

    12/01/2025, 11:23 AM
    Hi team, I am using Druid 24.0.0 on my local machine. I have defined a compaction rule on my datasource. Even though `Skip offset from latest` is set to PT0S, Druid is not auto-compacting the segments. Compaction only happens after a manual server restart. My compaction config looks like:
    {
      "dataSource": "shivam_test_with_diff_dates",
      "taskPriority": 25,
      "inputSegmentSizeBytes": 100000000000000,
      "maxRowsPerSegment": 5000000,
      "skipOffsetFromLatest": "PT0S",
      "tuningConfig": {
        "maxRowsInMemory": null,
        "appendableIndexSpec": null,
        "maxBytesInMemory": null,
        "maxTotalRows": null,
        "splitHintSpec": null,
        "partitionsSpec": {
          "type": "dynamic",
          "maxRowsPerSegment": 5000000,
          "maxTotalRows": null
        },
        "indexSpec": null,
        "indexSpecForIntermediatePersists": null,
        "maxPendingPersists": null,
        "pushTimeout": null,
        "segmentWriteOutMediumFactory": null,
        "maxNumConcurrentSubTasks": null,
        "maxRetry": null,
        "taskStatusCheckPeriodMs": null,
        "chatHandlerTimeout": null,
        "chatHandlerNumRetries": null,
        "maxNumSegmentsToMerge": null,
        "totalNumMergeTasks": null,
        "maxColumnsToMerge": null,
        "type": "index_parallel",
        "forceGuaranteedRollup": false
      },
      "granularitySpec": {
        "segmentGranularity": "DAY",
        "queryGranularity": null,
        "rollup": false
      },
      "dimensionsSpec": null,
      "metricsSpec": null,
      "transformSpec": null,
      "ioConfig": null,
      "taskContext": null
    }
    What is wrong? Can someone please help?
  • a

    A.Iswariya

    12/04/2025, 7:13 AM
    Hi team, I have an existing Druid table and I want to update or delete specific rows/columns. Could anyone share the best approach or examples for doing this?
  • a

    A.Iswariya

    12/04/2025, 9:39 AM
    Hi team, I have an existing Druid datasource and I want to update or delete specific rows/columns from Python code. Could anyone share the best approach or examples for doing this?
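Druid segments are immutable, so one possible approach is to rewrite the affected data without the unwanted rows. The sketch below builds (without sending) a SQL-based ingestion task that does this with a `REPLACE` statement. It assumes the multi-stage query extension is loaded and that the Router is reachable at `http://localhost:8888`. The helper `build_delete_rows_task`, the datasource name, and the `user_id` condition are hypothetical. Note that `WHERE NOT (...)` also drops rows for which the condition evaluates to NULL.

```python
import json
import urllib.request

def build_delete_rows_task(datasource: str, delete_condition: str,
                           base_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build (but do not send) a task that rewrites a datasource without the matching rows."""
    sql = (
        f'REPLACE INTO "{datasource}" OVERWRITE ALL '
        f'SELECT * FROM "{datasource}" '
        f"WHERE NOT ({delete_condition}) "
        "PARTITIONED BY DAY"
    )
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql/task",
        data=json.dumps({"query": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical datasource and condition, for illustration only.
task_req = build_delete_rows_task("my_datasource", "\"user_id\" = 'abc'")
print(json.loads(task_req.data)["query"])
# To submit it: urllib.request.urlopen(task_req).read()
```

Updating a value follows the same pattern, with the `SELECT` list computing the new value (for example with a `CASE` expression) instead of filtering rows out.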
  • d

    Danny Wilkins

    12/05/2025, 2:20 PM
    This morning I came in to discover that, during a deploy that tried to turn on ingestion task autoscaling, Druid deleted supervisors. Does anyone have a clue why that could happen? AFAIK taskCountMin wasn't even set to 0 (we do some Python pre-templating and set it more or less to `math.ceil(int(configuredTaskCount * 0.75))`).
  • d

    Danny Wilkins

    12/05/2025, 2:58 PM
    ^ With regard to this, I figured it out. I accidentally had the supervisor spec with taskCountMin > taskCountMax, which resulted in the supervisor being deleted for having an invalid config. That seems like a great way to cause an outage compared to, say, using the last known good config. Would people be open to a ticket for this?
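A pre-submit check in the templating step could catch this kind of inversion before the spec reaches Druid. The sketch below is hypothetical: `task_count_min` and `validate_autoscaler` are illustrative helpers, and the floor of 1 is an added assumption. It also applies `math.ceil` to the float directly, because in the expression quoted earlier `int(...)` truncates first, which leaves `math.ceil` with nothing to round.

```python
import math

def task_count_min(configured_task_count: int, ratio: float = 0.75) -> int:
    """Round the scaled count up; ceil must wrap the float, not int(...)."""
    return max(1, math.ceil(configured_task_count * ratio))

def validate_autoscaler(config: dict) -> None:
    """Raise before submitting a spec whose bounds are inverted."""
    low, high = config["taskCountMin"], config["taskCountMax"]
    if low > high:
        raise ValueError(f"taskCountMin ({low}) exceeds taskCountMax ({high})")

autoscaler = {"taskCountMin": task_count_min(5), "taskCountMax": 5}
validate_autoscaler(autoscaler)  # passes: 4 <= 5
print(autoscaler)
```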
  • e

    Etisha Jain

    12/08/2025, 6:13 AM
    Hi Team, we are running a `GROUP BY` query on our rollup table, but the execution time is currently around 25–30 seconds. Is there any way to optimize this so that we can achieve sub-second or 3–5 second latency, even with 15–20 concurrent queries running in parallel?
  • e

    Etisha Jain

    12/08/2025, 6:20 AM
    1. Would switching to `topN` be the primary recommendation here for speed?
    2. Are there specific Broker/Historical memory tuning parameters (processing threads/buffers) that help specifically with concurrency?
    3. Does enabling `vectorize: "force"` usually help with Theta Sketches, or should we avoid it?
  • l

    Lionel Mena

    12/12/2025, 2:54 PM
    How is this streaming task auto-restart feature supposed to work? I enabled it, but despite having set a graceful termination period of 6 min (k8s pod termination), the tasks are still marked as failed and the supervisor spawns new tasks on other MM nodes that re-read everything from the failed task. I'm running Druid v34 on k8s with canonical hostnames enabled.
  • m

    Mahesha Subrahamanya

    12/13/2025, 4:46 PM
    Untitled
    Untitled
  • m

    Mahesha Subrahamanya

    12/13/2025, 4:47 PM
    Hi team, I am using Druid 34.0.0. We are using Iceberg and AWS Glue with a Druid ingestion spec. The problem we are noticing is that ingesting just 1,000 records takes almost 5 minutes. We feel that is too expensive, and we are not sure whether we are using Druid's optimizations properly, or whether Druid's support for loading Iceberg data through AWS Glue is itself what causes the performance issue. We are about to run a heavy load (100 million records) and are worried about whether to use Druid for this ingestion process because of the time it takes. Any suggestions or recommendations would be highly appreciated. Thank you so much. My ingestion config looks like:
  • j

    JRob

    12/15/2025, 4:02 PM
    I seem to be getting a lot of logs like this:
    2025-12-15T15:06:29,572 ERROR [Coordinator-Exec-HistoricalManagementDuties-0] org.apache.druid.server.coordinator.ServerHolder - Load queue for server [druid-data13:8083], tier [_default_tier] has [390] segments stuck.: {class=org.apache.druid.server.coordinator.ServerHolder, segments=[REPLICATE{segment=redacted, runsInQueue=182}, REPLICATE{segment=redacted, runsInQueue=77}, ...], maxLifetime=60}
    Different data hosts, different datasources. Anybody else know how to resolve this?
  • v

    Vineeth

    12/17/2025, 8:25 AM
    Hello Druid Experts, Has anyone performed benchmarks specifically for the “Virtual Storage” feature in Druid version 35? I’m interested in understanding the query latency metrics when Virtual Storage is enabled. Also, since this feature is marked as experimental in version 35, is it expected to be production-ready in the next release?