Rohen
06/09/2025, 5:53 AM
Rushikesh Bankar
06/09/2025, 10:32 AM
JRob
06/10/2025, 3:19 PM
Queries on sys.segments are taking upwards of 60 seconds on average. Likewise, our Datasources tab in the Console takes an agonizingly long time to load. But I can't understand why it's so slow; our DB stats don't show any issues.
The druid_segments table is only 1108 MB in size.
From pg_stat_statements:
query | SELECT payload FROM druid_segments WHERE used=$1
calls | 734969
total_exec_time | 1318567198.0990858
min_exec_time | 733.308662
max_exec_time | 13879.650989
mean_exec_time | 1794.0446441947086
stddev_exec_time | 581.4299142612549
----------------------------------------------
query | SELECT payload FROM druid_segments WHERE used = $1 AND dataSource = $2 AND ((start < $3 AND "end" > $4) OR (start = $7 AND "end" != $8 AND "end" > $5) OR (start != $9 AND "end" = $10 AND start < $6) OR (start = $11 AND "end" = $12))
calls | 4888478
total_exec_time | 31912869.00381691
min_exec_time | 0.007730999999999999
max_exec_time | 2166.647028
mean_exec_time | 6.528180960171064
stddev_exec_time | 25.333075336970094
----------------------------------------------
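For anyone else digging into this, a couple of metadata-store checks usually narrow it down. This is a PostgreSQL sketch that assumes the default druid_ table prefix; a large backlog of unused segment rows, or a plan that can't use an index on used, would both explain a ~1.8 s average for that first query.
-- How many segment rows are marked used vs. unused, and how much payload they carry
SELECT used, COUNT(*) AS segment_rows, SUM(octet_length(payload)) AS payload_bytes
FROM druid_segments
GROUP BY used;

-- Check whether the hot query actually gets an index scan on "used"
EXPLAIN (ANALYZE, BUFFERS)
SELECT payload FROM druid_segments WHERE used = true;
If most rows turn out to be unused, trimming them (segment kill tasks plus metadata cleanup) is usually the first thing to try.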
Dinesh
06/12/2025, 4:55 AM
Dinesh
06/12/2025, 5:26 AM
Riccardo Sale
06/16/2025, 10:11 AM
druid.audit.manager.maxPayloadSizeBytes
Looking at the coordinator.compaction.config field, we have seen that this JSON payload has grown to over 30 MB and it is still causing slowdowns when queried.
As an example, the following query: SELECT payload FROM druid_segments WHERE used=? takes up to three seconds.
Any suggestions to solve the above issue? How can we reduce the general size of the payload in coordinator.compaction.config? Would it be possible to write a custom extension for this specific use case?
Thanks in advance!
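For reference, a PostgreSQL sketch to see where those bytes actually sit (it assumes the default druid_ table prefix). The compaction config is stored as a single row in druid_config, and, as far as I understand, each change is also written to druid_audit, which is what druid.audit.manager.maxPayloadSizeBytes caps:
-- Size of each stored config payload
SELECT name, octet_length(payload) AS payload_bytes
FROM druid_config
ORDER BY payload_bytes DESC;

-- Accumulated audit history per config key
SELECT audit_key, COUNT(*) AS entries, SUM(octet_length(payload)) AS total_bytes
FROM druid_audit
GROUP BY audit_key
ORDER BY total_bytes DESC;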
Rajesh Gottapu
06/17/2025, 5:19 AM
Nir Bar On
06/17/2025, 11:06 AM
Nir Bar On
06/17/2025, 12:30 PM
Nir Bar On
06/17/2025, 12:48 PM
Cristi Aldulea
06/18/2025, 7:46 AM
I have an ingestionTimestamp column to support a deduplication job. Additionally, I have a column named tags, which is a multi-value VARCHAR column.
The deduplication is performed using an MSQ (Multi-Stage Query) like the following:
REPLACE INTO "target-datasource"
OVERWRITE
WHERE "__time" >= TIMESTAMP'__MIN_TIME'
AND "__time" < TIMESTAMP'__MAX_TIME'
SELECT
__time,
LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
LATEST_BY("tagSetA", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetA",
LATEST_BY("tagSetB", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetB",
MAX("ingestionTimestamp") AS ingestionTimestamp
FROM "target-datasource"
WHERE "__time" >= TIMESTAMP'__MIN_TIME'
AND "__time" < TIMESTAMP'__MAX_TIME'
GROUP BY
__time,
"entityUID"
PARTITIONED BY 'P1M';
Problem:
After running this query, the tags-like columns (tagSetA, tagSetB) are no longer in a multi-value format. This breaks downstream queries that rely on the multi-value nature of these columns.
My understanding:
MSQ might not support preserving multi-value columns directly, especially when using functions like LATEST_BY.
Question:
How can I run this kind of deduplication query while preserving the multi-value format of these columns? Is there a recommended approach or workaround in Druid to handle this scenario?
Can someone help us with this problem, please?
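One possible workaround, sketched below and not verified against this exact schema: carry the multi-value columns through the aggregation as delimited strings (LATEST_BY handles plain VARCHAR values), then split them back with STRING_TO_MV in an outer query. The '|' delimiter is an assumption and must never occur inside a tag value.
REPLACE INTO "target-datasource"
OVERWRITE
WHERE "__time" >= TIMESTAMP'__MIN_TIME'
  AND "__time" < TIMESTAMP'__MAX_TIME'
SELECT
  "__time",
  "entityId",
  "entityName",
  STRING_TO_MV("tagSetA", '|') AS "tagSetA",  -- back to a multi-value VARCHAR
  STRING_TO_MV("tagSetB", '|') AS "tagSetB",
  "ingestionTimestamp"
FROM (
  SELECT
    __time,
    LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
    LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
    -- flatten each row's multi-value column to one delimited string so LATEST_BY can carry it
    LATEST_BY(ARRAY_TO_STRING(MV_TO_ARRAY("tagSetA"), '|'), MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetA",
    LATEST_BY(ARRAY_TO_STRING(MV_TO_ARRAY("tagSetB"), '|'), MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetB",
    MAX("ingestionTimestamp") AS "ingestionTimestamp"
  FROM "target-datasource"
  WHERE "__time" >= TIMESTAMP'__MIN_TIME'
    AND "__time" < TIMESTAMP'__MAX_TIME'
  GROUP BY __time, "entityUID"
)
PARTITIONED BY 'P1M'
If the joined tag strings can get long, LATEST_BY's optional third maxBytesPerValue argument may be needed so values aren't truncated.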
Vaibhav
06/18/2025, 7:16 PM
Compaction is failing with:
org.apache.druid.java.util.common.IAE: Asked to add buffers[2,454,942,764] larger than configured max[2,147,483,647]
at org.apache.druid.java.util.common.io.smoosh.FileSmoosher.addWithSmooshedWriter(FileSmoosher.java:168)
• On investigation: compaction produces 430 partitions, but the 430th partition (with end=null) gets an unusually high number of rows (~800M+).
What I found:
- A GROUP BY on the 5 range dimensions for a sample day gives ~11.5k unique combinations, e.g.:
SELECT range_dim1, range_dim2, range_dim3, range_dim4, range_dim5, COUNT(*) AS row_count
FROM <datasource>
WHERE __time >= <start of day> AND __time < <end of day>
GROUP BY 1, 2, 3, 4, 5
ORDER BY 1, 2, 3, 4, 5
- However, partition 430 gets all combinations from ~9.5k to ~11.5k in one partition.
- This violates the targetRowsPerSegment: 5M and maxRowsPerSegment: 7.5M config.
Questions:
• Are there better strategies to ensure partitioning respects row count limits?
• Is this behavior a bug or expected?
Any advice or insights appreciated.
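If it helps to confirm the skew on published segments, per-segment row counts are visible in the system tables; a quick sketch (the datasource name is a placeholder):
SELECT "partition_num", "num_rows", "size"
FROM sys.segments
WHERE "datasource" = 'your_datasource'
  AND is_active = 1
ORDER BY "num_rows" DESC
LIMIT 20
With range partitioning and targetRowsPerSegment: 5M, row counts should cluster around 5M, so a single ~800M-row segment stands out immediately.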
Lionel Mena
06/20/2025, 9:41 AM
Stefanos Pliakos
06/25/2025, 11:56 AM
I have configured the readiness probe to use HTTPS:
readinessProbe:
  httpGet:
    path: /status/health
    port: 8082
    scheme: HTTPS
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
However, this way the startup probe is failing; an example for the coordinators:
Warning Unhealthy 40s (x2 over 50s) kubelet Startup probe failed: Get "http://172.31.75.247:8081/status/health": dial tcp 172.31.75.247:8081: connect: connection refused
Warning Unhealthy 0s (x4 over 29s) kubelet Startup probe failed: Get "http://172.31.75.247:8081/status/health": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x03\x00\x02\x02P"
In the CRD there is no configuration for startupProbe, so I can't even change the scheme to HTTPS. Any insights on this? Or do I have to disable the TLS configuration altogether?
My configuration is:
druid.enablePlaintextPort=false
druid.enableTlsPort=true
druid.client.https.protocol=TLSv1.2
druid.client.https.trustStorePath=/opt/druid/conf/druid/cluster/_common/tls/truststore.jks
druid.client.https.trustStorePassword=${env:TRUSTSTORE_PASSWORD}
druid.client.https.trustStoreType=jks
druid.server.https.keyStorePath=/opt/druid/conf/druid/cluster/_common/tls/keystore.jks
druid.server.https.keyStorePassword=${env:KEYSTORE_PASSWORD}
druid.server.https.keyStoreType=jks
Nir Bar On
06/25/2025, 12:14 PM
I want to enable org.apache.druid.server.metrics.TaskCountStatsMonitor. The question is: on which Druid component(s) can it be enabled?
Nir Bar On
06/25/2025, 2:13 PM
Nir Bar On
06/25/2025, 8:56 PM
apurav sharma
06/25/2025, 11:42 PM
Jun 25 15:44:35 127.0.0.1 java.lang.RuntimeException: com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'eu-central-1' (Service: Amazon S3; Status Code: 400; Error Code: AuthorizationHeaderMalformed;
Imply was deployed on AWS EKS via a Helm chart.
Does anyone have any clue? I'm not sure if a change is required in values.yaml to specify the region; if yes, what exactly goes into that parameter? Or is it related to IAM permissions?
Nir Bar On
06/29/2025, 12:25 PM
Neeraj Pmk
07/02/2025, 12:18 PM
PHP Dev
07/03/2025, 8:29 AM
Unknown provider [s3] of Key[type=org.apache.druid.tasklogs.TaskLogs]
s3-druid-extensions is loaded.
What can be wrong?
Amperio Romano
07/07/2025, 12:46 PM
In my Kafka ingestion spec I create a virtual column with a transform:
transformSpec: {
  transforms: [
    {
      type: 'expression',
      name: 'col1_and_col2_virtual',
      expression: "concat(col1, '-', col2)"
    }
  ]
},
and then I create an HLL datasketch as a metric in the same Kafka ingestion to have it pre-aggregated, so that I can count the distinct instances of it quickly:
metricsSpec: [
  {
    name: 'col1_and_col2_hll',
    type: 'HLLSketchBuild',
    fieldName: 'col1_and_col2_virtual',
    lgK: 12,
    tgtHllType: 'HLL_4',
    round: true
  }
]
col1_and_col2_virtual is not in the dimensions, so it is not stored, and everything looks good: it creates col1_and_col2_hll correctly. Both col1 and col2 are always filled.
The problem is when I try to calculate the number of distinct instances:
select
COUNT(*) as num_of_rows,
APPROX_COUNT_DISTINCT_DS_HLL(col1_and_col2_hll) as hll_estimate
from "my_datasource"
hll_estimate is greater than num_of_rows, which sounds really strange to me. I know that it is an estimate, but estimating more than the total is surprising. Am I doing something wrong? Thanks.
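If rollup is enabled on the datasource, this is usually not an error: COUNT(*) at query time counts stored (rolled-up) rows, while the HLL sketch keeps counting distinct pre-rollup values, so the estimate can legitimately exceed the row count. A sketch of the comparison; the "count" metric name is an assumption and only exists if a count aggregator was added to the metricsSpec:
SELECT
  COUNT(*) AS num_of_rows,                              -- rows after rollup
  SUM("count") AS ingested_rows,                        -- assumes a count metric named "count"
  APPROX_COUNT_DISTINCT_DS_HLL(col1_and_col2_hll) AS hll_estimate
FROM "my_datasource"
If ingested_rows is well above num_of_rows, an HLL estimate larger than the stored row count is expected behavior rather than a sketch-accuracy problem.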
Milad
07/07/2025, 5:10 PM
I'm using COMPLEX<KllDoublesSketch> as the type on my sketch column. When I try to compute the median using APPROX_QUANTILE_DS, I get this error from a SQL query:
Error: RUNTIME_FAILURE (OPERATOR)
class org.apache.datasketches.kll.KllDirectDoublesSketch$KllDirectCompactDoublesSketch cannot be cast to class org.apache.datasketches.quantiles.DoublesSketch (org.apache.datasketches.kll.KllDirectDoublesSketch$KllDirectCompactDoublesSketch and org.apache.datasketches.quantiles.DoublesSketch are in unnamed module of loader 'app')
java.lang.ClassCastException
Host: localhost:8083
I tried to run a query using the native query language and I get a slightly different error:
Error: undefined
Please make sure to load all the necessary extensions and jars with type 'kllDoublesSketchMerge' on 'druid/router' service. Could not resolve type id 'kllDoublesSketchMerge' as a subtype of `org.apache.druid.query.aggregation.AggregatorFactory` known type ids = [HLLSketch, HLLSketchBuild, HLLSketchMerge, KllDoublesSketch, KllDoublesSketchMerge, KllFloatsSketch, KllFloatsSketchMerge, arrayOfDoublesSketch, cardinality, count, doubleAny, doubleFirst, doubleLast, doubleMax, doubleMean, doubleMin, doubleSum, expression, filtered, floatAny, floatFirst, floatLast, floatMax, floatMin, floatSum, grouping, histogram, hyperUnique, javascript, longAny, longFirst, longLast, longMax, longMin, longSum, passthrough, quantilesDoublesSketch, quantilesDoublesSketchMerge, singleValue, sketchBuild, sketchMerge, stringAny, stringFirst, stringFirstFold, stringLast, stringLastFold, thetaSketch] (for POJO property 'aggregations') at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 141] (through reference chain: org.apache.druid.query.timeseries.TimeseriesQuery["aggregations"]->java.util.ArrayList[0])
com.fasterxml.jackson.databind.exc.InvalidTypeIdException
I noticed that the error mentions kllDoublesSketchMerge, and the known type IDs have a capital K at the front, so maybe I did something wrong during ingestion? My ingestion command was:
REPLACE INTO "sketch-kll-test"
OVERWRITE ALL
WITH "ext" AS (
SELECT * FROM TABLE (
EXTERN(
'{"type":"local","files":["/sketch-test.csv"]}',
'{"type":"csv","findColumnsFromHeader":true}'
)
) EXTEND (
"Customer ID" BIGINT,
"Customer Name" VARCHAR,
"Library ID" BIGINT,
"Library Name" VARCHAR,
"Sketch" VARCHAR
)
)
SELECT
TIME_PARSE('2025-06-25') AS "__time",
"Customer ID",
"Customer Name",
"Library ID",
"Library Name",
DECODE_BASE64_COMPLEX('KllDoublesSketch', "Sketch") AS "Sketch"
FROM "ext"
PARTITIONED BY DAY
I've included a screenshot showing all the plugins and I think I have those loaded correctly. Everything seems to work with the normal DoublesSketch but I can't get the KLL sketch to work.
Thank You
PHP Dev
07/09/2025, 1:49 PM
Przemek
07/09/2025, 4:17 PM
I have a problem with index_kafka tasks (the cluster is deployed on Kubernetes without the druid-kubernetes-extensions and druid-kubernetes-overlord-extensions extensions). All Kafka ingestion tasks end with failures like Task [index_kafka_XYZ] failed to return start time, killing task.
When I test my ingestion spec in the data loader, it is able to connect to the Kafka topic without any problems.
I checked the logs and found warnings on the coordinator like:
2025-07-08T15:35:06,163 INFO [KafkaSupervisor-Golf_Lakehouse_Commentary_Feed_v2-Worker-0] org.apache.druid.rpc.ServiceClientImpl - Service [index_kafka_XYZ] request [GET http://10.5.176.64:8101/druid/worker/v1/chat/index_kafka_XYZ/time/start] encountered exception on attempt #1; retrying in 2,000 ms (org.jboss.netty.channel.ChannelException: Faulty channel in resource pool)
...
up to 8 retries, and then:
2025-07-08T15:36:10,172 WARN [KafkaSupervisor-Golf_Lakehouse_Commentary_Feed_v2] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Task [index_kafka_XYZ] failed to return start time, killing task (org.apache.druid.rpc.RpcException: Service [index_kafka_XYZ] request [GET http://.../druid/worker/v1/chat/index_kafka_XYZ/time/start] encountered exception on attempt #9)
I also saw some RejectedExecutionException errors; not sure if they are correlated:
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@71ade902[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@4726a5f5[Wrapped task = CallbackListener{org.apache.druid.server.coordination.ChangeRequestHttpSyncer$1@38bb1199}]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@2e9e06a1[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 109941]
I found that peons are not even created, although all 3 of my MiddleManagers show that only ~4 of 11 slots are used.
When I redeployed the coordinator, it started processing 2 of my index_kafka tasks (and 2 peons were visible), but the rest of them were still failing. When they finished, no new peons were visible and no new tasks ended with success.
What can be the reason for that? I even tried with version 31.0.2 and the result was the same. Did something change with version 31+?
Shanmugaraja
07/10/2025, 6:21 AM
John Kowtko
07/10/2025, 2:17 PM
Cristian Daniel Gelvis Bermudez
07/10/2025, 6:31 PM
Victoria
07/11/2025, 3:24 AM
sandy k
07/11/2025, 1:24 PM