Suraj Goel
01/17/2025, 5:11 AM
Mohit Dhingra
01/21/2025, 8:42 AM
Sam
01/21/2025, 10:56 PM
An optimized virtual column allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.
It looks like we don't need to flatten the nested columns since the performance may be similar. Is there still a case where it makes sense to flatten the nested fields instead of using nested columns?
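(A minimal sketch of the two access patterns being compared, assuming a hypothetical datasource "events" with a nested COMPLEX<json> column "attributes" and a pre-flattened STRING column "attributes_country"; all of these names are made up for illustration.)
-- Same lookup, once through the nested column and once through a flattened column.
SELECT
  JSON_VALUE(attributes, '$.country') AS country_from_nested,
  attributes_country AS country_from_flattened
FROM "events"
WHERE JSON_VALUE(attributes, '$.country') = 'US'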
AR
01/22/2025, 4:30 PM
Giorgio Pellero
01/22/2025, 7:06 PM
Is there a way to UNNEST a nested JSON column at ingestion time?
to be more precise, I'm using streaming ingestion from Kafka and each source row has a column that looks like this (simplified): {"items": [{"key": "item1", "value": {...}}, {"key": "item2", "value": {...}}, ..., {"key": "itemN", "value": {...}}]}
- that is, items is not constant-sized.
to make the data easier to work with at query time I'd like to UNNEST it so that it results in something like this:
__time,item_key,item_value
1,item1,`{...}`
2,item2,`{...}`
...
N,itemN,`{...}`
where item_value is a COMPLEX<json>.
I can easily do this when batch ingesting using SQL, for example:
INSERT INTO "my_new_table"
SELECT
  __time,
  JSON_VALUE(item, '$.key') AS item_key,
  JSON_QUERY(item, '$.value') AS item_value
FROM "original_table"
CROSS JOIN UNNEST(JSON_QUERY_ARRAY(original_column, '$.items')) AS item
PARTITIONED BY ...
so is this sort of unnesting possible for streaming ingestion?
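(A minimal sketch of a query-time fallback, in case unnesting during streaming ingestion is not available: ingest the array untouched into a COMPLEX<json> column and apply the same UNNEST in the query. The table name "my_kafka_table" is an assumption, and the alias uses the documented table(column) form.)
-- Hypothetical query-time unnest over the raw COMPLEX<json> column.
SELECT
  __time,
  JSON_VALUE(item, '$.key') AS item_key,
  JSON_QUERY(item, '$.value') AS item_value
FROM "my_kafka_table"
CROSS JOIN UNNEST(JSON_QUERY_ARRAY(original_column, '$.items')) AS t(item)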
David Adams
01/23/2025, 4:54 AM
cacheSize space)
2. On the task heap (one cache per task, requiring cacheSize * tasks space)
I suspect it's on the task heap, but I want to double check with someone more familiar before diving into the source code for a verdict.
Slackbot
01/23/2025, 2:03 PM
oscar de la cruz
01/24/2025, 10:34 PM
Andrew Ho
01/27/2025, 10:39 PM
Siva praneeth Alli
01/29/2025, 6:04 PM
Siva praneeth Alli
01/30/2025, 5:30 PM
Suppose I run REPLACE <table> OVERWRITE WHERE, and my table already has data for a time chunk for which there is no data in my input source (but there is data for other time chunks). Then, when (1) new segments are built and (2) old segments are deleted (since I don't have data in my input source for some existing time chunk): is the visibility of the new segments and the deletion of the old segments atomic? Or, for some period, do both the old segments (to be deleted) and the new segments return data, since the deleted segments and the new segments are for different time chunks? TL;DR: does MVCC apply only to data updates, or to deletions as well?
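(A minimal sketch of the kind of statement being described, just to make the scenario concrete; the table names, columns, and time bounds are made-up examples.)
-- Hypothetical REPLACE that overwrites a month even though the input
-- only has data for part of the existing time chunks.
REPLACE INTO "my_table"
OVERWRITE WHERE __time >= TIMESTAMP '2025-01-01' AND __time < TIMESTAMP '2025-02-01'
SELECT __time, dim1, metric1
FROM "staging_table"
WHERE __time >= TIMESTAMP '2025-01-01' AND __time < TIMESTAMP '2025-02-01'
PARTITIONED BY DAY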
Carlos M
02/03/2025, 7:34 PM
There's a line in basic-cluster-tuning saying "Having a heap that is too large can result in excessively long GC collection pauses, the ~24GiB upper limit is imposed to avoid this."
I remember that line has been there since really early versions of Druid, when only Java 8 was supported; is that still the case for Java 17?
Kiran Kumar Puttakota
02/10/2025, 9:22 AM
Kiran Kumar Puttakota
02/10/2025, 9:22 AM
Kiran Kumar Puttakota
02/10/2025, 10:14 AM
JRob
02/11/2025, 6:49 PM
Nimrod Lahav
02/12/2025, 9:38 AM
LOW 7
MEDIUM 51
HIGH 44
CRITICAL 9
see attached CVE report
did anyone face this issue? Does anyone have a security report I could share that has some explanations / suppression lists?
Test-Bibek
02/12/2025, 11:02 AM
Test-Bibek
02/12/2025, 11:04 AM
Test-Bibek
02/12/2025, 11:15 AM
Sam
02/13/2025, 2:01 AM
Is there something like the rate method from Prometheus to calculate the rate per minute of a counter value while handling the counter resets?
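(A rough sketch of one way to approximate this in Druid SQL, not a confirmed equivalent of Prometheus rate(): bucket the counter per minute, take the delta to the previous bucket with a window function, and treat a negative delta as a reset. It assumes a table "metrics" with columns __time and counter_value, and a Druid version with window functions available; all identifiers are made up.)
-- Per-minute increase of a monotonically increasing counter; a negative
-- delta is assumed to mean the counter reset and restarted from zero.
SELECT
  minute_bucket,
  CASE
    WHEN max_value - LAG(max_value) OVER (ORDER BY minute_bucket) >= 0
      THEN max_value - LAG(max_value) OVER (ORDER BY minute_bucket)
    ELSE max_value
  END AS increase_per_minute
FROM (
  SELECT TIME_FLOOR(__time, 'PT1M') AS minute_bucket, MAX(counter_value) AS max_value
  FROM "metrics"
  GROUP BY 1
)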
Sam
02/13/2025, 2:03 AM
ymcao
02/17/2025, 3:29 AM
Shubham Pratik
02/17/2025, 7:53 AM
Shubham Pratik
02/17/2025, 7:55 AM
Julian Larralde
02/17/2025, 11:12 AM
Tarun Kancherla Chowdary
02/24/2025, 7:20 AM
Samarth Jain
02/25/2025, 8:54 AM
Looking at the getCacheStrategy() method for scan queries, it doesn't have its own implementation and so returns null from the default implementation. As a result, isQueryCacheable() always returns false for this query type here. Is there a reason why scan query results are not cached?
AR
02/25/2025, 1:38 PM
JRob
02/25/2025, 9:14 PM
{
"Timestamp": "2025-02-24T14:13",
"Results": [
{
"Key": "AAA",
"Value": 35
},
{
"Key": "BBB",
"Value": 44
}
]
}
I'd like to end up with rows aggregated hourly like:
__time, Key, sum_Value
2025-02-24T14:00, AAA, 33675
2025-02-24T14:00, BBB, 44876
Or at the very least be able to run queries like:
SELECT Key, SUM(sum_Value)
FROM datasource
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL 1 DAY
GROUP BY 1
The important point is that there is a JSON list containing dimensions and metrics. I would like to group by the dimensions in that list while aggregating the associated metrics. I'm expecting a peak load of around 600K events/second, so storing the metrics without aggregation is not ideal.
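(A minimal sketch of one way to express the desired hourly rollup as a batch SQL ingestion, assuming the raw events land with a COMPLEX<json> column called "payload" holding the structure shown above; all identifiers are assumptions, and this does not settle how to achieve it at the 600K events/second load described.)
-- Hypothetical rollup: unnest the Results array, then group by hour and Key.
INSERT INTO "rolled_up"
SELECT
  TIME_FLOOR(__time, 'PT1H') AS __time,
  JSON_VALUE(kv, '$.Key') AS "Key",
  SUM(JSON_VALUE(kv, '$.Value' RETURNING BIGINT)) AS sum_Value
FROM "raw_events"
CROSS JOIN UNNEST(JSON_QUERY_ARRAY(payload, '$.Results')) AS t(kv)
GROUP BY 1, 2
PARTITIONED BY HOUR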