Sanyog Singh
04/04/2024, 7:32 AM
Nir Bar On
04/05/2024, 11:52 AM
{
  "type": "longSum",
  "name": "packets",
  "fieldName": "packets"
}
Question: what is the difference between “name” and “fieldName”, and what is the purpose of each?
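For reference, a hedged sketch of the distinction as it surfaces in Druid SQL (the datasource name here is hypothetical): "fieldName" is the input column read from each row, while "name" is the label given to the aggregated output, analogous to the column inside the aggregate function versus its AS alias:
SELECT
  SUM("packets") AS "total_packets" -- "packets" plays the role of fieldName; "total_packets" plays the role of name
FROM "network_flows"                -- hypothetical datasource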
Bhavok
04/05/2024, 12:42 PM
Anshu Makkar
04/06/2024, 8:06 AM
Chris
04/07/2024, 3:26 PM
Parth Agrawal
04/08/2024, 2:45 PM
28 Mar 2024 @ 15:58:05.260 UTC Got shutdown request for task[index_kafka_<datasource>_cd0cec90105b458_<D2>]. Asking worker[<indexer-URL>] to kill it.
28 Mar 2024 @ 15:58:05.260 UTC Shutdown [index_kafka_<datasource>_cd0cec90105b458_<D2>] because: [An exception occurred while waiting for task [index_kafka_<datasource>_cd0cec90105b458_<D2>] to pause: [org.apache.druid.java.util.common.ISE: Task [index_kafka_<datasource>_cd0cec90105b458_<D2>] failed to change its status from [READING] to [PAUSED], aborting]]
28 Mar 2024 @ 15:58:05.261 UTC Shutdown [index_kafka_<datasource>_cd0cec90105b458_<D2>] because: [shut down request via HTTP endpoint]
From the code, we see that it goes into this part:
"// Publishing didn't affirmatively succeed. However, segments with our identifiers may still be active
// now after all, for two possible reasons:
//
// 1) A replica may have beat us to publishing these segments. In this case we want to delete the
// segments we pushed (if they had unique paths) to avoid wasting space on deep storage.
// 2) We may have actually succeeded, but not realized it due to missing the confirmation response
// from the overlord. In this case we do not want to delete the segments we pushed, since they are
// now live!"
and we see the log line:
28 Mar 2024 @ 16:03:05.262 UTC. Encountered exception in run() before persisting.
28 Mar 2024 @ 16:03:05.263 UTC Error while publishing segments for sequenceNumber[SequenceMetadata{sequenceId=0, sequenceName='index_kafka_<datasource>_cd0cec90105b458_0', assignments=[], startOffsets={226=1034753252873, 46=1680429116193}, exclusiveStartPartitions=[], endOffsets={226=1034767190850, 46=1680445277001}, sentinel=false, checkpointed=true}]
28 Mar 2024 @ 16:03:05.266 UTC. Failed publish, not removing segments: [<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_104, <datasource>_2024-03-28T15:40:00.000Z_2024-03-28T15:50:00.000Z_2024-03-28T15:40:00.047Z_282]
28 Mar 2024 @ 16:08:39.422 UTC. Found existing pending segment [<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_104] for sequence[index_kafka_<datasource>_cd0cec90105b458_0] (previous = [null]) in DB
Now, the successor tasks created for these tasks, say E1 and E2 (replicas of each other), directly try to publish the segments corresponding to sequence 5. They fail because the offset in the metadata store is still at the startOffset of sequence_0. So we are now in a state where, after every intermediateHandoffPeriod, whenever the task tries to publish segments, it fails because the offsets do not match the value stored in the metadata store.
However, the tasks are able to get the sequence-segment mapping and are committing metadata corresponding to it. We also see SegmentAllocateAction run for all the sequences in the E1 task.
Sequence_0 corresponds to the segment telemetry_metrics_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_104. This info becomes available to the successor tasks, as can be seen here:
28 Mar 2024 @ 16:35:42.453 UTC. Committing metadata[AppenderatorDriverMetadata{segments={index_kafka_<datasource>_cd0cec90105b458_0=[SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:40:00.000Z_2024-03-28T15:50:00.000Z_2024-03-28T15:40:00.047Z_282, state=APPENDING}, SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_104, state=APPENDING}], index_kafka_<datasource>_cd0cec90105b458_1=[SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_135, state=APPENDING}], index_kafka_<datasource>_cd0cec90105b458_2=[SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_227, state=APPENDING}], index_kafka_<datasource>_cd0cec90105b458_3=[SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_231, state=APPENDING}], index_kafka_<datasource>_cd0cec90105b458_4=[SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_233, state=APPENDING}], index_kafka_<datasource>_cd0cec90105b458_5=[SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_254, state=APPENDING}], index_kafka_<datasource>_cd0cec90105b458_6=[SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_325, state=APPENDING}], index_kafka_<datasource>_cd0cec90105b458_7=[SegmentWithState{segmentIdentifier=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_329, state=APPENDING}]}, lastSegmentIds={index_kafka_<datasource>_cd0cec90105b458_0=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_104, index_kafka_<datasource>_cd0cec90105b458_1=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_135, index_kafka_<datasource>_cd0cec90105b458_2=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_227, index_kafka_<datasource>_cd0cec90105b458_3=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_231, index_kafka_<datasource>_cd0cec90105b458_4=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_233, index_kafka_<datasource>_cd0cec90105b458_5=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_254, index_kafka_<datasource>_cd0cec90105b458_6=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_325, index_kafka_<datasource>_cd0cec90105b458_7=<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_329}, callerMetadata={nextPartitions=SeekableStreamEndSequenceNumbers{stream='<streamName>', partitionSequenceNumberMap={226=1034768328624, 46=1680444294138}}}}] for sinks[<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_227:1, <datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_104:64, <datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_325:1, <datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_135:8, <datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_254:2, 
<datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_233:1, <datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_231:1, <datasource>_2024-03-28T15:40:00.000Z_2024-03-28T15:50:00.000Z_2024-03-28T15:40:00.047Z_282:58, <datasource>_2024-03-28T15:50:00.000Z_2024-03-28T16:00:00.000Z_2024-03-28T15:50:00.039Z_329:1].
Thus, the sequenceMap had been updated by the D2 task while the offsetMap had not. We tried restarting the Overlord, but that did not help us mitigate the issue. We ultimately got out of this state by setting the replicas to 1 and waiting until the taskGroup (denoted by its prefix cd0cec90105b458) changed.
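For anyone hitting the same state, a hedged sketch for inspecting the offsets committed to the metadata store (this assumes the default druid table prefix; commit_metadata_payload is a JSON blob containing the partition-to-offset map that the publish transaction compares against):
-- run against the metadata database (e.g. MySQL/PostgreSQL), not Druid itself
SELECT dataSource, created_date, commit_metadata_payload
FROM druid_dataSource
WHERE dataSource = '<datasource>';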
The final state we arrive at is the same as the one described in this issue: https://github.com/apache/druid/issues/8139#issuecomment-801706838
the system gets into a bad state where the supervisor is firing up new tasks based on the offsets that the previous tasks should've worked their way up to, but the previous tasks are not around to complete their job.
We have these questions:
1. How is this sequenceMap passed across tasks, and how does the E1 task become aware that it should start publishing directly from sequence 5 instead of sequence 0? My expectation was that the follow-up tasks would try to publish all the sequences starting from sequence_0, since its offset matches the value stored in the metadata store.
2. Why is D1 directly publishing for sequence 4? It should also have published for sequences 0, 1, 2, and 3.
3. What caused the sequenceMap and the offsetMap to go out of sync? Is the process of publishing segments and updating the metadata store with sequences and offsets not atomic?
4. Why are so many sequences created for such a small window? Even the offset difference is small for the later sequences. What creates a new sequence, and when is it triggered?
5. If we kill a task manually from the UI, is the atomicity of the transaction respected?
cc: @Xavier, @PANKAJ KUMAR
Krishna
04/09/2024, 3:18 AM
index_hadoop job
2024-04-09T03:06:09,556 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2024-04-09T03:06:09,562 WARN [task-runner-0-priority-0] org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2024-04-09T03:06:10,052 WARN [DataStreamer for file /tmp/hadoop-yarn/staging/d_cidruid/.staging/job_1712631488122_0002/job.split] org.apache.hadoop.hdfs.DataStreamer - DataStreamer Exception
java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:40) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.crypto.CryptoOutputStream.freeBuffers(CryptoOutputStream.java:291) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.crypto.CryptoOutputStream.close(CryptoOutputStream.java:225) ~[hadoop-common-2.8.5.jar:?]
at java.io.FilterOutputStream.close(FilterOutputStream.java:188) ~[?:?]
at java.io.FilterOutputStream.close(FilterOutputStream.java:188) ~[?:?]
at org.apache.hadoop.hdfs.DataStreamer.closeStream(DataStreamer.java:987) ~[hadoop-hdfs-client-2.8.5.jar:?]
at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:625) ~[hadoop-hdfs-client-2.8.5.jar:?]
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:801) ~[hadoop-hdfs-client-2.8.5.jar:?]
Exception in thread "DataStreamer for file /tmp/hadoop-yarn/staging/d_cidruid/.staging/job_1712631488122_0002/job.split" java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:40)
at org.apache.hadoop.crypto.CryptoInputStream.freeBuffers(CryptoInputStream.java:683)
at org.apache.hadoop.crypto.CryptoInputStream.close(CryptoInputStream.java:317)
at java.base/java.io.FilterInputStream.close(FilterInputStream.java:180)
at org.apache.hadoop.hdfs.DataStreamer.closeStream(DataStreamer.java:996)
at org.apache.hadoop.hdfs.DataStreamer.closeInternal(DataStreamer.java:839)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:834)
2024-04-09T03:06:10,073 WARN [DataStreamer for file /tmp/hadoop-yarn/staging/d_cidruid/.staging/job_1712631488122_0002/job.splitmetainfo] org.apache.hadoop.hdfs.DataStreamer - DataStreamer Exception
java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:40) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.crypto.CryptoOutputStream.freeBuffers(CryptoOutputStream.java:291) ~[hadoop-common-2.8.5.jar:?]
at org.apache.hadoop.crypto.CryptoOutputStream.close(CryptoOutputStream.java:225) ~[hadoop-common-2.8.5.jar:?]
Krishna
04/09/2024, 7:02 AM
{
"ingestionStatsAndErrors": {
"type": "ingestionStatsAndErrors",
"taskId": "index_hadoop_datsource_dadkjpdk_2024-04-09T05:48:36.827Z",
"payload": {
"ingestionState": "BUILD_SEGMENTS",
"unparseableEvents": null,
"rowStats": {
"determinePartitions": {
"rowsProcessedWithErrors": 0,
"rowsProcessed": 2554582,
"rowsUnparseable": 0,
"rowsThrownAway": 0
},
"buildSegments": {
"rowsProcessedWithErrors": 0,
"rowsProcessed": 2554582,
"rowsUnparseable": 0,
"rowsThrownAway": 0
}
},
"errorMsg": "java.lang.RuntimeException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException\n\tat org.apache.druid.indexing.common.task.HadoopIndexTask.runInternal(HadoopIndexTask.java:507)\n\tat org.apache.druid.indexing.common.task.HadoopIndexTask.runTask(HadoopIndexTask.java:297)\n\tat org.apache.druid.indexing.common.task.AbstractTask.run(AbstractTask.java:169)\n\tat org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:477)\n\tat org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:449)\n\tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException\n\tat org.apache.druid.indexing.common.task.HadoopIndexTask.renameSegmentIndexFilesJob(HadoopIndexTask.java:594)\n\tat org.apache.druid.indexing.common.task.HadoopIndexTask.runInternal(HadoopIndexTask.java:472)\n\t... 8 more\nCaused by: java.lang.reflect.InvocationTargetException\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.apache.druid.indexing.common.task.HadoopIndexTask.renameSegmentIndexFilesJob(HadoopIndexTask.java:588)\n\t... 9 more\nCaused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found\n\tat org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2369)\n\tat org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2793)\n\tat org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)\n\tat org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)\n\tat org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)\n\tat org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)\n\tat org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)\n\tat org.apache.druid.indexer.JobHelper.renameIndexFilesForSegments(JobHelper.java:649)\n\tat org.apache.druid.indexing.common.task.HadoopIndexTask$HadoopRenameSegmentIndexFilesRunner.runTask(HadoopIndexTask.java:958)\n\t... 14 more\nCaused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found\n\tat org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2273)\n\tat org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)\n\t... 22 more\n",
"segmentAvailabilityConfirmed": false,
"segmentAvailabilityWaitTimeMs": 0
}
}
}
Krishna
04/09/2024, 7:03 AM
Peter Marshall
04/09/2024, 8:27 AM
Jianshu Chi
04/09/2024, 6:32 PM
When running a SegmentMetadata query, I found that minmax is not supported for numeric columns (links here). Any reason we are not supporting minmax for numeric columns?
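In the meantime, a possible workaround (a hedged sketch; the datasource and column names are hypothetical) is a plain aggregation query, which returns the same min/max information for a numeric column:
SELECT MIN("packets") AS "min_packets", MAX("packets") AS "max_packets"
FROM "my_datasource";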
Rahul Sharma
04/10/2024, 5:53 AM
Max ZorN
04/10/2024, 9:00 AM
Sadananda Aithal
04/10/2024, 9:14 AM
Uday Singh Matta
04/11/2024, 4:35 AM
{
"queryType": "groupBy",
"dataSource": "EnergyMeterST501",
"intervals": [
"2023-12-23T00:00+03:00/2024-03-23T23:59:59+03:00"
],
"granularity": {
"type": "period",
"period": "P1D",
"timeZone": "Asia/Riyadh"
},
"dimensions": [
"deviceID"
],
"metric": "count",
"aggregations": [
{
"type": "count",
"name": "count",
"fieldName": "deviceID"
}
],
"having": {
"type": "filter",
"filter": {
"type": "in",
"dimension": "deviceID",
"values": [
"2020070100"
]
}
}
}
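Two things worth noting about this spec: "metric" is a topN parameter rather than part of the groupBy spec, and the native "count" aggregator counts rows, so the "fieldName" there has no effect. Since the having clause only filters on a dimension, it could also be a plain filter. A hedged Druid SQL rendering of what appears to be the intended query:
SELECT
  TIME_FLOOR(__time, 'P1D', NULL, 'Asia/Riyadh') AS "day",
  "deviceID",
  COUNT(*) AS "count"
FROM "EnergyMeterST501"
WHERE "deviceID" = '2020070100'
  AND __time >= TIME_PARSE('2023-12-23T00:00:00+03:00')
  AND __time <= TIME_PARSE('2024-03-23T23:59:59+03:00')
GROUP BY 1, 2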
Swapnil Jokhakar
04/11/2024, 2:26 PM
SELECT
dim_field_1,
LOOKUP(CONCAT(dim_field_1, ''), 'lookup_1') AS "lookup_1_val",
LOOKUP(CONCAT(dim_field_2, ''), 'lookup_2') AS "lookup_2_val",
LOOKUP(CONCAT(dim_field_2, ''), 'lookup_3') AS "lookup_3_val",
dim_field_2,
LOOKUP(CONCAT(dim_field_2, ''), 'lookup_4') AS "lookup_4_val",
(SUM(cost)) AS "cost"
FROM test_table
WHERE
dim_field_2 IN (2040)
AND __time >= '2024-03-01 00:00:00' AND __time <= '2024-03-31 23:59:59'
AND dim_field_1 IN ('1608826050', '1608826054', '1608826048', '1608818855', '1608818825', '1609091475', '1608818895', '1608818852', '1608818865', '1608818753', '1608818775', '1608818724', '1609027835', '1608273535', '1608916347', '1608268938', '1608818799', '1608818766', '1608818852', '1609569765', '1608273487', '1609347997', '1609477098', '1608826096', '1609559940', '1608916420')
GROUP BY
LOOKUP(CONCAT(dim_field_2, ''), 'lookup_3'),
dim_field_1,
LOOKUP(CONCAT(dim_field_1, ''), 'lookup_1'),
LOOKUP(CONCAT(dim_field_2, ''), 'lookup_3'),
LOOKUP(CONCAT(dim_field_2, ''), 'lookup_4'),
dim_field_2
In the above query, the GROUP BY clause contains all the fields from the SELECT clause except the aggregated field, but in a different order than they appear in the SELECT clause. In this case the query returns records with a 1:M mapping between the dim_field_2 field (key) and the lookup_2 value. If I reorder the GROUP BY to match the order in which the fields appear in the SELECT clause, the query returns records without the 1:M mapping. This seems like odd behavior. Could anyone please help me understand why the above query returns records with a 1:M mapping between the dim_field_2 field (key) and the lookup_2 value?
I have also tried querying lookup_2 directly, filtering in the WHERE clause on the value field with the list of values that appear as 1:M in the above query result; that query returns a 1:1 mapping only. This is expected behavior, since Druid lookups have a 1:1 mapping between key and value.
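For comparison, a hedged sketch where the GROUP BY lists exactly the non-aggregated SELECT expressions; note that the original GROUP BY lists the lookup_3 expression twice and never groups by the lookup_2 expression, which is the value showing the 1:M mapping:
SELECT
  dim_field_1,
  LOOKUP(CONCAT(dim_field_1, ''), 'lookup_1') AS "lookup_1_val",
  LOOKUP(CONCAT(dim_field_2, ''), 'lookup_2') AS "lookup_2_val",
  LOOKUP(CONCAT(dim_field_2, ''), 'lookup_3') AS "lookup_3_val",
  dim_field_2,
  LOOKUP(CONCAT(dim_field_2, ''), 'lookup_4') AS "lookup_4_val",
  SUM(cost) AS "cost"
FROM test_table
WHERE dim_field_2 IN (2040)
  AND __time >= '2024-03-01 00:00:00' AND __time <= '2024-03-31 23:59:59'
  AND dim_field_1 IN ('1608826050', '1608826054') -- abbreviated; use the full ID list from above
GROUP BY
  dim_field_1,
  LOOKUP(CONCAT(dim_field_1, ''), 'lookup_1'),
  LOOKUP(CONCAT(dim_field_2, ''), 'lookup_2'),
  LOOKUP(CONCAT(dim_field_2, ''), 'lookup_3'),
  dim_field_2,
  LOOKUP(CONCAT(dim_field_2, ''), 'lookup_4')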
Ashish Kumar
04/12/2024, 7:10 AM
We have ingested the same data using index-parallel and msqe into two different tables.
The total data size of the table ingested through index-parallel is ~190 MB, while the total data size of the table ingested using msqe is ~166 MB.
Does anyone know why the data size differs between the two tables, despite the row counts being the same?
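One way to compare the two tables (a hedged sketch via Druid's sys tables; column availability can vary by Druid version) is to pull per-datasource segment sizes and row counts from sys.segments:
SELECT "datasource",
       SUM("size") AS total_size_bytes,
       SUM("num_rows") AS total_rows
FROM sys.segments
WHERE is_active = 1
GROUP BY 1;
Identical row counts with different byte sizes usually come down to segment layout: partitioning and row ordering can differ between ingestion methods, and that changes how well the columns compress.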
Bharat
04/12/2024, 8:56 AM
Amit Jain
04/12/2024, 10:30 AM
Pio Salvatore MORRONE (KEBULA)
04/12/2024, 1:25 PM
PINNURI SAIKRISHNA PRASAD
04/13/2024, 5:11 AM
Igor Berman
04/15/2024, 10:22 AM
Max ZorN
04/15/2024, 11:39 AM
sriramdas sivasai
04/16/2024, 6:31 AM
Nimrod Lahav
04/16/2024, 9:42 AM
(29.0.1)
• https://hub.docker.com/r/apache/druid
• docker file: https://github.com/apache/druid/blob/master/distribution/docker/Dockerfile
Security scans are showing 35 Critical and 84 High findings. Wanted to ask how you managed to deploy this in an enterprise/prod environment that won't allow critical findings, or is there any other image we can use?
AR
04/16/2024, 3:15 PM
Kai Sun
04/17/2024, 12:03 AM
Regarding druid.processing.buffer.sizeBytes: my understanding of this config is that it sets the size of the intermediate buffer / merge buffer. The question is: if we set a small value, say 10 MB, will it cause groupBy or topN query failures?
The underlying reason I ask is that each thread in the processing pool uses one of these buffers when processing the hydrants/segments. Thus a large thread count, say 100, with a buffer size of 500 MB would use 50 GB. That is pretty huge, meaning one MiddleManager server may hold only one or two peons.
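For reference, this matches the documented direct-memory sizing guidance for a Druid process (a sketch of the relationship, not exact accounting):
direct memory needed ≈ druid.processing.buffer.sizeBytes * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1)
At 100 threads and 500 MB buffers the processing buffers dominate the footprint, which is why large thread counts usually pair with smaller buffer sizes.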
AR
04/17/2024, 5:23 AM
AR
04/17/2024, 5:32 AM
JRob
04/17/2024, 3:35 PM