# dev
  • Ashi Bhardwaj (01/05/2025, 9:38 AM)
    Hi folks 👋 Can someone please review this PR? Thanks!
  • Maytas Monsereenusorn (01/07/2025, 8:38 PM)
    Hi! Could I get reviews on this PR when folks have the time? https://github.com/apache/druid/pull/17439 Would be great to have it in v32. Thanks!
  • Maytas Monsereenusorn (01/18/2025, 1:18 AM)
    What do people think about emitting histograms for Druid metric emission? I think currently all the metric values are just single numbers.
  • Hazmi (01/20/2025, 10:03 AM)
    Hey! Could someone please review this PR when you have the time? https://github.com/apache/druid/pull/17646
  • Suraj Goel (01/27/2025, 5:10 PM)
    Hi Team, can someone please review this PR? TIA!
  • Suraj Goel (01/28/2025, 7:02 AM)
    Hi Team, Please review this PR to improve S3 upload speed.
  • info Advisionary (01/30/2025, 11:08 AM)
    I am working with Apache Druid 31.0.1 and I need to apply a spatial filter using a polygon shape. Specifically, I want to filter data based on whether a point falls within a given polygon. Can anyone provide an example of how to set up and use spatial filters with polygons in Druid? I've read through the documentation and tried various filter options, but I'm having trouble with the correct syntax for defining the polygon in a spatial filter. I would appreciate any examples or pointers on how to structure this in Druid 31.0.1. There is no example of using a polygon in spatial filters in Druid's official documentation. Any help will be appreciated.
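    Since the thread asks for a concrete shape, here is a minimal sketch of a polygon bound, following the bound types described in Druid's spatial-filter documentation. The dimension name and vertex coordinates are placeholders; abscissa and ordinate hold the first and second coordinate of each polygon vertex, respectively:

    ```json
    {
      "filter": {
        "type": "spatial",
        "dimension": "coordinates",
        "bound": {
          "type": "polygon",
          "abscissa": [35.0, 35.0, 37.0, 37.0],
          "ordinate": [-122.0, -120.0, -120.0, -122.0]
        }
      }
    }
    ```

    This filter would match rows whose spatial dimension point falls inside the quadrilateral defined by the four (abscissa, ordinate) vertex pairs.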
  • Maytas Monsereenusorn (01/31/2025, 6:05 AM)
    Can I get a quick review on https://github.com/apache/druid/pull/17652? Thanks!
  • Suraj Goel (02/13/2025, 10:13 AM)
    Hi Team, please review this PR for a bug fix - https://github.com/apache/druid/issues/17722 Related thread - https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1739196991316389 Thanks!
  • Ashwin Tumma (02/21/2025, 1:10 AM)
    Hi, Can someone please help review these two PRs: https://github.com/apache/druid/pull/17744 https://github.com/apache/druid/pull/17745 Thanks!
  • Jamie Chapman-Brown (02/27/2025, 6:23 PM)
    Hi there! I'm the guy who added RabbitMQ superstream ingestion to Druid. It's been working well for us; thanks for the help getting it integrated! We've been using druid-exporter to get statistics out to Prometheus, and we used to have RabbitMQ ingest stats to use in monitoring, like druid_emitted_metrics{metric_name="ingest-rabbit-lag"}. We've recently switched to using the Prometheus plugin, but I can't find any rabbit ingest stats. Am I missing anything? Can anyone point me to what I would need to change to get these stats back?
  • Mikhail Sviatahorau (03/05/2025, 3:51 PM)
    Hey all! Looking for input on how to prevent compaction tasks from publishing overlapping segments. Here's what happened:
    • Auto-compaction was running every 30 minutes, compacting daily-partitioned data.
    • A manual compaction reindexed the same period with a different granularity.
    • Once the manual compaction finished, some previously scheduled auto-compaction tasks (created before the reindexing) started running.
    • These tasks compacted data with the old granularity, causing a mix of granularities and overlapping segments (daily segments were created inside a monthly one).
    • After the data was compacted again, things stabilized.
    For a fix, we need to prevent auto-compaction from publishing overlapping segments. We only know which segments will be published at the InputSourceProcessor level on the worker. The indexing task has a coordinator-issued prefix, which helps identify auto-compaction and could be used to check and reject publishing segments that don't match the current cluster state. The problem is that this prefix isn't accessible at the level where segments are being chosen. The options are to add some preprocessing of the input source in generateAndPublishSegments in IndexTask, or to pass a flag to the InputSourceProcessor.process() method, but both feel like a last resort. Curious to hear your thoughts: if not there, where do you think this could best be handled?
  • Maytas Monsereenusorn (03/24/2025, 7:07 PM)
    We can't include a dependency on a GNU General Public License v2.0 library, right?
  • Maytas Monsereenusorn (04/16/2025, 9:38 PM)
    Would it be possible to have SegmentMetadata queries return the distinct values of a column? Similar to how they already support cardinality, this would just be looking at the keys in the string column dictionaries?
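    For reference, the existing cardinality support being compared against is requested through a segmentMetadata query's analysisTypes; a minimal example (datasource name and interval are placeholders):

    ```json
    {
      "queryType": "segmentMetadata",
      "dataSource": "my_datasource",
      "intervals": ["2025-01-01/2025-02-01"],
      "analysisTypes": ["cardinality"]
    }
    ```

    The proposal would presumably add an analysis type alongside cardinality that returns the dictionary keys themselves rather than just their count.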
  • Maytas Monsereenusorn (04/23/2025, 7:36 PM)
    Do we have any plan to extend query lanes to Historicals and realtime ingestion tasks (i.e. Peons)? Having query lanes only on the Broker makes them less useful. For example, a single very expensive query (which would run in a low lane on the Broker) could still take up all the processing threads on a Historical and block other queries from running there. CC: @Clint Wylie (as author of the query lanes on the Broker). Thanks!!
  • Abhishek Balaji Radhakrishnan (05/07/2025, 12:31 AM)
    A fix for the json_merge() function when someone gets a chance: https://github.com/apache/druid/pull/17983. Thanks!
  • Abhishek Balaji Radhakrishnan (05/12/2025, 9:09 PM)
    Could someone take a look at this fix https://github.com/apache/druid/pull/17997 for the linked issue? Thanks!
  • Abhishek Balaji Radhakrishnan (05/23/2025, 1:42 AM)
    Could someone take a look at this fix for a segment unavailability related bug: https://github.com/apache/druid/pull/18025
  • Maytas Monsereenusorn (05/28/2025, 2:03 AM)
    Thinking about the Threshold prioritization strategy (https://druid.apache.org/docs/latest/configuration/#threshold-prioritization-strategy) and wondering whether the following changes would make sense (I haven't tested any of these and don't know if they would be useful in practice):
    • What if we could stack the violatesThreshold checks and penalize a query more if it violates multiple thresholds? i.e.
    int toAdjust = 0;
    if (violatesPeriodThreshold) {
      toAdjust += adjustment;
    }
    if (violatesDurationThreshold) {
      toAdjust += adjustment;
    }
    if (violatesSegmentThreshold) {
      toAdjust += adjustment;
    }
    if (violatesSegmentRangeThreshold) {
      toAdjust += adjustment;
    }
    if (toAdjust != 0) {
      final int adjustedPriority = theQuery.context().getPriority() - toAdjust;
      return Optional.of(adjustedPriority);
    }
    • What if we could set the adjustment value for each threshold separately? i.e.
    int toAdjust = 0;
    if (violatesPeriodThreshold) {
      toAdjust += periodThresholdAdjustment;
    }
    if (violatesDurationThreshold) {
      toAdjust += durationThresholdAdjustment;
    }
    if (violatesSegmentThreshold) {
      toAdjust += segmentThresholdAdjustment;
    }
    if (violatesSegmentRangeThreshold) {
      toAdjust += segmentRangeThresholdAdjustment;
    }
    if (toAdjust != 0) {
      final int adjustedPriority = theQuery.context().getPriority() - toAdjust;
      return Optional.of(adjustedPriority);
    }
    The motivation for the first change is that a query that violates N thresholds should be penalized more than (not equally to) a query that violates N-1 thresholds. The motivation for the second change is that some violations are worse than others; e.g. violating periodThreshold is not as bad as violating segmentRangeThreshold. The prioritization value would then carry over to the Historicals and could help with resource prioritization on the Historical processing thread pool (related to this discussion https://apachedruidworkspace.slack.com/archives/C030CMF6B70/p1745436989786489). CC: @Gian Merlino @Clint Wylie
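    The two ideas above can be combined into one runnable sketch: per-threshold adjustments that stack, so violating more thresholds always lowers priority further. The method and parameter names here are invented for the sketch and are not Druid's actual prioritization-strategy API:

    ```java
    import java.util.OptionalInt;

    public class StackedThresholdPenalty {
      // Each violated threshold contributes its own configured penalty, so a
      // query violating more thresholds is deprioritized strictly more than
      // one violating fewer.
      static OptionalInt adjustedPriority(
          int basePriority,
          boolean violatesPeriod, int periodAdjustment,
          boolean violatesDuration, int durationAdjustment,
          boolean violatesSegment, int segmentAdjustment,
          boolean violatesSegmentRange, int segmentRangeAdjustment) {
        int toAdjust = 0;
        if (violatesPeriod) {
          toAdjust += periodAdjustment;
        }
        if (violatesDuration) {
          toAdjust += durationAdjustment;
        }
        if (violatesSegment) {
          toAdjust += segmentAdjustment;
        }
        if (violatesSegmentRange) {
          toAdjust += segmentRangeAdjustment;
        }
        // No violation: leave the priority untouched, matching the existing
        // strategy's behavior of returning an empty Optional.
        return toAdjust == 0
            ? OptionalInt.empty()
            : OptionalInt.of(basePriority - toAdjust);
      }

      public static void main(String[] args) {
        // Two violations stack: priority 0 drops by 10 + 40 = 50.
        System.out.println(adjustedPriority(0, true, 10, false, 20, false, 30, true, 40));
      }
    }
    ```

    With a single shared adjustment value this reduces to the first proposal; with distinct values it expresses the second (e.g. a small periodAdjustment and a large segmentRangeAdjustment).
    
    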
  • Abhishek Balaji Radhakrishnan (05/28/2025, 3:02 AM)
    Was looking into this feature: https://github.com/apache/druid/pull/13967. I just commented there, but I'm wondering if there are any plans to revive this PR?
  • Soman Ullah (05/28/2025, 7:15 PM)
    Hello, does lookup loading disabling work for Kafka tasks? I see it works for MSQ tasks.
  • Jesse Tuglu (06/03/2025, 11:11 PM)
    Hi, wanted to open a thread on what folks think about storing commit metadata for multiple supervisors in the same datasource (e.g. the operation here). Currently, Druid stores commit metadata 1:1 with datasource. This update is done either in a shared tx with segment publishing, or in an isolated commit. From what I can see, implementors of DataSourceMetadata are solely supervisor-based (either materialized view or seekable stream); ObjectMetadata seems to only be used in tests. The way I see it there are ≥ 2 options:
    • Commit a datasource metadata row per supervisor: likely the easiest, but will take some re-working of the SegmentTransactionalInsertAction API and others, which assume these rows are keyed by datasource. I'm currently doing this and it seems to work fine.
    • Commit a single row per datasource, storing partitions per supervisor ID and doing merges in the plus/minus methods, with the payload being something like map[supervisor_id] = SeekableStreamSequenceNumbers. This might suffer from write contention, since N supervisors * M tasks per supervisor will be attempting to write new updates in the commit payload to this row in the DB.
  • Allen Madsen (06/10/2025, 9:39 PM)
    I could use some help thinking about how to tackle this problem. We have a table that's starting to become too big for a global lookup. We attempted to use a loading lookup; however, the initial queries are too slow. In a query I'm using to test, a fetch for a single record takes ~40ms. The problem is that there are about 1000 distinct values, which equates to ~30s of total runtime, because the loading lookup looks up each value independently. When I query all 1000 values together, the total time for the query is ~80ms. At first, I noticed applyAll could be overridden on the LookupExtractor and thought that might be a way to have it batch query lookups. However, I noticed that applyAll is never called. Abstractly, I'd like Druid to tell the lookup to prefetch all the values it needs before joining, or have the joinCursor iterate in batches and be able to make calls to the database. What are y'all's thoughts on the best way to approach this problem?
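    The batched access pattern being asked for (one bulk fetch instead of N independent point lookups) can be sketched generically; this is illustrative only, not Druid's LookupExtractor API:

    ```java
    import java.util.HashMap;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.function.Function;

    public class BatchedLookup {
      // Resolve many keys with a single call to a bulk fetcher, instead of
      // one round trip per key.
      static Map<String, String> applyAll(
          Iterable<String> keys, Function<Set<String>, Map<String, String>> bulkFetch) {
        // Dedupe first: many input rows typically share a small set of keys.
        Set<String> distinct = new LinkedHashSet<>();
        keys.forEach(distinct::add);
        // Single batched call (e.g. one SQL query with an IN clause).
        return bulkFetch.apply(distinct);
      }

      public static void main(String[] args) {
        Map<String, String> table = Map.of("a", "1", "b", "2");
        Map<String, String> resolved = applyAll(
            List.of("a", "b", "a", "b"),
            distinctKeys -> {
              Map<String, String> result = new HashMap<>();
              for (String k : distinctKeys) {
                if (table.containsKey(k)) {
                  result.put(k, table.get(k));
                }
              }
              return result;
            });
        System.out.println(resolved); // resolves 2 distinct keys with one bulk call
      }
    }
    ```

    At ~40ms per round trip, 1000 point lookups cost tens of seconds, while one batched query over the same keys is on the order of the single-query latency, which matches the ~80ms figure above.
    
    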
  • Jesse Tuglu (06/13/2025, 1:46 AM)
    @Clint Wylie 👋 Do you folks still run Druid historical nodes with transparent huge pages disabled? (Tagging you as I see you were the author of this documentation commit)
  • Jesse Tuglu (06/17/2025, 7:31 PM)
    👋 @Zoltan Haindrich, I noticed in the current build that this line points to a personal repo of yours. It seems like during the build, Maven actually scans that repo for other modules (not just quidem):
    [INFO] ------------------< org.apache.druid:druid-quidem-ut >------------------
    [INFO] Building druid-quidem-ut 34.0.0-SNAPSHOT                         [80/80]
    [INFO]   from quidem-ut/pom.xml
    [INFO] --------------------------------[ jar ]---------------------------------
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-multi-stage-query/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-datasketches/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-orc-extensions/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-parquet-extensions/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-avro-extensions/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-protobuf-extensions/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-s3-extensions/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-kinesis-indexing-service/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-azure-extensions/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-google-extensions/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-hdfs-storage/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-histogram/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/druid-aws-common/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/druid-processing/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/druid-sql/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/druid-indexing-service/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/druid-indexing-hadoop/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/mysql-metadata-storage/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-kafka-indexing-service/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-basic-security/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-lookups-cached-global/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/druid-testing-tools/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/extensions/simple-client-sslcontext/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/druid-services/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/druid-server/34.0.0-SNAPSHOT/maven-metadata.xml>
    Downloading from datasets: <https://raw.githubusercontent.com/kgyrtkirk/datasets/repo/org/apache/druid/druid-gcp-common/34.0.0-SNAPSHOT/maven-metadata.xml>
    Wondering if you knew about this, and whether this was intentional cc @Gian Merlino
  • Soman Ullah (07/08/2025, 9:52 PM)
    Does MSQ replace always create tombstones? Is there a way to get rid of them if they don't have eternity timestamps?
  • Cristian Daniel Gelvis Bermudez (07/09/2025, 4:00 PM)
    Hello everyone, I'm trying to extract data from deep storage with a query to the /druid/v2/sql/statements/ endpoint. The task runs fine, but at the end the following error occurs, preventing me from extracting the query response:
    {
      "error": "druidException",
      "errorCode": "notFound",
      "persona": "USER",
      "category": "NOT_FOUND",
      "errorMessage": "Query [query-9578562a-94f0-452d-998a-e66e0f7d0ff5] was not found. The query details are no longer present or might not be of the type [query_controller]. Verify that the id is correct.",
      "context": {}
    }
    Does anyone know why this happens?
  • Jesse Tuglu (07/15/2025, 1:53 AM)
    Hey folks, question: is ListFilteredDimensionSpec not allowed to have null as an element in its values array? Or is this a bug? Example:
    {
      "dimensions": [
        {
          "type": "listFiltered",
          "delegate": "based_on",
          "values": [
            null,
            "A",
            "B"
          ]
        }
      ],
      "aggregations": [
        {
          "type": "doubleSum",
          "fieldName": "value",
          "name": "value"
        }
      ],
      "intervals": [
        "2025-01-01T00:00:00.000Z/2025-07-10T23:59:59.999Z"
      ],
      "queryType": "groupBy",
      "granularity": "all",
      "dataSource": "datasource_A"
    }
    will fail with
    Error: RUNTIME_FAILURE (OPERATOR)
    
    Cannot invoke "String.getBytes(String)" because "string" is null
    
    java.lang.NullPointerException
    The line in question is this. Passing in an empty string in place of the null returns null values, so this is a partial work-around for now.
  • Jesse Tuglu (07/17/2025, 6:20 PM)
    @Gian Merlino @Clint Wylie Wanted to get some clarification on whether ingesting empty strings in both batch/streaming ingests causes the value to be inserted as null into the segment when druid.generic.useDefaultValueForNull=true. Is this the expected behavior? See the data loader photo attached, where it appears to show parsing row 3's string_value as an empty string. However, post-segment creation, I took a dump of the segment:
    {"__time":1704070860000,"title":"example_1","string_value":"some_value","long_value":1,"double_value":0.1,"float_value":0.2,"multi_value":["a","b","c"],"count":1,"double_value_doubleSum":0.1,"float_value_floatSum":0.2,"long_value_longSum":1}
    {"__time":1704070920000,"title":"example_2","string_value":"another_value","long_value":2,"double_value":0.2,"float_value":0.3,"multi_value":["d","e","f"],"count":1,"double_value_doubleSum":0.2,"float_value_floatSum":0.3,"long_value_longSum":2}
    {"__time":1704070980000,"title":"example_3","string_value":null,"long_value":0,"double_value":0.0,"float_value":0.0,"multi_value":null,"count":1,"double_value_doubleSum":0.0,"float_value_floatSum":0.0,"long_value_longSum":0}
    {"__time":1704071040000,"title":"example_4","string_value":null,"long_value":0,"double_value":0.0,"float_value":0.0,"multi_value":null,"count":1,"double_value_doubleSum":0.0,"float_value_floatSum":0.0,"long_value_longSum":0}
    You can see that in row 3, the string_value column has replaced what I'd expect to be an empty string ("") with a null. This is running on v31, with druid.generic.useDefaultValueForNull=true. I've tested that running on v33 with druid.generic.useDefaultValueForNull=false produces the expected result ("" stored instead of null).
  • Chris Warren (07/24/2025, 4:23 PM)
    When creating extensions that depend on other extensions... is there a way to actually load those dependencies reliably? Specifically, I'm trying to implement the RoleProvider interface that is defined in the druid-basic-security extension, but if I try to make my extension a standalone extension (using the provided scope for the druid-basic-security dependency in my pom.xml) I get Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/druid/security/basic/authorization/RoleProvider. However, if I shade the jar, then it does work. Also, if I build it into druid-basic-security, that also works. Am I missing something about how the interface might be implemented outside of this core extension package? Anyone have experience with this?
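    For reference, the provided-scope setup described above would look roughly like this in the extension's pom.xml (the version property is a placeholder). provided keeps the druid-basic-security classes out of the extension jar, so at runtime they must already be visible to whichever classloader loads the extension; the NoClassDefFoundError is consistent with those classes not being on that path:

    ```xml
    <dependency>
      <groupId>org.apache.druid.extensions</groupId>
      <artifactId>druid-basic-security</artifactId>
      <version>${druid.version}</version>
      <scope>provided</scope>
    </dependency>
    ```

    Shading works because it copies the needed classes into the extension jar itself, at the cost of duplicating them.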