# docs-and-training

    Tarun Dhandharia

    06/26/2024, 11:20 AM
Hi Team, is there any document detailing how to run the Overlord as a standalone service? We are not able to find the right executable and config to run the Overlord. Or is there any config to run the Overlord on a different port than the Coordinator? Any help is highly appreciated. Thanks
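(There is no separate Overlord executable: each Druid service is started through the same entry point — `org.apache.druid.cli.Main server overlord`, or `bin/run-druid overlord <conf-dir>` in recent distributions — with its own config directory. A minimal runtime.properties sketch for a standalone Overlord; the path and port below are illustrative, not canonical:)

```properties
# conf/druid/cluster/master/overlord/runtime.properties (illustrative path)
druid.service=druid/overlord
druid.plaintextPort=8090

# If your Coordinator currently acts as the Overlord, disable that in the
# Coordinator's runtime.properties so the standalone Overlord takes over:
# druid.coordinator.asOverlord.enabled=false
```

Since the Overlord listens on `druid.plaintextPort`, giving it a value different from the Coordinator's port is enough to run the two services side by side.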

    Lionel Mena

    06/26/2024, 1:53 PM
Hello there, going through the Basic cluster tuning guide's Overlord/Coordinator sections, there is no mention of direct memory buffers. Does that mean it's safe to run without passing the flag
    -XX:MaxDirectMemorySize
    ?
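(The Coordinator and Overlord do no query processing, so they do not need the large direct-memory sizing that Brokers and Historicals require for processing buffers; the JVM still uses some direct memory, e.g. for network I/O, so a modest cap is commonly set anyway. A jvm.config sketch — all sizes are illustrative, not recommendations:)

```
# jvm.config for a Coordinator/Overlord (illustrative sizes)
-server
-Xms4g
-Xmx4g
-XX:MaxDirectMemorySize=512m
-XX:+ExitOnOutOfMemoryError
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```

The `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes` direct-memory formula from the tuning guide applies to the data/query services, not to the master services.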

    Charles Smith

    06/27/2024, 8:02 PM
    Hey everybody! Druid Summit is back and IN-PERSON this year – Tuesday, October 22, 2024 in Redwood City, California! This is your chance to join your fellow Druid community members, developers, and data professionals as they share experiences with real-time analytics solutions powered by the Apache Druid database – maybe even share your own! This year’s event will feature keynotes from Druid experts, educational sessions, interactive spaces, and real-world case studies of Druid in production. Want to speak at Druid Summit? Submit your 25 or 45 minute talk by August 5: https://bit.ly/45Qh4iv
    druid 1

    Tapajit Chandra Paul

    07/02/2024, 5:07 AM
Hello everyone, I wanted to know: if I want to back up Druid, what exactly do I need to back up? Deep storage, metadata storage, or the segment cache?

    Noor

    07/15/2024, 7:24 AM
Hello Druid family, I am looking for a way to remove duplicates from a datasource if it ingested duplicate rows. Is there any way we can achieve this in Druid (version 25)?
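(One approach available with SQL-based ingestion / MSQ, which Druid 25 supports, is to reindex the datasource on top of itself and deduplicate with `SELECT DISTINCT`. A sketch — the datasource name and partitioning are illustrative, and since this rewrites the data, test it on a non-production copy first:)

```sql
-- Rewrite the datasource without duplicate rows (SQL-based ingestion / MSQ).
REPLACE INTO "my_datasource" OVERWRITE ALL
SELECT DISTINCT *
FROM "my_datasource"
PARTITIONED BY DAY
```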

    Paweł Motyka

    07/22/2024, 11:19 AM
Hey, I'm seeking help with sketches in Druid. What's the proper way of calculating a sum of sketches in my datasource? So far I am doing:
    SELECT entity_id, SUM(HLL_SKETCH_ESTIMATE("hll_column")) FROM "test_hll_dataset" GROUP BY 1
But that is quite slow (it takes over 20s to calculate); is there a more Druid-native way to do this calculation?
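(Note that `SUM(HLL_SKETCH_ESTIMATE(...))` estimates each row's sketch and then sums those per-row numbers, whereas the usual sketch workflow merges the sketches per group and estimates once — which is both cheaper and gives a true distinct count across rows. The two queries therefore return different numbers. If the merged distinct count is what's wanted, a sketch of that form:)

```sql
-- Merge sketches per group, estimate once (distinct count across rows):
SELECT entity_id, APPROX_COUNT_DISTINCT_DS_HLL("hll_column")
FROM "test_hll_dataset"
GROUP BY 1
```

`HLL_SKETCH_ESTIMATE(DS_HLL("hll_column"))` is an equivalent form that keeps the intermediate merged sketch explicit.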

    Agata Klof

    07/29/2024, 7:48 AM
Hi everyone, I am wondering if there is an option in Druid to be notified when new data has been added to a datasource? I can't find any information in the docs, but maybe you have some knowledge?
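(Druid has no built-in push notification for new data; a common workaround is to poll the `sys.segments` metadata table and watch for new segments. A sketch — the polling cadence and which columns to track are up to you:)

```sql
-- Poll periodically and compare against the previous result:
SELECT "datasource", COUNT(*) AS segment_count, MAX("version") AS latest_version
FROM sys.segments
WHERE is_published = 1
GROUP BY "datasource"
```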

    Noor

    07/29/2024, 3:09 PM
Hi Druid family, if millions of incoming events are being thrown away or are unparseable, will Druid close the task with an unsuccessful status? If so, what is the threshold beyond which Druid will fail the task?
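(For streaming ingestion this threshold is governed by the supervisor's `tuningConfig`: once the number of parse exceptions exceeds `maxParseExceptions`, the task fails. A fragment sketch — the values below are illustrative, not defaults:)

```json
{
  "type": "kafka",
  "tuningConfig": {
    "type": "kafka",
    "logParseExceptions": true,
    "maxParseExceptions": 1000000,
    "maxSavedParseExceptions": 100
  }
}
```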

    Victoria Lim

    07/31/2024, 5:22 PM
    👋 Introducing the migration guides for Apache Druid, developed to help you understand the latest and greatest features, and how to migrate to them when there are breaking changes. See the migration guides for the following topics: • Multi-value dimensions to arrays • Front-coded dictionary encoding • Byte-based memory limits on subqueries • SQL compliant null handling Learn more at https://druid.apache.org/docs/latest/release-info/migration-guide
    👏 4

    Puneet Arora

    08/30/2024, 6:13 AM
    Hi Team, I am working on a tool to streamline and improve documentation contributions for open source projects. I would love to get in touch with someone maintaining docs for Druid and discuss with them. You could read more about the tool here: https://ekline.atlassian.net/wiki/external/MDJmNWI3MjBkMzE4NGQyMzhlYjkxNjUwNDVhYjBlOGQ

    Utkarsh Chaturvedi

    09/03/2024, 6:52 AM
    Hey Druids. My team has been working on lots of druid related features. We've written a medium blog on Query Level Monitoring via Request Logging. It allows us to set up individual query level monitoring. Do check it out!
    🙌 2

    Rahul Sharma

    09/04/2024, 10:24 AM
    🚀 Introducing MIDAS: The Intelligent Autoscaler for Druid’s MiddleManager! 🚀 Hey Druid community! 🌟 Ever felt the pain of wrestling with scaling Druid’s MiddleManager service on Kubernetes? I did too—until I built MIDAS (Middle Manager Intelligent Dynamic Autoscaler)💡 You may have tried using HPA(Horizontal Pod Scaling) to autoscale your MiddleManagers, you might have faced issues like scaling down even when tasks are running, or spinning up more MiddleManagers than necessary. 😅 But with MIDAS, you can autoscale your MiddleManagers without any ingestion failures, ensuring that only the resources required to fulfill ingestion requests are used.🕺🏻 Using MIDAS we have cut our Druid infrastructure costs by 30%🤘🏻. Now, we can scale Druid’s MiddleManagers without worrying about unexpected ingestion failures, and we’re paying only for the resources we actually use🥳 Curious about how MIDAS can turn your chaotic infrastructure into a well-oiled machine? Check out my latest article on Druid Autoscaling: Medium 🔗 The best part? MIDAS is open-sourced! 🎉 Feel free to explore, contribute, or just marvel at how it can revolutionize your Druid deployments. Let’s make Druid even more powerful together! 💪. Repository 🔗 Looking forward to your thoughts and feedback! Feel free to connect with me on LinkedIn 🔗 Cheers, Rahul
    🙌 5

    Evan Rusackas

    09/19/2024, 4:48 PM
Hey Druid folks! While not quite docs, per se, I don't see a more appropriate channel to ask, so here goes: a couple of us PMC members on the Apache Superset project have started a little podcast talking to folks from different DB teams. We'd love to have a PMC member or three from the Druid project over for a beer and some chat. DM me if interested.

    Marcos Maia

    09/22/2024, 10:52 AM
Hi, I am playing with the learn-druid repo (great stuff, BTW) and I am a bit confused about why port 8888 from the host keeps redirecting to the Jupyter notebook instead of the Druid console. When I look at the compose mappings I don't see anything specific that would cause that. Any ideas?

    Peter Marshall

    09/30/2024, 10:59 AM
    Hey all! Some notebook updates pushed to learn-druid this last week. Thanks @Hugh Evans @Charles Smith for the work on these 🙂 • Tiering • Async query on historical data • Async query on real-time data Thanks also to @Hugh Evans who's been working to spellcheck (!) and generally clean-up the notebooks behind-the-scenes.
    👏 1
    🙌 1

    Divit Bui

    10/06/2024, 3:32 PM
Hey all, I'm playing around with single-server Druid and was wondering whether it's possible, later down the road, to easily migrate to a multi-server Druid deployment with Kafka ingestion (or any streaming ingestion)? I'm assuming the multi-server deployment will just populate the Historicals or fetch from deep storage, and the ingestion tasks will read Kafka offsets from the DB?

    rishikesh

    11/04/2024, 9:54 AM
    Hey Druid community! 🌟 I'm looking to implement MM-less in Druid 30. Could anyone share relevant links or documentation to get started? Thanks!

    Stefanos Pliakos

    11/06/2024, 10:55 AM
Hey team! Can I ask what your take is on the Druid Operator? Is it mature enough for production use? I am thinking of migrating Druid from EC2 VM instances to the Druid Operator for better scalability and cost-cutting, as EC2 instance scaling is not so great. For example, having to scale from r7i.4xlarge to r7i.8xlarge at some point almost doubles the cost, while (I hope) with K8s and the Druid Operator we can just reserve a bit more memory. What is your experience, and what do you think about that?

    Zhenyu Yang

    11/19/2024, 12:44 AM
Hello everyone, I am a developer from China.
    👋 3

    Utkarsh Chaturvedi

    11/20/2024, 10:34 AM
Hi Druids! Hope you're all well. I had some questions about druid.processing.numThreads. We are seeing average CPU on the Historicals of around 10-20%, reaching a max of 30% during the day. druid.processing.numThreads is currently set to 20 on our 16-core, 128 GB box. The docs recommend keeping it at numCores - 1, but since we're so underutilised, I wanted to increase the number of threads to about 40. I wanted to know how others have configured this. If you have the following numbers, kindly comment below with a one-liner about your experience. This would specifically be for the case of running separate servers for Druid, not single-server deployments. Kindly add any of the following below: • CPU cores • druid.processing.numThreads • Average CPU utilisation • P99 query time • Any major upsides/downsides you experienced. Thank you for your consideration and time.
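(For reference, the relevant Historical settings sit together, and raising `druid.processing.numThreads` also raises the direct-memory requirement, since the tuning guide sizes it as `(numThreads + numMergeBuffers + 1) * buffer.sizeBytes`. A fragment sketch — the values are illustrative for a 16-core box, not recommendations:)

```properties
# Historical runtime.properties (illustrative values)
druid.processing.numThreads=15          # docs' guideline: cores - 1
druid.processing.numMergeBuffers=4
druid.processing.buffer.sizeBytes=500MiB
# (15 + 4 + 1) * 500MiB ≈ 10GiB of direct memory needed (-XX:MaxDirectMemorySize)
```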

    Hardik Bajaj

    12/03/2024, 12:23 PM
Hey Team! I have a question about TopN queries. If I do a TopN query with filters on a dimension to include only two values, A and B, and this dimension has high cardinality: does filtering to only A and B now guarantee that the answer will always be 100% accurate? Put differently, can I keep the topN threshold at 2 and still get no approximate results? So my question is: does TopN first read the top K dimension values and then filter them, or does filtering of dimension values from the dictionary happen first? I hope my question makes sense.

    Utkarsh Chaturvedi

    12/09/2024, 11:34 AM
Hi All! I was reading the Druid documentation for segment loading and mmap. The docs say: "The segment cache uses memory mapping. The cache consumes memory from the underlying operating system so Historicals can hold parts of segment files in memory to increase query performance at the data level", followed by "At query time, if the required part of a segment file is available in the memory-mapped cache or 'page cache', the Historical re-uses it and reads it directly from memory. If it is not in the memory-mapped cache, the Historical reads that part of the segment from disk. In this case, there is potential for new data to flush other segment data from memory." My understanding is that each segment is mmap-ed and a percentage of each segment is available in RAM (the Historical's address space). If the part of the segment required for a query is in this page cache, it gets used; if not, the newly accessed data replaces older page-cache entries for the same segment. Is this a correct understanding? Does this mean the percentage of each segment cached via mmap is approximately equal across segments? Does a frequently accessed segment not get an increased mmap-ed percentage? And by increasing RAM, does the cacheable percentage of each segment simply increase, making requests faster?

    Peter Marshall

    12/16/2024, 7:59 AM
    Resharing here :)
    👍 1

    Seki Inoue

    12/18/2024, 2:01 AM
Hello team, do you still not recommend enabling
dropExisting
in the compaction config? Can I hear about your experience with this option? I want to save on our S3 cost, but I hesitate to enable it because I don't want to lose any data.
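(For reference, `dropExisting` lives in the compaction task's — or auto-compaction's — `ioConfig`; when true, segments in the compacted interval that are not covered by the new segment set are marked unused rather than left active. A fragment sketch, with an illustrative datasource name:)

```json
{
  "dataSource": "my_datasource",
  "ioConfig": {
    "type": "compact",
    "dropExisting": true
  }
}
```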

    Noor

    01/20/2025, 10:27 AM
Hi team, is there any way to set the query timeout at the supervisor level? I need to do this per supervisor.

    David Adams

    02/06/2025, 10:31 PM
    Hey team, it seems like there may be some incorrect examples in the Druid documentation for handling SQL queries against Multi Value Dimensions. Doc reference here. Based on the documentation, assuming a dataset like this:
    {"timestamp": "2011-01-12T00:00:00.000Z", "label": "row1", "tags": ["t1","t2","t3"]}
    {"timestamp": "2011-01-13T00:00:00.000Z", "label": "row2", "tags": ["t3","t4","t5"]}
    {"timestamp": "2011-01-14T00:00:00.000Z", "label": "row3", "tags": ["t5","t6","t7"]}
    {"timestamp": "2011-01-14T00:00:00.000Z", "label": "row4", "tags": []}
    A query of
    SELECT label, tags
    FROM "mvd_example_rollup"
    WHERE tags = 't3'
    GROUP BY 1,2
    should return an exploded view of all tags within each MVD that got matched (filter applied pre-explosion):
    {"label":"row1","tags":"t1"}
    {"label":"row1","tags":"t2"}
    {"label":"row1","tags":"t3"}
    {"label":"row2","tags":"t3"}
    {"label":"row2","tags":"t4"}
    {"label":"row2","tags":"t5"}
    However, in my V30 instance, the behavior I see is instead returning a result that suggests a post-explosion filter. The results I get returned are effectively as follows:
    {"label":"row1","tags":"t3"}
    {"label":"row2","tags":"t3"}
    Are any gurus able to decipher if this is a simple documentation issue, or is my cluster behaving in an unintended way?
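(One thing worth testing while the docs question is sorted out: an equality filter on a grouped multi-value column may be pushed into the grouping by the SQL planner, yielding the post-explosion result seen here, whereas `MV_CONTAINS` filters at the row level before GROUP BY explodes the array. A hedged sketch:)

```sql
-- Row-level filter (keeps whole rows whose "tags" array contains 't3'),
-- then GROUP BY explodes every tag of the matched rows:
SELECT label, tags
FROM "mvd_example_rollup"
WHERE MV_CONTAINS(tags, 't3')
GROUP BY 1, 2
```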

    Andrea Licata

    02/21/2025, 5:43 PM
    Hi, I was trying to use the configuration parameter "druid.server.loadSegmentsOnStartup" to avoid loading all segments from deep storage at every historical startup. I noticed that it is not documented, but I found this method on GitHub: https://github.com/apache/druid/blob/9df92230979342d6179c6f8ba94ee9efbfb4983c/server/src/main/java/org/apache/druid/server/coordination/SegmentBootstrapper.java#L172 Does anyone have experience with this?

    Sivakumar Karthikesan

    03/26/2025, 6:37 PM
Hi Team, I'm setting up the Druid Prometheus emitter. Does anyone have a sample JSON for a Grafana dashboard and the expressions?
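(For reference, the emitter itself is configured through runtime properties once the `prometheus-emitter` extension is loaded; a fragment sketch — the port and strategy are illustrative:)

```properties
druid.extensions.loadList=["prometheus-emitter"]
druid.emitter=prometheus
druid.emitter.prometheus.strategy=exporter   # or "pushgateway"
druid.emitter.prometheus.port=9091           # scrape endpoint for the "exporter" strategy
```

A Grafana dashboard is then just PromQL over the scraped metrics.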

    Dinesh

    03/28/2025, 10:45 AM
    @Sivakumar Karthikesan try this dashboard.
    Pods_New-1707999228368.json

    akshat

    04/25/2025, 7:02 AM
    🚀 Thinking of going nanoservices in production? It’s not just about smaller services—state management & orchestration are your real battlegrounds ⚔️ 🔍 Real-world challenges 🧠 Battle-tested solutions 🛠️ YAML workflows & rollback patterns 👇 Dive in: https://medium.com/@akshat111111/b4d6c5925b1d