# docs-and-training

    Tarun Dhandharia

    06/26/2024, 11:20 AM
Hi Team, is there any document detailing how to run the Overlord as a standalone service? We are not able to find the right executable and config to run the Overlord. Or is there any config to run the Overlord on a different port than the Coordinator? Any help is highly appreciated. Thanks
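(There is no separate Overlord executable: each Druid service is started through the same entry point — `org.apache.druid.cli.Main server overlord`, or `bin/run-druid overlord <conf-dir>` in recent distributions — with its own config directory. A minimal runtime.properties sketch for a standalone Overlord; the path and port below are illustrative, not canonical:)

```properties
# conf/druid/cluster/master/overlord/runtime.properties (illustrative path)
druid.service=druid/overlord
druid.plaintextPort=8090

# If your Coordinator currently acts as the Overlord, disable that in the
# Coordinator's runtime.properties so the standalone Overlord takes over:
# druid.coordinator.asOverlord.enabled=false
```

Since the Overlord listens on `druid.plaintextPort`, giving it a value different from the Coordinator's port is enough to run the two services side by side.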

    Lionel Mena

    06/26/2024, 1:53 PM
Hello there, going through the Basic cluster tuning guide's Overlord/Coordinator sections, there is no mention of direct memory buffers. Does that mean it's safe to run without passing the flag
    -XX:MaxDirectMemorySize
    ?
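(The Coordinator and Overlord do no query processing, so they do not need the large direct-memory sizing that Brokers and Historicals require for processing buffers; the JVM still uses some direct memory, e.g. for network I/O, so a modest cap is commonly set anyway. A jvm.config sketch — all sizes are illustrative, not recommendations:)

```
# jvm.config for a Coordinator/Overlord (illustrative sizes)
-server
-Xms4g
-Xmx4g
-XX:MaxDirectMemorySize=512m
-XX:+ExitOnOutOfMemoryError
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```

The `(druid.processing.numThreads + druid.processing.numMergeBuffers + 1) * druid.processing.buffer.sizeBytes` direct-memory formula from the tuning guide applies to the data/query services, not to the master services.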

    Charles Smith

    06/27/2024, 8:02 PM
    Hey everybody! Druid Summit is back and IN-PERSON this year – Tuesday, October 22, 2024 in Redwood City, California! This is your chance to join your fellow Druid community members, developers, and data professionals as they share experiences with real-time analytics solutions powered by the Apache Druid database – maybe even share your own! This year’s event will feature keynotes from Druid experts, educational sessions, interactive spaces, and real-world case studies of Druid in production. Want to speak at Druid Summit? Submit your 25 or 45 minute talk by August 5: https://bit.ly/45Qh4iv
    druid 1

    Tapajit Chandra Paul

    07/02/2024, 5:07 AM
Hello everyone, I wanted to know: if I want to back up Druid, what exactly do I need to back up? Deep storage, metadata storage, or the segment cache?

    Noor

    07/15/2024, 7:24 AM
Hello Druid family, I am looking for a way to remove duplicates from a datasource if it ingested duplicate rows. Is there any way we can achieve this in Druid (version 25)?
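(One approach available with SQL-based ingestion / MSQ, which Druid 25 supports, is to reindex the datasource on top of itself and deduplicate with `SELECT DISTINCT`. A sketch — the datasource name and partitioning are illustrative, and since this rewrites the data, test it on a non-production copy first:)

```sql
-- Rewrite the datasource without duplicate rows (SQL-based ingestion / MSQ).
REPLACE INTO "my_datasource" OVERWRITE ALL
SELECT DISTINCT *
FROM "my_datasource"
PARTITIONED BY DAY
```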

    Paweł Motyka

    07/22/2024, 11:19 AM
Hey, I'm seeking help with sketches in Druid. What's the proper way of calculating a sum of sketches in my datasource? So far I am doing:
    SELECT entity_id, SUM(HLL_SKETCH_ESTIMATE("hll_column")) FROM "test_hll_dataset" GROUP BY 1
But that is quite slow (it takes over 20s to calculate); is there a more Druid-native way to do this calculation?
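(Note that `SUM(HLL_SKETCH_ESTIMATE(...))` estimates each row's sketch and then sums those per-row numbers, whereas the usual sketch workflow merges the sketches per group and estimates once — which is both cheaper and gives a true distinct count across rows. The two queries therefore return different numbers. If the merged distinct count is what's wanted, a sketch of that form:)

```sql
-- Merge sketches per group, estimate once (distinct count across rows):
SELECT entity_id, APPROX_COUNT_DISTINCT_DS_HLL("hll_column")
FROM "test_hll_dataset"
GROUP BY 1
```

`HLL_SKETCH_ESTIMATE(DS_HLL("hll_column"))` is an equivalent form that keeps the intermediate merged sketch explicit.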

    Agata Klof

    07/29/2024, 7:48 AM
Hi everyone, I am wondering if there is an option in Druid to be notified when new data has been added to a datasource? I can't find any information in the docs, but maybe you have some knowledge?
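(Druid has no built-in push notification for new data; a common workaround is to poll the `sys.segments` metadata table and watch for new segments. A sketch — the polling cadence and which columns to track are up to you:)

```sql
-- Poll periodically and compare against the previous result:
SELECT "datasource", COUNT(*) AS segment_count, MAX("version") AS latest_version
FROM sys.segments
WHERE is_published = 1
GROUP BY "datasource"
```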

    Noor

    07/29/2024, 3:09 PM
Hi Druid family, if millions of incoming events are being thrown away or are unparseable, will Druid close the task with an unsuccessful status? If so, what is the threshold beyond which Druid will fail the task?
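(For streaming ingestion this threshold is governed by the supervisor's `tuningConfig`: once the number of parse exceptions exceeds `maxParseExceptions`, the task fails. A fragment sketch — the values below are illustrative, not defaults:)

```json
{
  "type": "kafka",
  "tuningConfig": {
    "type": "kafka",
    "logParseExceptions": true,
    "maxParseExceptions": 1000000,
    "maxSavedParseExceptions": 100
  }
}
```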

    Victoria Lim

    07/31/2024, 5:22 PM
    👋 Introducing the migration guides for Apache Druid, developed to help you understand the latest and greatest features, and how to migrate to them when there are breaking changes. See the migration guides for the following topics: • Multi-value dimensions to arrays • Front-coded dictionary encoding • Byte-based memory limits on subqueries • SQL compliant null handling Learn more at https://druid.apache.org/docs/latest/release-info/migration-guide
    👏 4

    Puneet Arora

    08/30/2024, 6:13 AM
    Hi Team, I am working on a tool to streamline and improve documentation contributions for open source projects. I would love to get in touch with someone maintaining docs for Druid and discuss with them. You could read more about the tool here: https://ekline.atlassian.net/wiki/external/MDJmNWI3MjBkMzE4NGQyMzhlYjkxNjUwNDVhYjBlOGQ

    Utkarsh Chaturvedi

    09/03/2024, 6:52 AM
    Hey Druids. My team has been working on lots of druid related features. We've written a medium blog on Query Level Monitoring via Request Logging. It allows us to set up individual query level monitoring. Do check it out!
    🙌 2

    Rahul Sharma

    09/04/2024, 10:24 AM
    🚀 Introducing MIDAS: The Intelligent Autoscaler for Druid’s MiddleManager! 🚀 Hey Druid community! 🌟 Ever felt the pain of wrestling with scaling Druid’s MiddleManager service on Kubernetes? I did too—until I built MIDAS (Middle Manager Intelligent Dynamic Autoscaler)💡 You may have tried using HPA(Horizontal Pod Scaling) to autoscale your MiddleManagers, you might have faced issues like scaling down even when tasks are running, or spinning up more MiddleManagers than necessary. 😅 But with MIDAS, you can autoscale your MiddleManagers without any ingestion failures, ensuring that only the resources required to fulfill ingestion requests are used.🕺🏻 Using MIDAS we have cut our Druid infrastructure costs by 30%🤘🏻. Now, we can scale Druid’s MiddleManagers without worrying about unexpected ingestion failures, and we’re paying only for the resources we actually use🥳 Curious about how MIDAS can turn your chaotic infrastructure into a well-oiled machine? Check out my latest article on Druid Autoscaling: Medium 🔗 The best part? MIDAS is open-sourced! 🎉 Feel free to explore, contribute, or just marvel at how it can revolutionize your Druid deployments. Let’s make Druid even more powerful together! 💪. Repository 🔗 Looking forward to your thoughts and feedback! Feel free to connect with me on LinkedIn 🔗 Cheers, Rahul
    🙌 5

    Evan Rusackas

    09/19/2024, 4:48 PM
Hey Druid folks! While not quite docs, per se, I don't see a more appropriate channel to ask, so here goes: a couple of us PMC members on the Apache Superset project have started a little podcast talking to folks from different DB teams. We'd love to have a PMC member or three from the Druid project over for a beer and some chat. DM me if interested.

    Marcos Maia

    09/22/2024, 10:52 AM
Hi, I am playing with the learn-druid repo (great stuff, BTW) and I am a bit confused about why port 8888 from the host keeps redirecting to the Jupyter notebook instead of the Druid console. When I look at the compose mappings I don't see anything specific that would cause that. Any ideas?

    Peter Marshall

    09/30/2024, 10:59 AM
    Hey all! Some notebook updates pushed to learn-druid this last week. Thanks @Hugh Evans @Charles Smith for the work on these 🙂 • Tiering • Async query on historical data • Async query on real-time data Thanks also to @Hugh Evans who's been working to spellcheck (!) and generally clean-up the notebooks behind-the-scenes.
    👏 1
    🙌 1

    Divit Bui

    10/06/2024, 3:32 PM
Hey all, I'm playing around with single-server Druid and was wondering whether it's possible, later down the road, to easily migrate to a multi-server Druid deployment with Kafka ingestion (or any streaming ingestion)? I'm assuming the multi-server deployment will just populate the Historicals or fetch from deep storage, and the ingestion tasks will read Kafka offsets from the DB?

    rishikesh

    11/04/2024, 9:54 AM
    Hey Druid community! 🌟 I'm looking to implement MM-less in Druid 30. Could anyone share relevant links or documentation to get started? Thanks!

    Stefanos Pliakos

    11/06/2024, 10:55 AM
Hey team! Can I ask what your take is on the Druid Operator? Is it mature enough for production use? I am thinking of migrating Druid from EC2 VM instances to the Druid Operator for better scalability and cost-cutting, as EC2 instance scaling is not so great. For example, having to scale from r7i.4xlarge to r7i.8xlarge at some point almost doubles the cost, while (I hope) with K8s and the Druid Operator we can just reserve a bit more memory. What is your experience, and what do you think about that?

    Zhenyu Yang

    11/19/2024, 12:44 AM
Hello everyone, I am a developer from China.
    👋 3

    Utkarsh Chaturvedi

    11/20/2024, 10:34 AM
Hi Druids! Hope you're all well. I had some questions about druid.processing.numThreads. We are seeing average CPU on the Historicals of around 10-20%, reaching a max of 30% during the day. druid.processing.numThreads is currently set to 20 on our 16-core, 128 GB box. The docs recommend keeping it at numCores - 1, but since we're so underutilised, I wanted to increase the number of threads to about 40. I wanted to know how others have configured this. If you have the following numbers, kindly comment below with a one-liner about your experience. This would specifically be for the case of running separate servers for Druid, not single-server deployments. Kindly add any of the following below: • CPU cores • druid.processing.numThreads • Average CPU utilisation • P99 query time • Any major upsides/downsides you experienced. Thank you for your consideration and time.
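(For reference, the relevant Historical settings sit together, and raising `druid.processing.numThreads` also raises the direct-memory requirement, since the tuning guide sizes it as `(numThreads + numMergeBuffers + 1) * buffer.sizeBytes`. A fragment sketch — the values are illustrative for a 16-core box, not recommendations:)

```properties
# Historical runtime.properties (illustrative values)
druid.processing.numThreads=15          # docs' guideline: cores - 1
druid.processing.numMergeBuffers=4
druid.processing.buffer.sizeBytes=500MiB
# (15 + 4 + 1) * 500MiB ≈ 10GiB of direct memory needed (-XX:MaxDirectMemorySize)
```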

    Hardik Bajaj

    12/03/2024, 12:23 PM
Hey Team! I have a question about TopN queries. If I do a TopN query with filters on a dimension to include only two values, A and B, and this dimension has high cardinality: does filtering to only A and B now guarantee that the answer will always be 100% accurate? Put differently, can I keep the topN threshold at 2 and still get no approximate results? So my question is: does TopN first read the top K dimension values and then filter them, or does filtering of dimension values from the dictionary happen first? I hope my question makes sense.

    Utkarsh Chaturvedi

    12/09/2024, 11:34 AM
Hi All! I was reading the Druid documentation for segment loading and mmap. The docs say: "The segment cache uses memory mapping. The cache consumes memory from the underlying operating system so Historicals can hold parts of segment files in memory to increase query performance at the data level", followed by "At query time, if the required part of a segment file is available in the memory-mapped cache or 'page cache', the Historical re-uses it and reads it directly from memory. If it is not in the memory-mapped cache, the Historical reads that part of the segment from disk. In this case, there is potential for new data to flush other segment data from memory." My understanding is that each segment is mmap-ed and a percentage of each segment is available in RAM (the Historical's address space). If the part of the segment required for a query is in this page cache, it gets used; if not, the newly accessed data replaces older page-cache entries for the same segment. Is this a correct understanding? Does this mean the percentage of each segment cached via mmap is approximately equal across segments? Does a frequently accessed segment not get an increased mmap-ed percentage? And by increasing RAM, does the cacheable percentage of each segment simply increase, making requests faster?

    Peter Marshall

    12/16/2024, 7:59 AM
    Resharing here :)
    👍 1

    Seki Inoue

    12/18/2024, 2:01 AM
Hello team, do you still not recommend enabling
dropExisting
in the compaction config? Can I hear about your experience with this option? I want to save on our S3 cost, but I hesitate to enable it because I don't want to lose any data.
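(For reference, `dropExisting` lives in the compaction task's — or auto-compaction's — `ioConfig`; when true, segments in the compacted interval that are not covered by the new segment set are marked unused rather than left active. A fragment sketch, with an illustrative datasource name:)

```json
{
  "dataSource": "my_datasource",
  "ioConfig": {
    "type": "compact",
    "dropExisting": true
  }
}
```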

    Noor

    01/20/2025, 10:27 AM
Hi team, is there any way to set the query timeout at the supervisor level? I need to do this per supervisor.

    David Adams

    02/06/2025, 10:31 PM
    Hey team, it seems like there may be some incorrect examples in the Druid documentation for handling SQL queries against Multi Value Dimensions. Doc reference here. Based on the documentation, assuming a dataset like this:
    {"timestamp": "2011-01-12T00:00:00.000Z", "label": "row1", "tags": ["t1","t2","t3"]}
    {"timestamp": "2011-01-13T00:00:00.000Z", "label": "row2", "tags": ["t3","t4","t5"]}
    {"timestamp": "2011-01-14T00:00:00.000Z", "label": "row3", "tags": ["t5","t6","t7"]}
    {"timestamp": "2011-01-14T00:00:00.000Z", "label": "row4", "tags": []}
    A query of
    SELECT label, tags
    FROM "mvd_example_rollup"
    WHERE tags = 't3'
    GROUP BY 1,2
    should return an exploded view of all tags within each MVD that got matched (filter applied pre-explosion):
    {"label":"row1","tags":"t1"}
    {"label":"row1","tags":"t2"}
    {"label":"row1","tags":"t3"}
    {"label":"row2","tags":"t3"}
    {"label":"row2","tags":"t4"}
    {"label":"row2","tags":"t5"}
    However, in my V30 instance, the behavior I see is instead returning a result that suggests a post-explosion filter. The results I get returned are effectively as follows:
    {"label":"row1","tags":"t3"}
    {"label":"row2","tags":"t3"}
    Are any gurus able to decipher if this is a simple documentation issue, or is my cluster behaving in an unintended way?
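(One thing worth testing while the docs question is sorted out: an equality filter on a grouped multi-value column may be pushed into the grouping by the SQL planner, yielding the post-explosion result seen here, whereas `MV_CONTAINS` filters at the row level before GROUP BY explodes the array. A hedged sketch:)

```sql
-- Row-level filter (keeps whole rows whose "tags" array contains 't3'),
-- then GROUP BY explodes every tag of the matched rows:
SELECT label, tags
FROM "mvd_example_rollup"
WHERE MV_CONTAINS(tags, 't3')
GROUP BY 1, 2
```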

    Andrea Licata

    02/21/2025, 5:43 PM
    Hi, I was trying to use the configuration parameter "druid.server.loadSegmentsOnStartup" to avoid loading all segments from deep storage at every historical startup. I noticed that it is not documented, but I found this method on GitHub: https://github.com/apache/druid/blob/9df92230979342d6179c6f8ba94ee9efbfb4983c/server/src/main/java/org/apache/druid/server/coordination/SegmentBootstrapper.java#L172 Does anyone have experience with this?

    Sivakumar Karthikesan

    03/26/2025, 6:37 PM
Hi Team, I'm setting up the Druid Prometheus emitter. Does anyone have a sample JSON for a Grafana dashboard and the expressions?
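(For reference, the emitter itself is configured through runtime properties once the `prometheus-emitter` extension is loaded; a fragment sketch — the port and strategy are illustrative:)

```properties
druid.extensions.loadList=["prometheus-emitter"]
druid.emitter=prometheus
druid.emitter.prometheus.strategy=exporter   # or "pushgateway"
druid.emitter.prometheus.port=9091           # scrape endpoint for the "exporter" strategy
```

A Grafana dashboard is then just PromQL over the scraped metrics.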

    Dinesh

    03/28/2025, 10:45 AM
    @Sivakumar Karthikesan try this dashboard.
    Pods_New-1707999228368.json

    akshat

    04/25/2025, 7:02 AM
    🚀 Thinking of going nanoservices in production? It’s not just about smaller services—state management & orchestration are your real battlegrounds ⚔️ 🔍 Real-world challenges 🧠 Battle-tested solutions 🛠️ YAML workflows & rollback patterns 👇 Dive in: https://medium.com/@akshat111111/b4d6c5925b1d