Apache Druid

I have a issue when I load all segments, the historicals crash due to low-memory
I don't have an ideia of how much memory each segments needs, usually I'm running with about ~ 5k segments in 3 servers, this runs fine with 512mb XMS, If i change to 3GB I can load 12k without crashing, but I have no ideia what I should set to be safe

I think MQS have a plan to run queryes direcly on the S3 right? is it ready? I just need some ad-hoc issues

also, It would be nice to be able to run compaction jobs on the unloaded segments - this would be great for compacting late messages that arrive many days after the retention

See the threads just above.  The metadata database might have the numbers you want, about used and unused segments.

Could you provide examples of your queries? If you are doing count distinct, data sketches on `s_id` may be useful

is s_id always a string? can you re-import as string using the index enable and make sure the query planner knows about it eg:
`cast (s_id as varchar) IN (?, ...`) ` for example

<@U04CGQ1P68Z> nothing fancy, simple selects and aggregations

<@U038XPQFJHM> That’s not done automatically? Yes s_id is always a string!

<@U032H7AUY74> what you are referencing to? if you don't defined any column on the ingestion, yes, it's the default for dimensions be imported as string with the bitmap index created `createBitmapIndex` default is true for string, long, float, json and double it's false bc id do not support any indexing aside the segment/time

but what I discovery is that is best to help calcite when writing the queries to cast the dimension to string, just to not depend on the heuristics of planning.

image.png

yes, in the `Pares data` step of the data loader you want to set `Use json node reader` to `True`

if you are using the API then it goes into the `inputFormat` :

This feature was only added Druid 25 (I believe)

Thanks for answering!

So I upgraded to Druid 25, but then, I don't get to the dialog from your first screenshot, because already in the 'connect' tab I get this message:
This data looks like regular JSON object. For Druid to parse a text file it must have one row per event. Maybe look at <http://ndjson.org/|newline delimited JSON> instead.

This is actually a bug in the UI wizard. It is fixed here: <https://github.com/apache/druid/pull/13709> that warning is wrong (it should appear) and it will prevent you from using the rest of the wizard. You have to make the spec by hand and submit it directly until the fix above is merged and released. If you really wanted to use the wizard you would have to pull out a sample of events from your stream. reformat them to be single line json objects, put them in a new kafka topic, run the wizard on that new topic, grab the spec at the end, edit the topic to the original topic, and submit it by hand. So sorry about this!

Okay, that makes sense - Thank you :slightly_smiling_face:

Hi Keren - have you had a play with the `useSharedLock` context parameter introduced <https://github.com/apache/druid/pull/12041|here> - it might be of some use

Thank you Kyle! I was not aware of this parameter, it’s not documented anywhere. I will try it out.

Hey <@U030MBK46BD>
It seems that this usage still is limited to tasks from the same groupID, from what I see in <http://indexing-service/src/main/java/org/apache/druid/indexing/overlord/TaskLockbox.java|TaskLockbox>. I tried playing with the new parameter but the tasks do not seem to be sharing the lock.
My need was to have a native ingestion (append) task and a kafka task share a lock to the same datasource.
Any suggestions?

That's a little hard to answer :stuck_out_tongue:
It is highly probable that some part of the code uses b-tree for some operations
(in the classic query runner) druid segment processing usually uses bitmap indexes first
then the broker may or may not use different approaches to merge the result

<https://youtu.be/f-LLTle-Xug?t=1445> there's a sketch of the algorithm here

I'm looking into it. I'll let you know when access is restored.

<@U030K4UM3H7> if possible can you also provide me the access to this page, i Can't access either

Access has been restored, sorry for the hiccup :smile:

As per <https://druid.apache.org/docs/latest/development/extensions-core/kinesis-ingestion.html>

fetchThreads means

`Size of the pool of threads fetching data from Kinesis. There is no benefit in having more threads than Kinesis shards. Default is (cores -1)

Thanks <@U04CGQ1P68Z> <@U0319HD8HEC> so is the "Size of the pool of threads"  referenced in Brandon's commend defined as "fetchThreads * taskcount" or is it just "fetchThreads"? Basically im trying to see if I should set fetchThreads to = my number of kinesis shards or my number of kinesis shards / # of kinesis tasks

This is what I could gather:
````taskCount` determines the number of `taskGroup`s each of which consumes a disjoint set of kinesis shards. `fetchThreads` shouldn't be greater than the number of shards assigned to a taskGroup. For example, given 20 shards and 10 `taskCount`, each taskGroup will be assigned 2 shards and thus `fetchThreads` should be set to 2.```

Do check out this summit recording: <https://druidsummit.org/#wistia-pt7z1v1bza>

Thanks <@U04CGQ1P68Z>
Will take a look on this

This is something that works in production:
<https://github.com/deep-bi/spark-druid-segment-reader>

You can build an ETL by reading Druid segments in Spark, transform it there and load into your destination tool.

You can download a ready to use .jar from here:
<https://github.com/deep-bi/spark-druid-segment-reader/releases>

Thanks <@U032X0RLS65>, will look into it :pray:
Meanwhile I see following limitation that is not suits our needs
```Only segments with daily granularity are supported.```

Thanks <@U04B91YT1BP> for letting us know. We’re working on that. You can raise an issue on Github if you’d like.

<@U04B91YT1BP> FYI, we just added a new feature that removes that limitation:
<https://github.com/deep-bi/spark-druid-segment-reader/pull/4>

Hi <@U04L8SH8QQP> please have a look at the relative error rate for theta sketches here: <https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html>. As you can see, depending on the specified k value, the error rate varies. This link also has more information on accuracy: <https://datasketches.apache.org/docs/Theta/ThetaAccuracy.html>.

The default size for theta sketch is `16384`  that has a relative error rate of 2.344% for 99.73% data. If you increase the K value the error rate goes down.

Ahh want to try decrease error rate but 1% ~2%  is maximum...
Result range can be 1million ~ 2million so this percent is not affordable

<@U04KRGPGWAU> The default size for theta sketch is `16384`  that has a relative error rate of `2.344%` for `99.73%` data. As can be seen in this document: <https://datasketches.apache.org/docs/Theta/ThetaErrorTable.html> if you increase the size to for example `1048576`  the relative error rate of `99.73%` will go down to `0.293%`

You can increase the size further and the relate error rate will also get better.

Auto compaction uses lower priority locks such that your ingestion will overrule the lock and auto-compaction job will be canceled if they collide.
There is no auto-scheduling of SQL-based ingestion in Druid at the moment, but cron-based should be fine.

I'm curious, how often are you running those batches? Average data size per batch?

We have segment granularity of 1 hour,  intermediate handoff set as 15 mins.
Idea is that we run SQL ingestion every 1hour to replace 1 hour old data
issue with this approach is potentially SQL ingestion might run on the same time segments compaction job trying to compact. From the <https://druid.apache.org/docs/latest/multi-stage-query/concepts.html#overwrite-data-with-replace|doc>
&gt; REPLACE statements acquire an exclusive write lock to the target time range of the target datasource. No other ingestion or compaction operations may proceed for that time range while the task is running.
would that mean compaction waits while SQL ingestion is running on a segment?.

As the time range should align with the PARTITIONED BY clause, It means that we can only REPLACE one hour time range. Our one hour data would be 15mil rows and each row would be 100bytes, that is ~1.5 to 2 GB data per batch.

The compaction job that collides with the REPLACE will be canceled, at the next round of compaction it will try again. 2GB per batch sounds not be a problem with appropriate resources.
Just to understand your approach. You are ingesting from a stream and then following it up with batches? why are the batches needed?

We get the partial records for a dimension "tags" which we need it to be aggregated into one value. My understanding is that compaction wouldn't rollup dimensions, Is that correct. Is there any other approach to aggregate dimensions after ingesting
```dimenstions(_time, id, tags) metrics( duration)

{"timestamp": "2011-01-12T00:00:00.000Z", "abc", ["t1","t2","t3"],  1}  #row1
{"timestamp": "2011-01-12T00:00:00.000Z", "abc", ["t3","t4","t5"], 2}  #row2
{"timestamp": "2011-01-12T00:00:00.000Z", "abc", ["t5","t6","t7"], 3}  #row3
{"timestamp": "2011-01-12T00:00:00.000Z", "abc", [], 4}                #row4

we're expecting 
_time, id, tags (union), duration(max)
{"timestamp": "2011-01-12T00:00:00.000Z",  "abc",  ["t1","t2","t3", "t4","t5","t6","t7"], 4}```


Yes, you can aggregate the tags at ingestion, here's an example using SQL based ingestion with some inline data similar to yours. The data:
```{"timestamp": "2011-01-12T00:00:00.000Z", "trip_types": ["walk","drive","fly"], "duration":6}
{"timestamp": "2011-01-12T00:00:00.000Z", "trip_types": ["walk","bus","walk"], "duration":53}
{"timestamp": "2011-01-14T00:00:00.000Z", "trip_types": ["drive","fly","drive"], "duration":60}
{"timestamp": "2011-01-14T00:00:00.000Z", "trip_types": ["jump"], "duration":0}   ```
The SQL based ingestion query, you'll need to set "Enable Group By multi-value unnesting" to True in the query context:
```REPLACE INTO "inline_data" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"trip_types\": [\"walk\",\"drive\",\"fly\"], \"duration\":6} \n{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"trip_types\": [\"walk\",\"bus\",\"walk\"], \"duration\":53} \n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"trip_types\": [\"drive\",\"fly\",\"drive\"], \"duration\":60}  \n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"trip_types\": [\"jump\"], \"duration\":0}    "}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"trip_types","type":"string"},{"name":"duration","type":"long"}]'
  )
))
SELECT
  TIME_PARSE("timestamp") AS "__time",
  ARRAY_CONCAT_AGG( MV_TO_ARRAY("trip_types")) AS "trip_types",
  SUM("duration") "duration"
FROM "ext"
GROUP BY 1
PARTITIONED BY DAY```
The table ends up with:
```__time                   trip_types                                 duration 
2011-01-12T00:00:00.000Z ["walk","drive","fly","walk","bus","walk"] 59
2011-01-14T00:00:00.000Z ["drive","fly","drive","jump"]             60```

We ingest real time from Kafka. Is there a way to ingest realtime using SQL based ingestion?

No, not yet, I'm hoping that gets added. I think the equivalent operation might exist in real-time ingestion using rollup. I'll take a look.

<@U03BT5HU5BL> Hey, I have not forgotten about this. I tried to get this to work with real-time ingestion, using an expression in the metrics spec to do the aggregation of the arrays at rollup. It did not work, but this sounds like an interesting use case for streaming. It reported an error once the ingestion task received any message that it does not support ARRAY&lt;STRING&gt; datatype for this operation. Do you mind <https://github.com/apache/druid/issues|creating an issue> with array aggregation use case for streaming ingestion and I'll add comments on my tests?

Thanks, Sure I will and update after the bug is created

Yes. That shouldn't be a problem, you just need to change your retention rules and the cluster default retention rule to reflect the custom tier names,

We're having Apache Druid Office Hours at 10AM PST. Join me for this informal conversation where we talk with users about their use cases and provide troubleshooting advice. <https://www.meetup.com/druidio/events/290102175?utm_medium=referral&amp;utm_campaign=share-btn_savedevents_share_modal&amp;utm_source=link>

We are back with the druid drop in APAC. Please do join if you want to chat about druid,, use cases and trouble shooting <https://www.meetup.com/hyderabad-apache-druid-meetup-by-imply/events/291133413?utm_medium=referral&amp;utm_campaign=share-btn_savedevents_share_modal&amp;utm_source=link|https://www.meetup.com/hyderabad-apache-druid-meetup-by-imply/events/291133413?utm_m[…]tm_campaign=share-btn_savedevents_share_modal&amp;utm_source=link>

I'm not sure if its officially recommended anywhere, but we've been running druid on graviton 2 instance types for about a year and half in production. it works quite well

<https://aws.amazon.com/solutions/case-studies/zomato-case-study/?trk=public_post_main-feed-card_reshare_feed-article-content>

Sounds really cool. Perhaps you should also post this on the <#C030CMF6B70|> channel.

Sounds awesome. Once you put the Java code together and have the supervisor API going, I can help guide you in adding support for Pravega in the web-console

Nice. you should start looking at classes in druid-kafka-indexing-service. There are two pieces in the streaming ingestion
• supervisor - this is like a controller. if you are coming from yarn, this is like an application manager. if you are coming from spark, this is like a long-running driver. You would have to extend `SeekableStreamSupervisor` class for your own extension and override the methods depending on how pravaga works. 
• index task - This is the task that does the heavylifting of processing messages and publishing segments. The class that you need to extend is SeekableStreamIndexTask. 
you should look at these abstract classes first. These classes has certain concepts such as offsets, partitions, task assignment. If pravaga is a system similar to kafka from an API perspective, then it would be easier to build this extension.

<https://github.com/apache/druid/pull/11223/files> - someone tried to do it for pulsar as well

<@U038YUVK9L5>
Thank you for the tips! We will be sure to look at those classes and try to understand them.

So far, we our work in PravegaEventSupplier.java, this was originally named KafkaRecordSupplier.java. We started our work here because there is a method named poll() which grabbed records from Kafka. We modified it to instead grab events from Pravega. The events are placed into a `List&lt;OrderedPartitionableRecord&lt;Integer, Long, KafkaRecordEntity&gt;&gt;` We modified this list by making it support ByteEntity objects instead because Pravega events can be represented in bytes. We've been following this helpful page highlighting the pravega api if you're interested!

<https://cncf.pravega.io/docs/latest/javadoc/clients/index.html>

Yes. You should be able to specify the <https://druid.apache.org/docs/latest/operations/rule-configuration.html|Retention Rules> for that data source such that it has `replicants=2` on `tier1`. The coordinator will decide which two historical hosts will have each of those 2 replicas within tier1 historicals. You should use a load rule ( loadForever, loadByPeriod, etc) and not a broadcast rule that will push replicas to all nodes.

To clarify though if it has more than 1 segment it will be spread out across all of them in a balanced manner. The replication for each segment will be spread across 2 nodes but the segments them selves will be spread out across all historicals in the tier.

Outside of tiering I am not aware of any way to indicate which datasource segments should load on which historicals.

Was this an accident or are you trying to clear the contents of the historical?

I just tested this on a standalone deployment.
• I removed the info_dir directory
• I waited for a few minutes and nothing happened on the rest of the directories
• The info_dir directory did not appear again.
I restarted all services:
• after a little bit, it rebuilt the info_dir directory and all of its content
• since this is a single historical, it wasn't clear if this was just the coordinator asking it to rebuild or if the historical rebuilt based on the content of the segment-cache.
• I looked at the logs:
    ◦ coordinator log - does not show any new segment assignment
    ◦ historical log - shows Loading and announcing segments 
I suspect you will see the same, the info_dir content will be recreated after you restart the historical.

Hi Team, I am trying build the Tableau dashboard on top of druid datasource. Issue is we are getting the live connection for workbooks in tableau desktop but if we publish that workbook it is taking the extract connection on tableau online (Not showing the option for live connection). Anyone faced this type issue? If so how to solve? Thanks... :slightly_smiling_face:

I am getting Java lang buffer memory error

Can you be more specific ? Is this happening during querying or ingestion? Any relevant log files ?

Can you share the indexer/MM log files and the task logs?

dry reducing max rows in memory. What is your druid.indexer.fork.property.druid.processing.buffer.sizeBytes?  You can set this to a max of 2 GB (make sure you also increase -XX:MaxDirectMemorySize in druid.indexer.runner.javaOptsArray. Basically you can set the buffer size to max of 2 GB and if you still get the error then reduce max rows in memory and also set create bitmap index to false for string fields you will definitely not be using in a filter.

this is very cool! Thank you for the writeup

I'm hoping for opinions and thoughts from all of you.
The topic: Schema Flexibility
What if you didn't have to worry about specifying a schema at ingestion?

While Druid has some schemaless characteristics, it isn't fully there yet.
- Current schemaless ingestion by using an empty `dimensions` list will currently identify columns as string.
- Nested Columns enable semi-dynamic schemas where the fields nested in the ingested object are automatically parsed into columns and give them proper data types. But this is only for nested columns.
So the question is, do you see value in extending the automatic schema detection with the right data type to all columns?  What features do you think this should have?

Two use cases come to mind:
- new data / POC : just through some data into Druid and query it
- schema evolution: by auto detecting at every ingestion, there is no need to maintain ingestion specs as columns appear, disappear and change in type.

Any other use cases?

hi <@U04KRGPGWAU>,
Not helping with your original question, but I suggest you not only check that indexing task completed, but also check that segments successfully loaded.
We've seen cases from time to time (with hadoop indexing) that eventual segments aren't named correctly - and you end up in situation where you think indexing worked but you can't actually query all data.

To have this be reliable, I would just create a consumer worker that reads from rabbitmq and push to either Kafka or Kinesis. You can use Flink for this as well.

Tranquility is no longer being worked on, it has been replaced by kafka and kinesis ingestions which provide lossless  exactly once guarantees.

We are currently using kafka, but are looking to reduce the cpu/memory footprint and operational complexity. If it's going through rabbitmq first the exactly once guarantees don't exactly accomplish anything anyways.

If you did patch Tranquility to work with newer version of Druid. please open source that work, my team could definitely use it as well :wink:

Have you looked at Redpanda? It is Kafka-compatible and claims to differentiate from Kafka in the ways you mentioned (hw footprint, operations)

I haven't tried it myself, but if you're interested in staying compatible with the Kafka ecosystem, that (and various cloud-hosted Kafka-compatible options) could be worth a look

Redpanda does seem interesting. I'm already partway through making a rabbitmq-stream ingestor. If that doesn't work out, it seems like this would be worth a shot.

At ingestion time, if you create a sketch from a column and that column is not included in the ingestion, then yes, the detailed data is "lost". You can ingest the raw data directly and still take advantage of approximations at query time. The difference is performance, if the data is partially aggregated at ingestion, the queries on it will be faster.

To add, usually people want to throw away the raw data and only store the sketch, to save space.  For strings in particular, it _might_ make sense to store both the value and the sketch.  For some details, see <https://intertubes.wordpress.com/2020/10/14/on-using-theta-sketches-with-druid/> .

Delta files are just parquet files. So you can ingest using druid batch ingestion. Delta files usually have a manifest file that list the files that have been changed in upsert operations. So the usually approach is to periodically ingest using the latest manifest. However in your case you probably want to ingest all the files.  The main issue I see is that you will have duplicate data as updates are not possible in druid at present. You will need to use latest/earliest in your ingestion and in your queries to handle that

Hi Ashok, this operation can be performed using a Druid datasource as the input to an ingestion - either using Native batch (<https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#druid-input-source>)  or MSQ <https://druid.apache.org/docs/latest/multi-stage-query/examples.html#insert-for-reindexing-an-existing-datasource>
You might be able to bring in additional data by combining the Druid datasource and an external source with a JOIN as shown in <https://druid.apache.org/docs/latest/multi-stage-query/examples.html#insert-with-join>

Out of curiosity, what will be in the new column?  Are you adding new, separate data to the existing data?  If so, how would you match it up?  Or is it something derived from the existing, or what?

yes adding new column in the copied data-source.

now requirement is to add additional (new) column to the existing data-source

which will have values for fresh streaming data / new insertion

For adding new column, If you just update the SPEC and add the new column and resubmit it Druid should be able to fetch the new column

and once new data is available from kafka side druid will have the new data in the new column

Sahoo I checked as you mentioned above it working fine. I added new column and for new insertion am getting the value to the added new column.  Is it possible to do update (similar to mysql) to the existing / previous records ?

sorry using kafka stream  (not using DSQL / native query)

Update is not possible i believe, you need to reingest as mentioned in this thread earlier

The Kafka and Kinesis ones are very similar: they share 90% of the same code through common superclasses

I know how to model data that's "an event happened at time T" but not this.

This sounds like a semi-additive measure (measures that don't add up over time).

In Imply's Pivot analytics query tool there is a function called PIVOT_NESTED_AGG() that handles this fairly elegantly ...

But without that function all I can think of is to nest two aggregates, one within the other, e.g.

```select dim1, sum(avg_conn) avg_conn
  from (
        select dim1, dim2, avg(conn) avg_conn
          from table
         group by dim1, dim2
       )
 group by dim1```
Let me know if that isn't what you are looking for.

Thanks.  John

this blog talks about how the PIVOT_NESTED_AGG function achieves this by issuing a nested query to Druid <https://blog.hellmar-becker.de/2022/07/31/using-imply-pivot-with-druid-to-deduplicate-timeseries-data/>