Apache Druid

Kusto's `bag_pack()` function creates a dynamic property bag object from a list of keys and values. Is there an equivalent function in Druid?

the syntax for `STRING_AGG` is
```STRING_AGG(expr, separator, [size])```

or
```STRING_AGG(DISTINCT expr, separator, [size])```

1024 is the default value for the [size] parameter

Is there a config option to change the 1024 default?

unfortunately druid doesn’t currently support “growable” buffer aggs, so you sort of need to have an idea of how many bytes the strings which will be combined together will take up in memory per group

you can specify it as the optional 3rd parameter

e.g. `STRING_AGG(DISTINCT x, ',', 32768)` or whatever size you need

I tried giving it 2000... it doesn't seem to work.

keep in mind that each grouping in the aggregate will need that many bytes, so be mindful of the amount of direct memory available to the process

it doesn’t know how many bytes there might be total, it fails fast

2000 should have given a slightly different error message than the default

you might need to adjust up until you find the number of bytes needed, and if the column has a very high cardinality it might not be possible
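A rough sketch of the sizing advice above, in Python. It builds a query using the `STRING_AGG(expr, separator, size)` form quoted earlier and estimates the buffer space if every group reserves `size` bytes. The helper names, the `events` table, the column `x`, and the group count are all hypothetical, and the estimate only restates the "that many bytes per group" rule of thumb from this thread.

```python
def string_agg_sql(column: str, table: str, separator: str = ",",
                   size: int = 1024, distinct: bool = False) -> str:
    """Build a Druid SQL query that uses STRING_AGG(expr, separator, size)."""
    expr = f'DISTINCT "{column}"' if distinct else f'"{column}"'
    return (f"SELECT STRING_AGG({expr}, '{separator}', {size}) AS agg "
            f'FROM "{table}"')


def buffer_bytes(num_groups: int, size: int) -> int:
    """Each group reserves `size` bytes, so the total grows linearly."""
    return num_groups * size


# Hypothetical table and column names, for illustration only.
sql = string_agg_sql("x", "events", size=32768, distinct=True)
print(sql)

# 10,000 groups at 32 KiB each is already about 328 MB of buffer space.
print(buffer_bytes(10_000, 32768))
```

The second number is the one to compare against the direct memory available to the process before raising the size further.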

Thanks Clint, increasing it to 2000 works... I had tried with approximate DISTINCT set to false, so a different error came up.

Once I enabled it, STRING_AGG() with a size of 2000, higher than the default 1024 bytes, works.

Or is there an alternative to STRING_AGG?

I tried it within the SQL query as follows: `STRING_AGG(expr, ' - ', 2000)`, but it's not working.

Hey guys. We currently have 50 Historical nodes with 1 million segments. We want to replace them with new Historical nodes (AWS EBS and changed config). How can we safely remove these 50 Historicals? I found that Coordinator decommissioning is too slow. Should I increase the max segments to move and change the Coordinator balancer strategy?

You can change the replicant count from the datasource > load rule. By default the replica count is 2, so if you have two Historicals the segments will be replicated on both. You can add a new rule specific to the datasource and change the replication there. Note that editing the defaults changes them for all datasources.
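For reference, a minimal Python sketch of what such a datasource-specific load rule could look like as JSON. It assumes the `loadForever` rule type with a `tieredReplicants` map and the `_default_tier` tier name; the helper name is hypothetical. To my understanding, a rule list like this is what the console submits to the Coordinator's rules API for the datasource.

```python
import json


def load_forever_rule(replicas: int, tier: str = "_default_tier") -> dict:
    """Build a loadForever rule keeping `replicas` copies on one tier."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return {"type": "loadForever", "tieredReplicants": {tier: replicas}}


# Two replicas, so each segment should land on two Historicals.
rules = [load_forever_rule(2)]
print(json.dumps(rules, indent=2))
```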

Still, if i shutdown one data node everything is inaccessible...

What error are you getting when you try to query?

The segment seems to be only on one node 

When does replication take place? In specific periods or instantly?

The Coordinator will take care of replication. Do you see any errors in the logs of the Coordinator and of the Historical that is running?
<https://druid.apache.org/docs/latest/design/coordinator.html#segment-availability>

For some reason I was able to fix it by setting replicas to two in my data source ingestion config. Is that what I was missing? I'm just checking as i'm not sure if that is the right thing to do or maybe there's another solution.

You can also set the replica count from the retention rule.

Well, it only worked after i edited ingestion config for some reason. May I also ask what is now happening with segments after they are no longer in use? Are they always stored on both servers or only while being used. 

With Druid you always define load rules (retention) that say how much data is kept loaded in Druid; the rest will be dropped. Only the specified period of data will be loaded by the Historicals.
<https://druid.apache.org/docs/latest/operations/rule-configuration.html>

By dropped you mean fully removed or just moved out of druid fast access cache? Thank you very much for your help so far btw :)

It will be dropped from the Druid Historical servers, but the segments will still be there in deep storage until you fully remove them from deep storage as well.

Now, for instance, I have the following situation

Only the active (real-time) segment is replicated and always available

Every other segment is inaccessible during downtime

Just to make sure, have you checked the docs here?

<https://druid.apache.org/docs/latest/operations/rule-configuration.html>

My understanding is "BeforeBy" means anything older than the timeframe specified, and "By" means anything newer than the timeframe specified.   "Drop" rules have "BeforeBy" and "By" variations, whereas "Load" only has a "By" variation.

The rules are evaluated in the order listed in the ruleset.

Assuming you have listed each ruleset on one line and comma-separated the rules in their order of evaluation, then I interpret your first two rulesets as:

1. Drop anything older than 580 days, otherwise load anything 2 years old or newer, including future dates.  If you didn't need future dates then you could have just used LoadForever as the second rule .... but since LoadForever doesn't have an "include future" option you have to do it the way you wrote it.  You could have used loadByPeriod (P580D+future) but the additional overlap doesn't affect the behavior since this is the lower priority rule.
2. Load anything 70 days old or newer, including future dates ... otherwise drop everything older than 90 days.  Not sure what happens to the data between 70 and 90 days ... I think the rulesets always have a "LoadForever" as the last rule that you cannot delete, so I would assume everything between day 70 and 90 would also get loaded.
The other two rules follow a similar pattern of interpretation.

Notice my different phrasing for the BeforeBy and By options ... I am assuming that BeforeBy means prior to the period specified, and By includes the time period specified.   So, for example:
• BeforeBy P1D means before yesterday
• By P1D means yesterday or later
I am not 100% sure of this though, would welcome confirmation.
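To make that reading concrete, here is a small Python sketch of first-match rule evaluation under the interpretation described above. It is a simplification, not Druid's implementation: each segment is reduced to a single timestamp (real rules compare intervals), periods are `timedelta` values rather than ISO-8601 strings, and only three rule types are handled. The rule list mirrors the second ruleset, with the trailing `loadForever` that is assumed to be the default.

```python
from datetime import datetime, timedelta, timezone


def matches(rule: dict, segment_time: datetime, now: datetime) -> bool:
    """Return True if the rule applies to a segment at `segment_time`."""
    kind = rule["type"]
    if kind == "loadForever":
        return True
    cutoff = now - rule["period"]
    if kind == "dropBeforeByPeriod":
        # "BeforeBy": strictly older than the cutoff.
        return segment_time < cutoff
    if kind == "loadByPeriod":
        # "By": from the cutoff up to now, optionally including the future.
        if segment_time > now:
            return rule.get("includeFuture", True)
        return segment_time >= cutoff
    raise ValueError(f"unsupported rule type: {kind}")


def first_action(rules: list, segment_time: datetime, now: datetime):
    """Rules are evaluated in order; the first matching rule decides."""
    for rule in rules:
        if matches(rule, segment_time, now):
            return "load" if rule["type"].startswith("load") else "drop"
    return None


now = datetime(2023, 6, 23, tzinfo=timezone.utc)
rules = [
    {"type": "loadByPeriod", "period": timedelta(days=70), "includeFuture": True},
    {"type": "dropBeforeByPeriod", "period": timedelta(days=90)},
    {"type": "loadForever"},  # assumed default rule at the end of the chain
]
for age_days in (10, 80, 100):
    print(age_days, first_action(rules, now - timedelta(days=age_days), now))
```

In this model a segment between 70 and 90 days old falls through both period rules and is picked up by the final `loadForever`, which matches the guess above.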

John, BTW
we have configured the above rules in the below areas.  .
1. Edit retention rules 
2. coordination dynamic config from the console level
But data deletion was happening as expected
Do we need to configure anything else apart from the above to automatically delete the deep storage data/segments?

I believe the retention rules only determine whether or not segments get pre-loaded onto Historicals or not ... the segments still exist in deep storage unless you issue a "kill" task.
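A minimal Python sketch of what a manual "kill" task payload could look like, assuming the `kill` task type with `dataSource` and `interval` fields. The helper name, datasource name, and interval are hypothetical. As I understand it, a kill task only removes segments that are already marked unused within the interval, and the JSON would be submitted to the Overlord's task endpoint.

```python
import json


def kill_task(data_source: str, start: str, end: str) -> dict:
    """Build a kill task spec for unused segments in [start, end)."""
    return {
        "type": "kill",
        "dataSource": data_source,
        "interval": f"{start}/{end}",
    }


# Hypothetical datasource and interval, for illustration only.
task = kill_task("example_datasource", "2021-01-01", "2022-01-01")
print(json.dumps(task, indent=2))
```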

Look at the bottom of the page for the link Daniel provided ... it addresses permanent deletion of data.

Hey Guys,

We configured the below properties
```druid.coordinator.kill.datasource.on=true
druid.coordinator.kill.maxSegments=100```
As per the reference below, we have configured these properties in the Coordinator server's runtime.properties file:
/opt/druid/conf/druid/cluster/master/coordinator-overlord/runtime.properties

Apart from the configured properties, do we need to set any additional properties in order to automate the kill operation?

<https://support.imply.io/hc/en-us/articles/360035308573-Deep-storage-data-segment-deletion-using-coordinator-kill>

But the document suggests that once the above properties are in place, we need to restart the master nodes. I am not sure about this point: which master nodes do we need to restart? Or is restarting just the master services sufficient?

But the reference below does not say anywhere that the master nodes need to be restarted.

<https://imply.io/blog/apache-druid-recovering-dropped-segments/>

Hey guys, please share your suggestions.

To pick up a properties change on Coordinator or Overlord, I think you have to restart all master nodes, one at a time to allow proper failover if you happen to restart the leader node.

I have restarted all the master nodes, but the datasource has been wiped out.

Is it possible to rebuild the datasource on top of the existing deep storage data in HDFS?

What are the retention policies on this datasource?  It is possible you have set the retention to "drop" all time chunks ... in which case the segments are still there, just not loaded for querying.

On the Datasources screen of the web console, click the "Show Unused" selector to see if the datasource is still there, just not loaded.

Well no one here is restarting their cluster to see aggregations or transformations working, but maybe a bit more context?

Are you running batch (s3, hdfs) or streaming (kafka, kinesis) ingestions?
Are these aggregations rollup jobs? How are you setting them?
Are you appending or replacing data?

Maybe you could provide an ingestion spec sample so we could understand the job and the transformations.

This is just a strictly "does Druid offer this" type question <@U04L6JH4H2M>

For Druid the question may be more related to Broadcast tables and Lookups following a cluster restart ... I'm not sure if this just works the same way as Historicals loading segments (i.e. they are loaded immediately upon cluster restart once the Coordinator gets things sorted out).

Did you build this ingestion via web console? If yes, could you see your data in the sample?

Do you have the number of partitions and the number of tasks in the same ratio, like 1:1? If not, Kafka might be emitting faster than Druid is consuming, because the task count is less than the number of partitions in the Kafka topic.

When you say "not getting the latest data" do you mean that it's ingesting it slowly but the data takes a long time to get in? Or do you mean that nothing is getting in at all?

If it's slow, you might take a look at your Kafka lag in the supervisor stats and see if that's growing.
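One simple way to read "is the lag growing" is to compare a few consecutive lag readings. This is only a sketch: the function name is hypothetical, and the sample values stand in for numbers you would read off the supervisor stats yourself at regular intervals.

```python
from typing import Sequence


def lag_is_growing(samples: Sequence[int]) -> bool:
    """True if every lag reading is larger than the one before it."""
    if len(samples) < 2:
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Hypothetical readings taken a minute apart.
print(lag_is_growing([100, 250, 900]))
print(lag_is_growing([900, 400, 50]))
```

A steadily rising series suggests the tasks cannot keep up with the topic, while a falling one suggests ingestion is catching up.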

Also, if your timestamps aren't right, it can sometimes look like data is coming in late. I've seen a misconfigured server send timestamps in the past, so everything was late. Compare the Kafka metadata timestamps to the timestamps in the data to see if they're close.

Also, take a look at the data you're bringing in. If it's flat structures with a few values, it should zoom.

If each Kafka message is several megabytes in size, then maybe you need to rethink some design decisions.

If you're getting nothing at all, then you'll need to check your logs. One thing that can cause this is if the offset that Druid wants to use has been deleted by Kafka's retention policy. Druid can get stuck in a cycle of trying to get data that's already gone. In that case, you might need to reset your supervisor. Be really careful with that though, it's very easy to lose or double your data.

This is WIP <https://github.com/apache/druid/pull/10920>

Thanks a lot <@U030MBK46BD>. Will go through the details you have shared.

Is it possible to create a string metric with set semantics?  Something like an HLL sketch but being able to pull the actual string values stored?

I could set a string dimension field but for performance reasons I would rather just append to the set and pull the entire aggregated set in a single row. In my use case the string I want to add has a high cardinality (>1 million) and the number of rows (and segments) would explode if I added it as a dimension.

Hey Vinod! There are quite a few examples out there - check out the videos on <https://druidsummit.org> - and you can also see Druid observing itself (!) with Prometheus and Grafana in the Metrics course at <https://learn.imply.io> by Imply

I thought it was done asynchronously through the `druid.coordinator.kill` property settings ... I don't see any references to kill actions in the compaction task log.

Yes, that's right. I have compacted a few datasources.. let's say a datasource with 'n' uncompacted segments. After compaction, I still have n+1 objects in S3 deepstorage corresponding to those uncompacted segments. So, just wanted some information on if there's anything that I need to do to hint Druid to delete the uncompacted segments

This is exactly the problem I'm referring to -
<https://github.com/apache/druid/issues/9755>

<https://github.com/FrankChen021|FrankChen021> has described the problem with absolutely zero ambiguity.

yes, that should go to `thrownAway` count in the logs and will not be kept inside druid.

Ex:
```[X]events thrown away. Possible causes: null events, events filtered out by transformSpec, or events outside earlyMessageRejectionPeriod / lateMessageRejectionPeriod.```


Then I think collecting all the data and throwing it away is inefficient. Instead, I would like to collect the data once and split it across the individual data sources. What should I do?

You could ingest it unfiltered from Kafka into datasource1, then use datasource1 as the source and filter the required data into a new datasource.

<@U03R5H3B750> You seem to be referring to this function, is that correct?

I don't know that this limit can be changed at query time.

My understanding is that the limit is used for broadcast objects (e.g. Lookups) as well as loading secondary tables and intermediate results for join operations.

I have seen it raised as high as 1m without causing problems, but I think it is highly dependent on how much heap memory you have to spare on your cluster.

In the cluster tuning doc page <https://druid.apache.org/docs/latest/operations/basic-cluster-tuning.html|HERE> there is some mention of heap sizing for Lookups ... maybe use that as reference.

Or can I increase this limit at request time when calling the query API?

are you using MM-less ingestion extension?

I believe this is expected as per the guide. Could you please try setting `useGroupingSetForExactDistinct`?

```When useApproximateCountDistinct is set to "false", the computation will be exact. In this case, expr must be string or numeric, since exact counts are not possible using prebuilt sketches. In exact mode, only one distinct count per query is permitted unless useGroupingSetForExactDistinct is enabled.```
<https://druid.apache.org/docs/latest/querying/sql-aggregations.html>
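A minimal Python sketch of a SQL request body with both context flags from the quoted doc passage set. The query, table, and column names are hypothetical, and the helper name is made up for illustration; the `query` plus `context` shape is, to my understanding, what the Druid SQL HTTP API accepts.

```python
import json


def exact_distinct_payload(sql: str) -> dict:
    """Wrap a SQL string with context flags for exact distinct counts."""
    return {
        "query": sql,
        "context": {
            "useApproximateCountDistinct": False,
            "useGroupingSetForExactDistinct": True,
        },
    }


# Two exact distinct counts in one query (hypothetical table and columns).
payload = exact_distinct_payload(
    'SELECT COUNT(DISTINCT "user_id"), COUNT(DISTINCT "session_id") '
    'FROM "events"'
)
print(json.dumps(payload, indent=2))
```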

Thank you, it solved my problem. I think I know the reason. Thank you again!!

Hi Dejan, I can't answer for specific directory paths, but  I can respond to some extent to your questions:

1. The Kafka (streaming) ingestion engine runs the data through three stages:
    a. When the data first comes from Kafka it is parsed to understand the schema and then appended to an unindexed row buffer in JVM heap, where it can immediately service queries.  `maxRowsInMemory`,  `maxBytesInMemory` and `intermediatePersistPeriod` limit how much data can accumulate here before being "persisted" to local disk
    b. Data from the row buffer is periodically compiled into a full segment format (columnar, dictionary encoded, compressed) and stored as a "mini-segment", or what we call an "intermediate persist" file, which is still located on local disk for the MM that is ingesting it.  Many of these files will likely be created before building and publishing a "real" segment to deep storage.   The data in stages a and b above are both considered "real-time segments".
    c. When the amount of data accumulated reaches a larger threshold (e.g. `maxRowsPerSegment`), all of the intermediate persist files will be compacted into a single "real" segment and pushed to deep storage.  This process is generally called the "Handoff", transferring the segment from real-time to historical status.
2. Ingestion node failure.
    a. If you have ingestion replicas set up they both operate in tandem, syncing to the same Kafka offsets, so when one replica dies the other one can continue on.  If they are both operating without error then when the first replica finishes publishing a segment the second one is told to abort that segment, but both replicas would continue ingesting the next segment.
    b. If you do not have ingestion replicas (i.e. only one task) and it fails, then in the worst case it will have to go back to the Kafka offset related to the last published segment and start over again from there.  There may be an intermediate savepoint related to the intermediate persist files stored on the MM local disk, but I don't know enough about that part of the ingestion to say whether the recovery can pick up from that point ... someone else will have to chime in here :slightly_smiling_face: 
3. Someone else will have to clarify this one too ... I don't know for sure what happens if the cluster loses contact with Deep Storage.  
    a. Ingestion -- if the ingestion tasks cannot publish a segment to deep storage I imagine you will see ingestion task failures resulting in Kafka lag and retries
    b. Historicals have deep storage segments loaded onto their local disk, so I don't know if they can continue to service queries even if they cannot contact deep storage anymore.  
Hoping someone else can fill in the gaps here.
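The three ingestion stages in point 1 can be pictured with a toy Python model. This is purely illustrative: the class and its fields are hypothetical, the thresholds only borrow the names `maxRowsInMemory` and `maxRowsPerSegment`, and real tasks also persist on byte and time limits and build actual segment files.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIngestionTask:
    max_rows_in_memory: int
    max_rows_per_segment: int
    row_buffer: list = field(default_factory=list)  # stage a: heap buffer
    persisted: list = field(default_factory=list)   # stage b: mini-segments
    published: list = field(default_factory=list)   # stage c: handed off

    def pending_rows(self) -> int:
        """Rows not yet published (buffer plus intermediate persists)."""
        return len(self.row_buffer) + sum(len(p) for p in self.persisted)

    def persist(self) -> None:
        """Spill the in-memory buffer into one intermediate persist."""
        if self.row_buffer:
            self.persisted.append(self.row_buffer)
            self.row_buffer = []

    def hand_off(self) -> None:
        """Merge all intermediate persists into one published segment."""
        self.persist()
        merged = [row for chunk in self.persisted for row in chunk]
        self.published.append(merged)
        self.persisted = []

    def add(self, row) -> None:
        self.row_buffer.append(row)
        if len(self.row_buffer) >= self.max_rows_in_memory:
            self.persist()
        if self.pending_rows() >= self.max_rows_per_segment:
            self.hand_off()


task = ToyIngestionTask(max_rows_in_memory=3, max_rows_per_segment=7)
for row in range(10):
    task.add(row)
print(task.published)   # one merged segment holding rows 0 through 6
print(task.persisted)   # rows 7 through 9 wait in one intermediate persist
```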

Thanks.  John

Thank you very much for your reply and detailed info John! I really appreciate it and indeed hope someone would be able to expand on stuff you were not able to help with.

Meanwhile, may I just ask you about segment-cache. Is that what you're referring to as "mini-segments"? <@U04C4593VDY>

Hi Dejan,  "segment-cache" can mean one of two things:
• On the Historical side, when published segments are loaded from Deep Storage into the historicals, sometimes people call that "caching" the segments.   But more likely it is ...
• The Coordinator and Broker nodes both cache a list of metadata for all segments that are "active" in the cluster.  This uses heap memory for both Coordinator and Broker processes and should be included in the memory sizing calcs on the Basic Cluster Tuning page.
By "mini-segments" I am referring to the ingestion-side, intermediate persisted files ... these are stored in columnar, indexed and compressed format, just like a real segment, but they have not been combined and published yet, so they are still considered to be part of the "real-time segment" category.

What you have posted above is the supervisor status.
Suggestions:
Check the task status under the supervisor. Usually task status reveals more details on why the task is being marked as UNHEALTHY.

<@U0316GKJR0R> this is my task status
The error says it could not allocate the segments:
```{
  "id": "index_kafka_eber_vehicles_gps_1b7f49d6d2b1b16_gfpphkcj",
  "groupId": "index_kafka_eber_vehicles_gps",
  "type": "index_kafka",
  "createdTime": "2023-06-27T04:37:31.966Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": 19170,
  "location": {
    "host": "10.101.56.213",
    "port": 8101,
    "tlsPort": -1
  },
  "dataSource": "eber_vehicles_gps",
  "errorMsg": "org.apache.druid.java.util.common.ISE: Could not allocate segment for row with timestamp[2023-06-26T..."```


Could you check the overlord logs for errors related to lock acquisition and segment allocation in general?

Is kafka ingestion running on an interval in a datasource which already has segments with a different granularity than the segmentGranularity specified for the streaming ingestion job?

Below are the errors I'm getting:

1- coordinator
```2023-06-27T11:05:55,032 ERROR [qtp1286172885-129] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-27T11:06:23,854 ERROR [qtp1286172885-127] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-27T11:06:43,555 ERROR [qtp1286172885-123] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-27T11:07:21,343 ERROR [qtp1286172885-144] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-27T11:07:50,156 ERROR [qtp1286172885-158] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-27T11:08:19,541 ERROR [qtp1286172885-147] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-27T11:08:48,454 ERROR [qtp1286172885-122] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-27T11:09:16,756 ERROR [qtp1286172885-139] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-27T11:09:36,043 ERROR [qtp1286172885-155] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T08:33:04.624Z/2023-06-27T08:33:04.625Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].```
2- middle manager

```2023-06-27T11:06:44,471 ERROR [forking-task-runner-0] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited with code[2] for task: index_kafka_eber_vehicle_components_status_38c7c2996ed5932_cmmlohne```

> Is kafka ingestion running on an interval in a datasource which already has segments with a different granularity than the segmentGranularity specified for the streaming ingestion job?
Could you confirm if this datasource has other segments for this interval?

Also I think you are using segmentGranularity of WEEK. Please avoid it

You should never use a segment granularity of more than 1h: it's the window that has to be replayed from Kafka on failures. Run a compaction job to merge the segments later.
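To see why the granularity matters, here is a small Python sketch that computes which segment interval a row timestamp falls into. It assumes WEEK buckets start on Monday at 00:00 UTC, which agrees with the `segmentInterval` printed in the error logs earlier in this thread; the function itself is a hypothetical illustration, not Druid code.

```python
from datetime import datetime, timedelta, timezone


def segment_interval(ts: datetime, granularity: str):
    """Return (start, end) of the time bucket containing `ts`."""
    if granularity == "HOUR":
        start = ts.replace(minute=0, second=0, microsecond=0)
        return start, start + timedelta(hours=1)
    if granularity == "WEEK":
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        start = midnight - timedelta(days=midnight.weekday())  # back to Monday
        return start, start + timedelta(weeks=1)
    raise ValueError(f"unsupported granularity: {granularity}")


# Row timestamp taken from the rowInterval in the error log above.
row = datetime(2023, 6, 27, 8, 33, 4, 624000, tzinfo=timezone.utc)
week = segment_interval(row, "WEEK")
hour = segment_interval(row, "HOUR")
print(week[0].isoformat(), week[1].isoformat())
print(hour[0].isoformat(), hour[1].isoformat())
```

With WEEK the row lands in the 2023-06-26 to 2023-07-03 bucket, so a failed task may have up to a week of data to replay, while HOUR limits that to a single hour.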

I've fixed that now. Most of them are in a running state, but I can still see errors.

task error status
```{
  "id": "index_kafka_eber_gateways_sensors_data_5a0ebe22f44a3f5_cepalohc",
  "groupId": "index_kafka_eber_gateways_sensors_data",
  "type": "index_kafka",
  "createdTime": "2023-06-28T06:53:01.032Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": -1,
  "location": {
    "host": "10.101.60.160",
    "port": 8100,
    "tlsPort": -1
  },
  "dataSource": "eber_gateways_sensors_data",
  "errorMsg": "The worker that this task was assigned disappeared and did not report cleanup within timeout[PT15M]...."
}```
And I can see errors on my coordinator node as well:

```2023-06-28T04:05:45,454 ERROR [qtp1286172885-118] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T11:15:20.599Z/2023-06-27T11:15:20.600Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-28T04:48:58,796 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Tier[_default_tier] has no servers! Check your cluster configuration!: {class=org.apache.druid.server.coordinator.rules.LoadRule}```

> ```2023-06-28T04:05:45,454 ERROR [qtp1286172885-118] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T11:15:20.599Z/2023-06-27T11:15:20.600Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].```
You are still using segment granularity of WEEK

I suspect you have segments with MONTH or some other granularity for the datasource+interval you are ingesting to

> ```2023-06-28T04:48:58,796 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Tier[_default_tier] has no servers! Check your cluster configuration!: {class=org.apache.druid.server.coordinator.rules.LoadRule}```
Could you share a screenshot of the Servers tab from the Druid console?

process 1 was the right process. I wasn't choosing the right threads from flame graph.

<https://druid.apache.org/docs/latest/ingestion/tasks.html#kill> -> The coordinator deletes unused segments by submitting kill tasks

you can check full log at "Logs" from task.

Sure. In my helm templates I created the following ConfigMap:
```apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: druid
    release: druid
  name: druid-metrics-file
  namespace: "{{ .Values.namespace }}"
data:
  metrics-dimensions.json: |
    {{ toJson .Values.metrics | indent 4 }}```
I have a `values.yaml` file with all the config, among them the metrics:
```metrics:
  query/time:
    dimensions:
      - dataSource
      - type
    type: timer
    conversionFactor: 1000
    help: Seconds taken to complete a query.
  query/bytes:
    dimensions:
      - dataSource
      - type
    type: count
    help: Number of bytes returned in query response.
...```
You might ask why I have the metrics as YAML in the values.yaml file. The reason is <https://stackoverflow.com/questions/73468386/how-to-add-json-data-in-configmap-creation-in-argocd>

I then added a volume in each of the deployments (broker, coordinator and router) as well as the statefulsets (historical and middleManager):
```volumes:
        - name: metrics-conf
          configMap:
            name: druid-metrics-file
            items:
              - key: metrics-dimensions.json
                path: metrics-dimensions.json```
Then you need to mount the volume:
```volumeMounts:
            - name: metrics-conf
              mountPath: /opt/druid/conf/druid/metrics-dimensions.json
              subPath: metrics-dimensions.json
              readOnly: true```
And finally the env var in the container:
```- name: druid_emitter_statsd_dimensionMapPath
  value: "/opt/druid/conf/druid/metrics-dimensions.json"```
Hope it helps!

Hi Siddharth, how is this additional data provided, and how many records of data is it, and how does it correlate to the other data you are ingesting?

If the data is very small you can use cut-and-paste to ingest small amounts of data directly on the console. Otherwise it would likely need to be ingested in one of the available file formats.

But ultimately it would need to be ingested into a datasource (table) or lookup (key/value map).

Let's say a user has a file called Health.csv which contains data of patients for the year 2019 from a specific hospital. The "Description" would be a string that would be something like "Data for Patients from Some Example Hospital", and the "Year" would be "2019".

I also want this data to be removed if I remove the related data sources from Druid.

Okay so you are asking for metadata for the datasources you are creating.  I don't know of anything other than creating a separate datasource to hold this metadata.   Two ways I can think of offhand:

• If you create a single "metadata" datasource for the entire cluster, this would have one record of metadata per datasource, and if you wanted to update or delete any records you would have to run a quick reingestion job on the entire datasource to make the change.  This would run quickly, could be driven via API if you want ... and could be done via SQL using the MSQ API.
• You could also create a separate datasource corresponding to each datasource to hold the metadata for that one datasource ... this would have only one record in it, maybe you name it in a convenient way (e.g. "<datasource_name>_meta") which makes it easy to delete when you delete the corresponding datasource.
Would either of those work for you?

it's kinda hard, there are a few discussions here but I don't know if they are still up or if slack already removed them

basically, druid is HA in the same DC
if your data is on S3, and you consider S3 as external or already hyper HA, then that's "done", or have a way of replicating the segments ASAP

druid (usually) uses ZooKeeper to keep track of its peers, and it needs a central database (mysql or postgres) to keep track of some information
this database should definitely be HA, even if in the same region

having "two" druid clusters in two DCs, both joining the same ZK, will turn them into one big cluster, and if you have the database set up for HA in another region as well, maybe that's already enough to handle most failures that can happen
but it's a lot of moving pieces and you need to think about each one of them, e.g.: is kafka also HA with a good network to all regions? when the MM task fails it re-reads all data since the last offset saved in the database (the coordinator election would also have to run)

in any scenario, when the entire region hosting the elected druid supervisor goes down, I expect at least ~3 minutes of downtime before druid resyncs everything with ZK and starts new peons (tasks that the MM will start to sync the missing data in the other DC), and that is assuming the database (metastore) was not affected at all

Another option might be to simply load your data into two clusters, one in each DC.

Thanks for the prompt response <@U038XPQFJHM> & <@U0411DE2SS0>. Yeah, it does look complicated. I wonder if doing something as simple as SAN replication would do the trick: in essence 2 clusters, one in each DC, with data replication done via SAN replication or cluster-to-cluster replication, if that function exists. Data would ingress to both clusters, which would run independently in each DC.

I think normally people don't use SAN much for low-latency distributed databases. Although I suppose it could be done, if response times are good.

I agree, I would not see SAN used here
Maybe if you are using MinIO as a backend for the object-storage [segments], and use the rancher *Longhorn* for backup/distribute between multiple DC's
But I think it's just easier to design your system to either accept that druid may have brief moments of unavailability, or consume the data twice in a kind of 'split brain' setup with two clusters. If you are using kafka as a source, you can pretty much guarantee that the data will be the same, as long as you have a good process for updating both druid ingestion specs at the same time

Hi <@U05E7LJ94NB>, I am with the DevRel team at Imply and close to your timezone. Please let me know if I can help you!