Jelena Zanko
09/21/2022, 8:48 PM
MBM
09/22/2022, 11:22 AM
David Cromberge
09/23/2022, 3:19 PM
victor regalado
09/23/2022, 4:29 PM
SELECT
"key", SUM("impression")
FROM "event_agg_metrics"
WHERE TIME_FORMAT("__time", 'yyyy-MM-dd') = '2022-09-15' AND "user_id" = '123'
GROUP BY "key"
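One note on the WHERE clause above: wrapping __time in TIME_FORMAT forces Druid to evaluate the expression row by row and prevents it from pruning segments by time. If that matters for this query, the same day can be expressed as a plain __time range; a sketch against the same table:
SELECT
"key", SUM("impression")
FROM "event_agg_metrics"
WHERE "__time" >= TIMESTAMP '2022-09-15' AND "__time" < TIMESTAMP '2022-09-16' AND "user_id" = '123'
GROUP BY "key"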
Cory Johannsen
09/23/2022, 4:38 PM
Samarth Jain
09/23/2022, 6:30 PM
fieldName in the ingestion spec by using the column name in the table.
{
"fieldName": <table_column_name>,
"name": "name",
"type": "longSum"
}
Iceberg allows columns to be renamed, so the column name in the table may not match the column name in data files that were written before the rename. This is problematic because the Druid indexer extracts the value from a row by column name; since it can't find a column with that name in the row, it ends up assigning the default value for the datatype. Note that this isn't a problem for other engines like Spark or Trino, because they rely on the Iceberg read path, which resolves a column by its fieldId rather than its name. The only workaround for such a situation is to backfill the table with the new column name.
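To make the mismatch concrete, a minimal hypothetical sketch: say the Iceberg column was originally named imp_count and was later renamed to impression (both names made up for illustration). The ingestion spec naturally references the new name:
{
"fieldName": "impression",
"name": "impression",
"type": "longSum"
}
Data files written before the rename still store the column as imp_count, so the name-based lookup finds nothing in those rows and the aggregator falls back to the default value (0 for longSum), exactly as described above.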
Has anyone else in the community encountered something similar? If so, how did you solve it?
Eyal Yurman
09/23/2022, 6:33 PM
MBM
09/27/2022, 6:23 AM
Sergio Ferragut
09/27/2022, 8:57 PM
tilak chowdary
09/27/2022, 11:44 PM
SELECT
id
FROM view
WHERE __time BETWEEN '2022-09-22 15:09:17.0' AND '2022-09-27 15:09:17.0'
and MV_CONTAINS(tags, ARRAY["tag1", "tag2"])
and MV_CONTAINS(ARRAY["tag1", "tag2"], tags)
GROUP BY id
LIMIT 100 OFFSET 0
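A note on the two MV_CONTAINS calls above: MV_CONTAINS(tags, ARRAY[...]) is true only when tags contains all of the listed values, and swapping the arguments checks containment in the other direction, so requiring both effectively demands an exact match of the two sets. If the intent is "rows whose tags include at least one of the values", MV_OVERLAP does that. Also note that in Druid SQL double quotes denote identifiers, so the string literals should use single quotes. A sketch of the "any of" variant against the same view:
SELECT
id
FROM view
WHERE __time BETWEEN '2022-09-22 15:09:17.0' AND '2022-09-27 15:09:17.0'
and MV_OVERLAP(tags, ARRAY['tag1', 'tag2'])
GROUP BY id
LIMIT 100 OFFSET 0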
İbrahim Ercan
09/28/2022, 12:33 PM
$.raw_info.foo
I've tried jq but couldn't succeed.
{
"timestamp": 1664192716,
"raw_info": "{\"foo\":\"bar\"}"
}
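If the goal is to pull foo out of that stringified raw_info at ingestion time, one approach is a flattenSpec field of type jq, since the JSONPath-style "path" type can't parse a string-encoded object. A minimal sketch, assuming the jq implementation Druid bundles (jackson-jq) supports fromjson, which is the step that turns the raw_info string back into an object; the output column name raw_info_foo is just a made-up example:
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    {
      "type": "jq",
      "name": "raw_info_foo",
      "expr": ".raw_info | fromjson | .foo"
    }
  ]
}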
Jelena Zanko
09/29/2022, 3:12 PM
Ellen Shen
09/29/2022, 11:07 PM
David Palmer
09/30/2022, 12:38 AM
Mark Veidemanis
09/30/2022, 2:00 PM
Mark Veidemanis
09/30/2022, 2:01 PM
{
  "limit": 5,
  "queryType": "scan",
  "dataSource": "main",
  "intervals": [
    "1000-01-01/4000-01-01"
  ],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "search",
        "dimension": "msg",
        "query": {
          "type": "insensitive_contains",
          "value": "s"
        }
      },
      {
        "type": "in",
        "dimension": "src",
        "values": [
          "4ch"
        ]
      }
    ]
  },
  "order": "descending"
}
Yang Ou
10/01/2022, 10:35 PM
Yang Ou
10/01/2022, 10:37 PM
Yang Ou
10/02/2022, 3:45 AM
David Cromberge
10/03/2022, 2:31 PM
Mark Herrera
10/03/2022, 6:31 PM
We are creating a data repository where users will upload various CSV files. The inconvenience is that, over time, a huge number of data resources will be created that we will need to work with. Our idea is to give each user one data resource to which new files are added, i.e. the data resource is constantly replenished with new records and columns. We are also considering something like nested tables.
I would like to get advice from those who have implemented something like this.
I need a solution for how to store the data from the files uploaded by users. So far my idea is to create one datasource per user, in which each column is a JSON array of data from each new file. It goes something like this:
file_name1, file_name2, file_name3
ingestion_date json_array json_array
ingestion_date json_array json_array
ingestion_date json_array json_array json_array
But it is inconvenient in terms of selecting the data.
victor regalado
10/03/2022, 11:30 PM
Gian Merlino
10/03/2022, 11:44 PM
Vishal
10/04/2022, 11:43 AM
# Number of tasks per middleManager
druid.worker.capacity=4
# Task launch parameters
druid.indexer.runner.javaOpts=-server -XX:+UseG1GC -Xms1024M -Xmx1024M -XX:MaxDirectMemorySize=1g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=var/druid/task
# HTTP server threads
druid.server.http.numThreads=60
# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
druid.indexer.fork.property.druid.processing.numThreads=2
druid.query.groupBy.maxMergingDictionarySize=2000000000
druid.query.groupBy.maxOnDiskStorage=10000000000
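A quick sizing check on the peon settings above, using the guidance from the Druid docs that each task's direct memory should cover (druid.processing.numThreads + druid.processing.numMergeBuffers + 1) processing buffers:
# (2 + 2 + 1) * 100000000 bytes = 500000000 bytes ≈ 477 MiB per task,
# which fits within the -XX:MaxDirectMemorySize=1g set in druid.indexer.runner.javaOpts above.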
Mark Veidemanis
10/04/2022, 4:16 PM
# Java tuning
DRUID_XMX=1g
DRUID_XMS=1g
DRUID_MAXNEWSIZE=250m
DRUID_NEWSIZE=250m
DRUID_MAXDIRECTMEMORYSIZE=500m
druid_emitter_logging_logLevel=debug
druid_extensions_loadList=["druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "postgresql-metadata-storage", "druid-kafka-indexing-service"]
druid_zk_service_host=zookeeper
druid_metadata_storage_host=
druid_metadata_storage_type=postgresql
druid_metadata_storage_connector_connectURI=jdbc:postgresql://postgres:5432/druid
druid_metadata_storage_connector_user=druid
druid_metadata_storage_connector_password=hunter2
druid_coordinator_balancer_strategy=cachingCost
druid_indexer_runner_javaOptsArray=["-server", "-Xmx1g", "-Xms1g", "-XX:MaxDirectMemorySize=3g", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
druid_indexer_fork_property_druid_processing_buffer_sizeBytes=128MiB
druid_processing_buffer_sizeBytes=268435456 # 256MiB
druid_storage_type=local
druid_storage_storageDirectory=/opt/shared/segments
druid_indexer_logs_type=file
druid_indexer_logs_directory=/opt/shared/indexing-logs
druid_processing_numThreads=1
druid_processing_numMergeBuffers=1
DRUID_LOG4J=<?xml version="1.0" encoding="UTF-8" ?><Configuration status="WARN"><Appenders><Console name="Console" target="SYSTEM_OUT"><PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/></Console></Appenders><Loggers><Root level="info"><AppenderRef ref="Console"/></Root><Logger name="org.apache.druid.jetty.RequestLog" additivity="false" level="DEBUG"><AppenderRef ref="Console"/></Logger></Loggers></Configuration>
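A sizing observation on the settings above, using the same rule of thumb from the Druid docs that direct memory should cover (druid.processing.numThreads + druid.processing.numMergeBuffers + 1) processing buffers; the figures below are just arithmetic on the values shown, not a verified diagnosis:
# Services using druid_processing_* above:
#   (1 + 1 + 1) * 268435456 bytes ≈ 768 MiB of direct memory needed,
#   which is more than DRUID_MAXDIRECTMEMORYSIZE=500m, so services picking up these
#   values may fail Druid's startup check for direct memory unless they override it.
# Peons (javaOptsArray and fork property above):
#   (1 + 1 + 1) * 128 MiB = 384 MiB, comfortably under the -XX:MaxDirectMemorySize=3g in javaOptsArray.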
Pedro Garcia
10/05/2022, 6:20 AM
Nikhil Agrawal
10/06/2022, 7:43 AM
JAY PATEL
10/06/2022, 5:18 PM
druid.metadata.storage.connector.password={ "type": "aws-rds-token", "user": "USER", "host": "HOST", "port": PORT, "region": "AWS_REGION" }
but I get an exception that the user can't connect.
But if I set the password directly, it works.
Can someone help me with how to use aws-rds-token, or is using the password directly fine?
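For what it's worth, the aws-rds-token password provider is supplied by Druid's AWS RDS community extension (druid-aws-rds-extensions, if I have the id right), so it has to be on the extensions load list, and the RDS database user has to be set up for IAM authentication on the AWS side; a "user can't connect" error is consistent with either of those missing. A sketch of the relevant properties, with placeholder values and mysql-metadata-storage shown purely as a stand-in for whichever metadata-storage extension is actually in use:
druid.extensions.loadList=["druid-aws-rds-extensions", "mysql-metadata-storage"]
druid.metadata.storage.connector.user=USER
druid.metadata.storage.connector.password={ "type": "aws-rds-token", "user": "USER", "host": "HOST", "port": PORT, "region": "AWS_REGION" }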
JAY PATEL
10/07/2022, 3:14 PM
Nikhil Agrawal
10/07/2022, 6:59 PM