Apache Druid #kubernetes-druid

Julian Reyes

10/30/2024, 2:15 PM

Cross-posting here deployment done in EKS so wonder if someone else has done something similar https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1729603680963549

Asif A

11/07/2024, 7:40 AM

Hi all, We have been deploying the druid mmless in AWS EKS cluster. We tried to increase some JVM configurations for peon, but the peon pods are not taking that configuration, eventhough the configurations are mounted in the peon pods. We are following the doc: https://druid.apache.org/docs/latest/development/extensions-contrib/k8s-jobs/. We tried changing the "druid-tiny-cluster-peons-config" configmap for updating the "jvm.config". The runtime.properties are getting reflected to peon process, but the jvm properties are not getting taken by peon. Help us to resolve this issue.

Sharmin Choksey

11/12/2024, 1:34 AM

Hello, trying to move to zk-less deployment on k8s. Everything runs fine with ingestion and query until it is time for task rollover which is when it fails with this error,

Copy code

2024-11-11T23:02:18,051 INFO [main] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher - Stopping NodeRoleWatcher for [OVERLORD]...
2024-11-11T23:02:18,052 ERROR [org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcheroverlord] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher - Error while watching node type [OVERLORD]
java.lang.RuntimeException: IO Exception during hasNext method.
        at io.kubernetes.client.util.Watch.hasNext(Watch.java:183) ~[?:?]
        at org.apache.druid.k8s.discovery.DefaultK8sApiClient$2.hasNext(DefaultK8sApiClient.java:132) ~[?:?]
        at org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher.keepWatching(K8sDruidNodeDiscoveryProvider.java:269) ~[?:?]
        at org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher.watch(K8sDruidNodeDiscoveryProvider.java:238) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.io.InterruptedIOException

We use the druid-operator and these are the configs,

Copy code

druid.zk.service.enabled=false
    druid.serverview.type=http
    druid.coordinator.loadqueuepeon.type=http
    druid.indexer.runner.type=httpRemote
    druid.discovery.type=k8s
    druid.discovery.k8s.clusterIdentifier=cc-apache-druid

It is NOT a MM-less deployment. Is there any mis-configuration to look for?

Noor

11/20/2024, 11:00 AM

Hello everyone I have included org.apache.druid.server.metrics.QueryCountStatsMonitor for historical runtime properties in kubernetes cluster. But I could not able to see any metrics related to query. These are the config for runtime properties of historical: druid.port=8083 druid.service=druid/historical druid.plaintextPort=8083 # HTTP server threads druid.server.http.numThreads=60 #auth disable for start up probes druid.auth.allowUnauthenticatedHttpOptions=true druid.auth.unsecuredPaths=["/status", "/druid/broker/v1/loadstatus", "/druid/historical/v1/loadstatus"] # Processing threads and buffers druid.processing.buffer.sizeBytes=100MiB druid.processing.numMergeBuffers=4 druid.processing.numThreads=15 #druid.processing.tmpDir=var/druid/processing druid.processing.tmpDir=/druid/data/processing druid.processing.formatString=historical-hot-tier # Segment storage druid.segmentCache.locations=[{"path":"/druid/historical/hot/segment-cache","maxSize":24000000000}] druid.server.maxSize=24000000000 # Query cache druid.historical.cache.useCache=true druid.historical.cache.populateCache=true druid.cache.type=caffeine druid.cache.sizeInBytes=256MiB druid.query.groupBy.maxOnDiskStorage=10G # Tier information druid.server.tier=hot # GroupByV2 config druid.query.groupBy.numParallelCombineThreads=2 # Segment Loading druid.segmentCache.numLoadingThreads=8 druid.segmentCache.numThreadsToLoadSegmentsIntoPageCacheOnDownload=2 druid.segmentCache.numThreadsToLoadSegmentsIntoPageCacheOnBootstrap=8 # Metrics Monitors druid.monitoring.monitors=["org.apache.druid.client.cache.CacheMonitor", "org.apache.druid.java.util.metrics.SysMonitor", "org.apache.druid.java.util.metrics.JvmMonitor", "org.apache.druid.java.util.metrics.JvmCpuMonitor", "org.apache.druid.java.util.metrics.JvmThreadsMonitor", "org.apache.druid.server.metrics.QueryCountStatsMonitor"]

Corwin Lester

11/26/2024, 3:14 PM

Is anyone running Druid via the https://github.com/datainfrahq/druid-operator on Openshift?

✅ 1

Corwin Lester

11/26/2024, 3:55 PM

Having issues getting the tmp directory created with the 'druid' user, but it might be on our end.

rishikesh

12/04/2024, 10:46 AM

Hello, I upgraded druid31 and login druid and found some error like " *NO management poxy mode*" and try to used below configuration for this to resolved but not resolved Can you please help me to fixed this... The console is running in restricted mode. The management proxy is disabled, the console will operate with limited functionality. For more info refer to the web console documentation. It is possible that the console is experiencing an issue with the capability detection. You can enable the unrestricted console, but certain features might not work if the underlying APIs are not available.

Krishna

12/06/2024, 7:12 PM

Hi, when we try to enable orc on MM EKS we are seeing this error . can some one please help https://apachedruidworkspace.slack.com/archives/C0309C9L90D/p1732688653392079

schmichri

12/16/2024, 2:33 PM

Hey all, we've published our Druid Cluster Configuration for our kubernetes native druid installation (k8s jobs, no zookeeper, HPA for historical nodes, etc) It's based on the druid-operator and flux-cd. Secrets managed by SOPS, all endpoints TLS encrypted https://github.com/iunera/druid-cluster-config incl. examples for authorization and authentication and the pac4j Setup vor Azure AD / EntryID. you might be interested in it.

🙌 2

schmichri

12/16/2024, 2:35 PM

ah this was part of an open source project funded by the german ministry of public transportation (Fahrbar20)

Mahesha Subrahamanya

01/16/2025, 8:45 PM

Hello Team, We are constructing below druid index parallel API ingestion in Java code with below payload some other property however in Druid 28, below API works and loaded into Druid. When we run the same API and s3 data file in Druid 31 which fails with below error. One more thing noticed that "timestampColumn" was causing this failure. I changed it to "timeStampColumn" then it worked. Is there any column name constraint in Druid v31.

Copy code

ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.stats.TaskRealtimeMetricsMonitor - [499] unparseable events discarded. 
request failed with 500

--data '{ "segmentGranularity": "ALL", "inputFormatType": "PARQUET", "inputType": "S3", "inputPaths": [ "s3://druid_ingest_testing/landing-zone1/xyz/12345/10mil/load_1/" ], "timestampColumn": "__time" }'

Copy code

webClient.post()
    .uri("/indexer/v1/task")
    .contentType(MediaType.APPLICATION_JSON)
    .bodyValue(ingestionTask)
    .retrieve()
    .bodyToMono(ingestionResultReference)

Noor

01/22/2025, 10:51 AM

HI team I am getting this error in our coordinator logs. Can someone please help here

2025-01-22T10:43:51,885 INFO [Thread-48] org.apache.druid.security.basic.authorization.db.updater.CoordinatorBasicAuthorizerMetadataStorageUpdater - CoordinatorBasicAuthorizerMetadataStorageUpdater is stopped.

2025-01-22T10:43:51,885 INFO [Thread-48] org.apache.druid.security.basic.authentication.db.updater.CoordinatorBasicAuthenticatorMetadataStorageUpdater - CoordinatorBasicAuthenticatorMetadataStorageUpdater is stopping.

2025-01-22T10:43:51,885 INFO [Thread-48] org.apache.druid.security.basic.authentication.db.updater.CoordinatorBasicAuthenticatorMetadataStorageUpdater - CoordinatorBasicAuthenticatorMetadataStorageUpdater is stopped.

2025-01-22T10:43:51,886 INFO [CoordinatorBasicAuthenticatorCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthenticatorCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)

2025-01-22T10:43:51,886 INFO [CoordinatorBasicAuthorizerCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthorizerCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)

2025-01-22T10:43:51,886 INFO [CoordinatorBasicAuthorizerCacheNotifier-notifierThread-0] org.apache.druid.security.basic.CommonCacheNotifier - CoordinatorBasicAuthorizerCacheNotifier: Interrupted while handling updates for cachedUserMaps. (java.lang.InterruptedException)

2025-01-22T10:43:51,902 ERROR [HttpClient-Netty-Worker-5] com.google.common.util.concurrent.ExecutionList - RuntimeException while executing runnable com.google.common.util.concurrent.Futures$4@27fc53a9 with executor java.util.concurrent.ScheduledThreadPoolExecutor@760f7d57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17]

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@64ad1398[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@1c9b9c89[Wrapped task = com.google.common.util.concurrent.Futures$4@27fc53a9]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@760f7d57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17]

at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055) ~[?:?]

at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705) ~[?:?]

at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.1.jar:?]

at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.1.jar:?]

2025-01-22T10:43:51,907 INFO [HttpClient-Netty-Worker-5] org.apache.druid.java.util.http.client.pool.ResourcePool - giveBack called after being closed. key[<https://240.38.33.85:8283>]

2025-01-22T10:43:51,913 ERROR [HttpClient-Netty-Worker-6] com.google.common.util.concurrent.ExecutionList - RuntimeException while executing runnable com.google.common.util.concurrent.Futures$4@3fd43c9 with executor java.util.concurrent.ScheduledThreadPoolExecutor@760f7d57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17]

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@409efbf0[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@78d28b8c[Wrapped task = com.google.common.util.concurrent.Futures$4@3fd43c9]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@760f7d57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17]

at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055) ~[?:?]

at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705) ~[?:?]

at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.1.jar:?]

at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.1.jar:?

2025-01-22T10:43:51,913 INFO [HttpClient-Netty-Worker-6] org.apache.druid.java.util.http.client.pool.ResourcePool - giveBack called after being closed. key[<https://XX:8283>]

2025-01-22T10:43:51,914 ERROR [HttpClient-Netty-Worker-7] com.google.common.util.concurrent.ExecutionList - RuntimeException while executing runnable com.google.common.util.concurrent.Futures$4@473573be with executor java.util.concurrent.ScheduledThreadPoolExecutor@760f7d57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17]

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@2e2ef737[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@1f7becba[Wrapped task = com.google.common.util.concurrent.Futures$4@473573be]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@760f7d57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17]

at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055) ~[?:?]

at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562) ~[?:?]

at java.util.concurrent.ScheduledThreadPoolExecutor.execute(ScheduledThreadPoolExecutor.java:705) ~[?:?]

at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.1.jar:?]

at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.1.jar:?]

2025-01-22T10:43:51,915 INFO [HttpClient-Netty-Worker-7] org.apache.druid.java.util.http.client.pool.ResourcePool - giveBack called after being closed. key[<https://XXX:8283>]

2025-01-22T10:43:51,915 ERROR [HttpClient-Netty-Worker-8] com.google.common.util.concurrent.ExecutionList - RuntimeException while executing runnable com.google.common.util.concurrent.Futures$4@34e15537 with executor java.util.concurrent.ScheduledThreadPoolExecutor@760f7d57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17]

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5c456b81[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@d1cad3e[Wrapped task = com.google.common.util.concurrent.Futures$4@34e15537]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@760f7d57[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 17]

at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055) ~[?:?]

at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]

01/28/2025, 1:50 PM

Hi we recently migrated our dev servers from EC2 to EKS. Started seeing an issue where Datasources shows one segment to load but Services shows empty load/drop queues. Historical and Coordinator logs show segment load exception but unclear what is causing the exception. No errors on Zookeeper or S3. Any ideas?

Noor

03/21/2025, 6:46 AM

Hi team, We are getting all the metrics of druid exporter for our gke cluster. But not sure why we are not getting kafka related metrics , for example we need to monitor often for ingest/kafka/maxLag. Can someone guide here. For VM cluster we are getting but not sure why we are not getting for kafka related metrics in gke side

Goran Zaric

04/07/2025, 1:17 PM

Hello team 👋 Running Druid

v30.0.0

and we'd like to go for

v32.0.1

. Noticed couple of incompatible changes but is it safe to jump over 2 major versions? Current druid operator version is

v1.2.3

. How you would recommend approaching the upgrade?

Sergei

04/28/2025, 6:05 PM

Hi team, I am trying to deploy Imply to AWS EKS cluster managed by Karpenter. Is there any good getting started document ?

Luke Foskey

05/01/2025, 3:23 AM

Hi team, been sometime but is there any plans to update https://operatorhub.io/operator/druid-operator to align with the helm based operator, we'd prefer to maintain our current deployment workflow via operatorhub

Jb Graindorge

05/01/2025, 2:44 PM

Hi, I am using K8S MM-less feature.

druid.processing.intermediaryData.storage.type=deepstore

has been setup but when an ingestion start I get this error (truncated)

Copy code

2025-05-01T14:41:46,595 INFO [main] org.apache.druid.cli.CliPeon - Task file not found, trying to pull task payload from deep storage                                                                              │
│ 2025-05-01T14:41:46,827 ERROR [main] org.apache.druid.cli.CliPeon - Failed to start, 1 errors                                                                                                                      │
│ 2025-05-01T14:41:46,828 ERROR [main] org.apache.druid.cli.CliPeon - java.lang.IllegalStateException: Optional.get() cannot be called on an absent value                                                            │
│ java.lang.IllegalStateException: Optional.get() cannot be called on an absent value                                                                                                                                │
│     at com.google.common.base.Absent.get(Absent.java:44) ~[guava-32.0.1-jre.jar:?]                                                                                                                                 │
│     at org.apache.druid.cli.CliPeon$1.readTask(CliPeon.java:317) ~[druid-services-2025.01.0-iap.jar:2025.01.0-iap]                                                                                                 │
│     at org.apache.druid.cli.CliPeon$1$$FastClassByGuice$$44261020.GUICE$TRAMPOLINE(<generated>) ~[?:2025.01.0-iap]                                                                                                 │
│     at org.apache.druid.cli.CliPeon$1$$FastClassByGuice$$44261020.apply(<generated>) ~[?:2025.01.0-iap]                                                                                                            │
│     at com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:260) ~[guice-5.1.0.jar:?]                                                                                 │
│     at com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:171) ~[guice-5.1.0.jar:?]                                                                                                         │
│     at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185) ~[guice-5.1.0.jar:?]                                                   │
│     at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162) ~[guice-5.1.0.jar:?]                                                         │
│     at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40) ~[guice-5.1.0.jar:?]                                                                              │
│     at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:169) ~[guice-5.1.0.jar:?]                                                                                                               │
│     at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45) ~[guice-5.1.0.jar:?]                                                                              │
│     at com.google.inject.internal.SingleParameterInjector.inject(SingleParameterInjector.java:40) ~[guice-5.1.0.jar:?]                                                                                             │
│     at com.google.inject.internal.SingleParameterInjector.getAll(SingleParameterInjector.java:60) ~[guice-5.1.0.jar:?]                                                                                             │
│     at com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:171) ~[guice-5.1.0.jar:?]                                                                                                         │
│     at com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.

Not sure what is wrong and why the task payload is not written in S3 (everything is ok in IAM)

Jb Graindorge

05/06/2025, 9:36 AM

Hello team, We are using Druid on EKS (AWS) and peons are giving these errors :

Copy code

2025-05-06T09:16:45,015 ERROR [main] org.apache.druid.java.util.metrics.cgroups.Cpu - Unable to fetch cpu snapshot
2025-05-06T09:16:45,020 WARN [main] org.apache.druid.java.util.metrics.CgroupUtil - Unable to fetch cpu.shares
2025-05-06T09:16:45,021 WARN [main] org.apache.druid.java.util.metrics.CgroupUtil - Unable to fetch cpu.cfs_quota_us
2025-05-06T09:16:45,021 WARN [main] org.apache.druid.java.util.metrics.CgroupUtil - Unable to fetch cpu.cfs_period_us
2025-05-06T09:16:45,023 ERROR [main] org.apache.druid.java.util.metrics.cgroups.CpuSet - Unable to read cpuset.cpus
2025-05-06T09:16:45,023 ERROR [main] org.apache.druid.java.util.metrics.cgroups.CpuSet - Unable to read cpuset.effective_cpus
2025-05-06T09:16:45,024 ERROR [main] org.apache.druid.java.util.metrics.cgroups.CpuSet - Unable to read cpuset.mems
2025-05-06T09:16:45,024 ERROR [main] org.apache.druid.java.util.metrics.cgroups.CpuSet - Unable to read cpuset.effective_mems

I tried to add

Copy code

-Ddruid.metrics.cgroup.cpuPath=/sys/fs/cgroup/cpu.max
-Ddruid.metrics.cgroup.cpuSharesPath=/sys/fs/cgroup/cpu.weight
-Ddruid.metrics.cgroup.cpuQuotaPath=/sys/fs/cgroup/cpu.max
-Ddruid.metrics.cgroup.cpuPeriodPath=/sys/fs/cgroup/cpu.max
-Ddruid.metrics.cgroup.cpusetPath=/sys/fs/cgroup/cpuset.cpus
-Ddruid.metrics.cgroup.effectiveCpuSetPath=/sys/fs/cgroup/cpuset.cpus.effective
-Ddruid.metrics.cgroup.memPath=/sys/fs/cgroup/cpuset.mems
-Ddruid.metrics.cgroup.effectiveMemPath=/sys/fs/cgroup/cpuset.mems.effective

to the peon JAVA_OPTS but errors are still here... any idea please ? It looks like EKS is running cgroup v2, is it compatible with Druid ?

Jb Graindorge

05/06/2025, 11:22 AM

actually all the processes are giving these errors, even with the

Copy code

druid.monitoring.monitors=["org.apache.druid.java.util.metrics.JvmMonitor", "org.apache.druid.java.util.metrics.OshiSysMonitor", "org.apache.druid.java.util.metrics.JvmCpuMonitor", "org.apache.druid.client.cache.CacheMonitor", "org.apache.druid.server.metrics.QueryCountStatsMonitor", "org.apache.druid.java.util.metrics.CgroupV2CpuMonitor","org.apache.druid.java.util.metrics.CgroupV2DiskMonitor","org.apache.druid.java.util.metrics.CgroupV2MemoryMonitor"]

Jb Graindorge

05/07/2025, 7:31 AM

we were able to hide the errors/warnings within log4j2 conf

Copy code

<Logger name="org.apache.druid.java.util.metrics" level="off" additivity="false">
                <AppenderRef ref="Console"/>
            </Logger>

We still don't know why we got these errors, we will probably raise an issue with Imply support and/or on the github repo

Jon Laberge

05/21/2025, 7:05 PM

hey 👋 , I am currently working on getting druid running on GKE. I have a kafka indexing task that seems to complete fine, but never fully completes the tasks, instead it will hang for a while and then fail. My setup is roughly: • druid deployed with druid-operator using k8s discovery / ie zkless • coordinator playing role of overlord as well • deepstorage=google cloud storage • metadata = cloudsql / postgres Here is the log:

Copy code

2025-05-21T13:21:08,727 INFO [main] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher - Stopping NodeRoleWatcher for [OVERLORD]...
2025-05-21T13:21:08,729 ERROR [org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcheroverlord] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher - Error while watching node type [OVERLORD]
java.lang.RuntimeException: IO Exception during hasNext method.
	at io.kubernetes.client.util.Watch.hasNext(Watch.java:183) ~[?:?]
	at org.apache.druid.k8s.discovery.DefaultK8sApiClient$2.hasNext(DefaultK8sApiClient.java:132) ~[?:?]
	at org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher.keepWatching(K8sDruidNodeDiscoveryProvider.java:269) ~[?:?]
	at org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher.watch(K8sDruidNodeDiscoveryProvider.java:238) ~[?:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.io.InterruptedIOException
	at okhttp3.internal.http2.Http2Stream.waitForIo$okhttp(Http2Stream.kt:660) ~[?:?]
	at okhttp3.internal.http2.Http2Stream$FramingSource.read(Http2Stream.kt:376) ~[?:?]
	at okhttp3.internal.connection.Exchange$ResponseBodySource.read(Exchange.kt:281) ~[?:?]
	at okio.RealBufferedSource.exhausted(RealBufferedSource.kt:200) ~[?:?]
	at io.kubernetes.client.util.Watch.hasNext(Watch.java:181) ~[?:?]
	... 8 more
2025-05-21T13:21:08,733 ERROR [org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcheroverlord] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher - Expection while watching for NodeRole [OVERLORD].
java.lang.RuntimeException: java.lang.InterruptedException
	at org.apache.druid.concurrent.LifecycleLock$Sync.awaitStarted(LifecycleLock.java:144) ~[druid-processing-29.0.1.jar:29.0.1]
	at org.apache.druid.concurrent.LifecycleLock.awaitStarted(LifecycleLock.java:245) ~[druid-processing-29.0.1.jar:29.0.1]
	at org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher.keepWatching(K8sDruidNodeDiscoveryProvider.java:258) ~[?:?]
	at org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher.watch(K8sDruidNodeDiscoveryProvider.java:238) ~[?:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1081) ~[?:?]
	at org.apache.druid.concurrent.LifecycleLock$Sync.awaitStarted(LifecycleLock.java:139) ~[druid-processing-29.0.1.jar:29.0.1]
	... 8 more
2025-05-21T13:21:18,734 INFO [org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcheroverlord] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher - Exited Watch for NodeRole [OVERLORD].
2025-05-21T13:21:18,734 INFO [main] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher - Stopped NodeRoleWatcher for [OVERLORD].
2025-05-21T13:21:18,735 INFO [main] org.apache.druid.k8s.discovery.K8sDruidNodeDiscoveryProvider$NodeRoleWatcher - Stopping NodeRoleWatcher for [COORDINATOR]...

06/09/2025, 4:56 PM

Hi we have Druid setup over EKS. S3 for deep storage. Recently tried to configure local path for druid_indexer_task_baseTaskDir=/tmp/indexing-tasks but I still see NFS path being used /opt/druid/var/druid/task.... Is there any other property I need to set, what am I missing? This is causing errors saying Incremental persist failed, Cannot delete file etc.

schmichri

06/25/2025, 5:58 AM

I want to reshare this link https://github.com/iunera/druid-cluster-config/ because I see similar repeating issues which are solved in the mentioned configs.

Chris Warren

07/17/2025, 7:07 PM

Is anyone running MM-less (Kubernetes Task Runner) ingestion in production or serious testing? We’re trying it now on Druid 31.x and would love to trade notes

Eyal Yurman

08/06/2025, 11:33 AM

Has anyone tried running ingestion (Indexing tasks) and queries (brokers+historicals) on different Kubernetes clusters? Either MM-less or not. I have a use case where I want to run ingestion on AWS with spot instances, and commit the segments to deep storage in on-prem cluster.

Karan Kumar

08/12/2025, 8:54 AM

Hey Minor PR out for review : https://github.com/datainfrahq/druid-operator/pull/235 @Adheip Singh Could use a review here.

🙌 1

Sanjay Dowerah

09/03/2025, 10:36 AM

👋 Hello, team!

Sanjay Dowerah

09/03/2025, 10:39 AM

I am running Druid on an Openshift cluster, and using the Druid Delta Lake extension(https://github.com/apache/druid/tree/master/extensions-contrib/druid-deltalake-extensions) to connect and load Delta tables. However, I am running into the following issue, • error while loading with delta connector: only 1024 records of each constituent parquet file is loaded ◦ Update: The default result row limit for multi stage query using the Delta Lake extension is 1024. As the delta lake connection does not allow to edit the ingestion spec, looking for a solution to override the default

Sanjay Dowerah

10/08/2025, 8:06 AM

Hello Druid Community, Apologies for repeating this, just wanted to keep the loop alive I am running Druid on an Openshift cluster, and using the Druid Delta Lake extension(https://github.com/apache/druid/tree/master/extensions-contrib/druid-deltalake-extensions) to connect and load Delta tables. However, I am running into the following issue, • error while loading with delta connector: only 1024 records of each constituent parquet file(each partition of the delta table) is loaded Also, there is an an error on the UI as soon as the load is over • ERROR: Request failed with status code 404 For your reference, here is the query I am using to load, REPLACE INTO "table" OVERWRITE ALL WITH "ext" AS ( SELECT * FROM TABLE( EXTERN( '{"type":"delta","tablePath":"path"}', '{"type":"parquet"}' ) ) EXTEND ("col1" VARCHAR, "col2" VARCHAR, "col3" VARCHAR, "col4" BIGINT, "col5" VARCHAR, "col6" VARCHAR, "col7" BIGINT, "col8" VARCHAR, "col9" VARCHAR) ) SELECT MILLIS_TO_TIMESTAMP("dop" * 1000) AS "__time", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8" FROM "ext" PARTITIONED BY DAY