# troubleshooting
  • Rohen (06/09/2025, 5:53 AM)
    Has anyone implemented druid-security with Druid deployed on EKS using Helm?
  • Rushikesh Bankar (06/09/2025, 10:32 AM)
    Hi Team 👋 We recently discovered an issue with the Kubernetes service discovery and created this with more details: https://github.com/apache/druid/issues/18090. To summarize:
    • The current implementation of the Kubernetes service discovery leaves the responsibility to announce or unannounce the node/pod on the node/pod itself.
    • This causes issues when the pod is shut down abruptly due to node-not-ready conditions or node failures from the CSP.
    • We observed 3-4 instances where this happened: the Druid master nodes and the broker continued to detect the faulty pod, which resulted in a monotonically increasing load queue size and choked Jetty threads on the broker because all queries timed out on the faulty historical node. The impact was far more extended than what we see with any ZK-based Druid cluster.
    I am proposing this fix: https://github.com/apache/druid/pull/18089. It has been tested on a Druid cluster by reproducing the abrupt termination. Could you please help me with the review? Thanks! cc: @kfaraz
  • JRob (06/10/2025, 3:19 PM)
    Our calls to sys.segments are taking upwards of 60 seconds on average. Likewise, our Datasources tab in the Console takes an agonizingly long time to load. But I can't understand why it's so slow; our DB stats don't show any issues. The druid_segments table is only 1108 MB in size. From pg_stat_statements:
    Copy code
    query            | SELECT payload FROM druid_segments WHERE used=$1
    calls            | 734969
    total_exec_time  | 1318567198.0990858
    min_exec_time    | 733.308662
    max_exec_time    | 13879.650989
    mean_exec_time   | 1794.0446441947086
    stddev_exec_time | 581.4299142612549
    ----------------------------------------------
    query            | SELECT payload FROM druid_segments WHERE used = $1 AND dataSource = $2 AND ((start < $3 AND "end" > $4) OR (start = $7 AND "end" != $8 AND "end" > $5) OR (start != $9 AND "end" = $10 AND start < $6) OR (start = $11 AND "end" = $12))
    calls            | 4888478
    total_exec_time  | 31912869.00381691
    min_exec_time    | 0.007730999999999999
    max_exec_time    | 2166.647028
    mean_exec_time   | 6.528180960171064
    stddev_exec_time | 25.333075336970094
    ----------------------------------------------
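    For readers hitting the same symptom: the first statement above (used=?) is the periodic metadata poll and returns a row per used segment, so its latency tracks segment count more than table size. A minimal diagnostic/mitigation sketch, assuming a PostgreSQL metadata store with the default druid_segments schema (the index name is illustrative; check existing indexes first, since Druid creates some by default):
    ```sql
    -- How many used-segment rows each poll has to fetch and deserialize.
    SELECT used, COUNT(*) AS segment_rows
    FROM druid_segments
    GROUP BY used;

    -- Existing indexes on the table, to avoid creating duplicates.
    SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'druid_segments';

    -- Illustrative covering index for the two statements shown above
    -- (the used-only poll and the used + dataSource + interval lookups).
    CREATE INDEX IF NOT EXISTS idx_druid_segments_used_ds_interval
        ON druid_segments (used, dataSource, start, "end");
    ```
    If the row count for used=true is in the hundreds of thousands, reducing the number of used segments (e.g. via compaction) tends to help more than indexing, since the poll still has to ship every payload.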
  • Dinesh (06/12/2025, 4:55 AM)
    Hello, there is a problem we have been facing these days. When batch ingestion tasks (index_parallel) are about to complete, the task status first changes to 'None' and eventually the task fails with an error, even though everything looks fine during task execution. Unknown exception / org.apache.druid.rpc.ServiceNotAvailableException: Service [overlord] issued redirect to unknown URL [http://10.XX.XX.18:8081/druid/indexer/v1/tasks] / java.lang.RuntimeException
  • Dinesh (06/12/2025, 5:26 AM)
    It has become a big bottleneck for us. Can someone please guide us on this?
  • Riccardo Sale (06/16/2025, 10:11 AM)
    Hello! Our use case of Druid is particular in that we have thousands of datasources. We recently experienced RDS CPU spikes during metric creation, which were mitigated by modifying druid.audit.manager.maxPayloadSizeBytes. Looking at the coordinator.compaction.config field, we have seen that this JSON payload has grown to over 30 MB and it is still causing slowdowns when queried. As an example, the following query: SELECT payload FROM druid_segments WHERE used=? takes up to three seconds. Any suggestions to solve the above issue? How can we reduce the general size of the payload in coordinator.compaction.config? Would it be possible to write a custom extension for this specific use case? Thanks in advance!
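    For anyone triaging something similar, a small read-only diagnostic sketch (assuming a PostgreSQL metadata store with the default schema, where coordinator configs such as coordinator.compaction.config live in druid_config and payloads are stored as bytea):
    ```sql
    -- Size of each coordinator config blob, largest first; confirms whether
    -- coordinator.compaction.config is really the 30 MB+ outlier.
    SELECT name, octet_length(payload) AS payload_bytes
    FROM druid_config
    ORDER BY payload_bytes DESC;

    -- Per-datasource weight behind the slow
    -- "SELECT payload FROM druid_segments WHERE used=?" statement.
    SELECT dataSource,
           COUNT(*)                   AS used_segments,
           SUM(octet_length(payload)) AS total_payload_bytes
    FROM druid_segments
    WHERE used = true
    GROUP BY dataSource
    ORDER BY total_payload_bytes DESC
    LIMIT 20;
    ```
    With thousands of datasources, both the compaction config (one entry per datasource) and the used-segment payload scan grow with datasource count, so trimming per-datasource compaction entries and keeping the used-segment count down are the usual levers.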
  • Rajesh Gottapu (06/17/2025, 5:19 AM)
    Hi all, Druid is crashing with the exceptions below in the supervisor logs. Any help would be appreciated. Thanks.
    { "timestamp": "2025-06-17T04:52:07.071Z", "exceptionClass": "org.apache.druid.rpc.ServiceClosedException", "message": "org.apache.druid.rpc.ServiceClosedException: Service [index_kafka_zpn_pse_0ed475b16f43ae1_gedegecj] is closed", "streamException": false },
    { "timestamp": "2025-06-17T04:56:18.294Z", "exceptionClass": "org.apache.druid.rpc.ServiceClosedException", "message": "org.apache.druid.rpc.ServiceClosedException: Service [index_kafka_zpn_pse_datasource_ed342e7ec84bbb9_hjdhclha] is closed", "streamException": false },
    { "timestamp": "2025-06-17T04:56:35.179Z", "exceptionClass": "org.apache.druid.rpc.ServiceClosedException", "message": "org.apache.druid.rpc.ServiceClosedException: Service [index_kafka_zpn_pse_datasource_ed342e7ec84bbb9_klpmnmkc] is closed", "streamException": false }
  • Nir Bar On (06/17/2025, 11:06 AM)
    Hey all, we are working with Druid 25.0.0. The configuration of the auto-compaction task shows this, but in the documentation this field has a default value. Question: what is the meaning of the "legacy" setting? Is inputSegmentSizeBytes deprecated, used, or not used under the hood for compaction?
  • Nir Bar On (06/17/2025, 12:30 PM)
    A question regarding the MiddleManager / ingestion tasks: can a Druid task be configured to first validate that the disk space on "/var/druid/task" (the directory used for task/segment creation) is within some threshold before starting task execution? I have had some cases where tasks hit an out-of-space exception at the disk level. Can we have some validation of the disk size before a task starts?
  • Nir Bar On (06/17/2025, 12:48 PM)
    On a compaction task I discovered that, at some point in time, the "/var/druid/task" directory occupies 1.7 GB on disk. Can I reduce the maximum disk space a compaction task uses by changing some configuration on the compaction task?
  • Cristi Aldulea (06/18/2025, 7:46 AM)
    Hi all, I'm working with Apache Druid and have introduced a second timestamp column called ingestionTimestamp to support a deduplication job. Additionally, I have a column named tags, which is a multi-value VARCHAR column. The deduplication is performed using an MSQ (Multi-Stage Query) like the following:
    Copy code
    REPLACE INTO "target-datasource" 
    OVERWRITE 
    WHERE "__time" >= TIMESTAMP'__MIN_TIME' 
      AND "__time" < TIMESTAMP'__MAX_TIME'
    
    SELECT 
        __time,
        LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
        LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
        LATEST_BY("tagSetA", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetA",
        LATEST_BY("tagSetB", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetB",
        MAX("ingestionTimestamp") AS ingestionTimestamp
    FROM "target-datasource"
    WHERE "__time" >= TIMESTAMP'__MIN_TIME' 
      AND "__time" < TIMESTAMP'__MAX_TIME'
    GROUP BY 
        __time, 
        "entityUID"
    PARTITIONED BY 'P1M';
    Problem: After running this query, the tags-like columns (tagSetA, tagSetB) are no longer in a multi-value format. This breaks downstream queries that rely on the multi-value nature of these columns.
    My understanding: MSQ might not support preserving multi-value columns directly, especially when using functions like LATEST_BY.
    Question: How can I run this kind of deduplication query while preserving the multi-value format of these columns? Is there a recommended approach or workaround in Druid to handle this scenario? Can someone help us with this problem, please?
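    One possible direction (an untested sketch, not an official recommendation): round-trip each multi-value column through a delimited VARCHAR so that LATEST_BY aggregates a scalar string, then split it back into a multi-value column with the documented MV_TO_STRING / STRING_TO_MV functions. It assumes the tag values never contain the chosen delimiter.
    ```sql
    -- Fragment of the SELECT list above, rewritten for the tag columns only
    -- (untested; column names are the ones from the question). The optional
    -- third argument of LATEST_BY raises the default 1024-byte string limit.
    STRING_TO_MV(
        LATEST_BY(MV_TO_STRING("tagSetA", ','), MILLIS_TO_TIMESTAMP("ingestionTimestamp"), 8192),
        ','
    ) AS "tagSetA",
    STRING_TO_MV(
        LATEST_BY(MV_TO_STRING("tagSetB", ','), MILLIS_TO_TIMESTAMP("ingestionTimestamp"), 8192),
        ','
    ) AS "tagSetB",
    ```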
  • Vaibhav (06/18/2025, 7:16 PM)
    Hi all, I'm running into an issue with range-based partitioning (Druid 27.0) during compaction on one of our heaviest datasources and would appreciate input from the community.
    Context:
    • The datasource is ingested via Kafka indexing (stream ingestion).
    • Daily volume: ~4 billion rows / ~110 GB uncompressed data.
    • Ingested with HOUR granularity, resulting in ~5,000 segments per day.
    • We run daily compaction with range partitioning on 5 dimensions.
    • The compaction task uses 8 parallel subtasks with 4 GB heap each.
    Issue:
    • Compaction fails during the final segment merge phase.
    • The first failure was a heap OOM, which was resolved by increasing the task heap from 3 GB to 4 GB.
    • Now we are getting the following error:
    Copy code
    org.apache.druid.java.util.common.IAE: Asked to add buffers[2,454,942,764] larger than configured max[2,147,483,647]
    at org.apache.druid.java.util.common.io.smoosh.FileSmoosher.addWithSmooshedWriter(FileSmoosher.java:168)
    • On investigation: compaction produces 430 partitions, but the 430th partition (with end=null) gets an unusually high number of rows (~800M+).
    What I found:
    • A GROUP BY on the 5 range dimensions for a sample day gives ~11.5k unique combinations:
    Copy code
    e.g. (sketch; <datasource> and the 1-day interval bounds are placeholders):
    SELECT range_dim1, range_dim2, range_dim3, range_dim4, range_dim5, COUNT(*) AS row_count
    FROM <datasource>
    WHERE __time >= <interval start> AND __time < <interval end>
    GROUP BY 1, 2, 3, 4, 5
    ORDER BY 1, 2, 3, 4, 5
    • However, partition 430 gets all combinations from ~9.5k to ~11.5k in one partition.
    • This violates the targetRowsPerSegment: 5M and maxRowsPerSegment: 7.5M config.
    Questions:
    • Are there better strategies to ensure partitioning respects the row count limits?
    • Is this behavior a bug or expected?
    Any advice or insights appreciated.
  • Lionel Mena (06/20/2025, 9:41 AM)
    Hello all, I have a question regarding streaming ingestion tasks. Once the taskDuration is completed and the supervisor starts rolling the tasks, most of the tasks take around 2 or 3 minutes to actually finish. The supervisor throughput fluctuates between 1M - 8MB messages/sec. Is this normal? I see some warning messages about retry logic being triggered in the logs, but this accounts for only around 10 seconds. I'm attaching a log of one of the tasks.
    realtimet_tasklog.txt
  • Stefanos Pliakos (06/25/2025, 11:56 AM)
    Hello! I am trying to set up the DataInfraHQ Druid Operator in a Kubernetes cluster. I have configured TLS according to the CRD, which seems OK. Furthermore, I have enabled readiness probes for the discrete node types (e.g. brokers, coordinators, etc.), e.g.:
    Copy code
    readinessProbe:
            httpGet:
              path: /status/health
              port: 8082
              scheme: HTTPS
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
    However, this way the startup probe fails; an example for the coordinators:
    Copy code
    Warning  Unhealthy  40s (x2 over 50s)  kubelet            Startup probe failed: Get "<http://172.31.75.247:8081/status/health>": dial tcp 172.31.75.247:8081: connect: connection refused
      Warning  Unhealthy  0s (x4 over 29s)   kubelet            Startup probe failed: Get "<http://172.31.75.247:8081/status/health>": net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x15\x03\x03\x00\x02\x02P"
    In the CRD there is no configuration for startupProbe, so I can't even change the scheme to HTTPS. Any insights on this? Or do I have to disable the TLS configuration entirely? My configuration is:
    Copy code
    druid.enablePlaintextPort=false
        druid.enableTlsPort=true
        druid.client.https.protocol=TLSv1.2
        druid.client.https.trustStorePath=/opt/druid/conf/druid/cluster/_common/tls/truststore.jks
        druid.client.https.trustStorePassword=${env:TRUSTSTORE_PASSWORD}
        druid.client.https.trustStoreType=jks
        druid.server.https.keyStorePath=/opt/druid/conf/druid/cluster/_common/tls/keystore.jks
        druid.server.https.keyStorePassword=${env:KEYSTORE_PASSWORD}
        druid.server.https.keyStoreType=jks
  • Nir Bar On (06/25/2025, 12:14 PM)
    Hi, I want to enable org.apache.druid.server.metrics.TaskCountStatsMonitor. The question is: on which component(s) of the Druid stack can it be enabled?
  • Nir Bar On (06/25/2025, 2:13 PM)
    Is "indexer" the same as "middle-manager", or is each of them a different component? If so, what is the main difference between the Indexer and the MiddleManager?
  • Nir Bar On (06/25/2025, 8:56 PM)
    Hello, we currently have Druid 25.0.0 and are considering upgrading to a newer version. 1. What is the recommended version to upgrade to from 25.0.0, or can I pick the latest one? 2. Regarding coordinator metadata schema changes during the upgrade: if I set "druid.metadata.storage.connector.createTables=true" against the current 25.0.0 schema and spin up the new coordinator version, will it resolve the schema differences and apply them to the DB schema automatically with no risk to the current data, or is it safer to build a new schema for the upgraded version with empty tables and only then migrate the data from the old schema to the new one?
  • apurav sharma (06/25/2025, 11:42 PM)
    @here I run into the following error while running a sample task in the Druid UI to send data to deep storage (S3):
    Jun 25 15:44:35 127.0.0.1 java.lang.RuntimeException: com.amazonaws.services.s3.model.AmazonS3Exception: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'eu-central-1' (Service: Amazon S3; Status Code: 400; Error Code: AuthorizationHeaderMalformed;
    Imply was deployed on AWS EKS via the Helm chart. Does anyone have any clue? I'm not sure if a change is required in values.yaml to specify the region; if so, what exactly goes into that parameter? Or is it related to IAM permissions?
  • Nir Bar On (06/29/2025, 12:25 PM)
    Hi, a question regarding the compaction REST API payload (druid/coordinator/v1/config/compaction): 1. The field maxRowsPerSegment appears in the root payload and also under partitionsSpec. Which should we use, and do both values need to be submitted on compaction task creation? 2. The field maxTotalRows appears in the tuningConfig block and also in the partitionsSpec block. Which should we use, and do both values need to be submitted on compaction task creation?
  • Neeraj Pmk (07/02/2025, 12:18 PM)
    Hi team, I am facing a GlueCatalog ClassNotFoundException issue as mentioned here (https://github.com/apache/druid/issues/18015) when using Iceberg as the input source. From the comments, I see that the fix will be available in the next release. Can anyone please let me know when the next release (Druid 34.0.0) is planned?
  • PHP Dev (07/03/2025, 8:29 AM)
    Hi Team, I'm trying to enable the S3 type for indexing-log storage, which should be available according to the official documentation, but I am getting this error:
    Copy code
    Unknown provider [s3] of Key[type=org.apache.druid.tasklogs.TaskLogs]
    s3-druid-extensions is loaded. What could be wrong?
  • Amperio Romano (07/07/2025, 12:46 PM)
    Hello, I have some Kafka ingestion data where, in the supervisor, I create a column by concatenating 2 columns:
    Copy code
    transformSpec: {
        transforms: [
            {
                type: 'expression',
                name: 'col1_and_col2_virtual',
                expression: "concat(col1, '-',  col2)"
            }
        ]
    },
    and then I create an HLL datasketch as a metric in the same Kafka ingestion to have it pre-aggregated, so that I can count the distinct instances of it quickly:
    Copy code
    metricsSpec: [
        {
            name: 'col1_and_col2_hll',
            type: 'HLLSketchBuild',
            fieldName: 'col1_and_col2_virtual',
            lgK: 12,
            tgtHllType: 'HLL_4',
            round: true
        }
    ]
    col1_and_col2_virtual is not in the dimensions, so it is not stored, and everything looks good: it creates the col1_and_col2_hll correctly. Both col1 and col2 are always filled. The problem is when I try to calculate the number of distinct instances:
    Copy code
    select  
        COUNT(*) as num_of_rows,
        APPROX_COUNT_DISTINCT_DS_HLL(col1_and_col2_hll) as hll_estimate
    from "my_datasource"
    hll_estimate is greater than the num_of_rows, which sounds really strange to me. I know that it is an estimation, but estimating it more than the total is surprising. Am I doing something wrong? Thanks.
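    One thing worth checking before assuming a bug (a hedged observation, not a definitive diagnosis): COUNT(*) counts the stored, rolled-up rows, while the HLL sketch keeps accumulating distinct concat(col1, '-', col2) values from the original input rows. With rollup enabled, several input rows collapse into one stored row, so the distinct-count estimate can legitimately exceed COUNT(*). A quick comparison sketch, assuming a "count" rollup metric was defined at ingestion (that metric name is an assumption):
    ```sql
    -- Compare stored rows, pre-rollup input rows, and the sketch estimate.
    SELECT
        COUNT(*)                                        AS stored_rows,
        SUM("count")                                    AS input_rows,   -- assumes a 'count' rollup metric exists
        APPROX_COUNT_DISTINCT_DS_HLL(col1_and_col2_hll) AS hll_estimate
    FROM "my_datasource"
    ```
    If input_rows is well above hll_estimate, the sketch is behaving as expected and only the comparison baseline was off.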
  • Milad (07/07/2025, 5:10 PM)
    Hello: I'm new to sketches and I am trying to compute quantiles from a KLL Doubles Sketch. I ingested some test data and I see COMPLEX<KllDoublesSketch> as the type on my sketch column. When I try to compute the median using APPROX_QUANTILE_DS I get this error from a SQL query:
    Copy code
    Error: RUNTIME_FAILURE (OPERATOR)
    
    class org.apache.datasketches.kll.KllDirectDoublesSketch$KllDirectCompactDoublesSketch cannot be cast to class org.apache.datasketches.quantiles.DoublesSketch (org.apache.datasketches.kll.KllDirectDoublesSketch$KllDirectCompactDoublesSketch and org.apache.datasketches.quantiles.DoublesSketch are in unnamed module of loader 'app')
    
    java.lang.ClassCastException
    
    Host: localhost:8083
    I tried to run a query using the native query language and I get a slightly different error:
    Copy code
    Error: undefined
    
    Please make sure to load all the necessary extensions and jars with type 'kllDoublesSketchMerge' on 'druid/router' service. Could not resolve type id 'kllDoublesSketchMerge' as a subtype of `org.apache.druid.query.aggregation.AggregatorFactory` known type ids = [HLLSketch, HLLSketchBuild, HLLSketchMerge, KllDoublesSketch, KllDoublesSketchMerge, KllFloatsSketch, KllFloatsSketchMerge, arrayOfDoublesSketch, cardinality, count, doubleAny, doubleFirst, doubleLast, doubleMax, doubleMean, doubleMin, doubleSum, expression, filtered, floatAny, floatFirst, floatLast, floatMax, floatMin, floatSum, grouping, histogram, hyperUnique, javascript, longAny, longFirst, longLast, longMax, longMin, longSum, passthrough, quantilesDoublesSketch, quantilesDoublesSketchMerge, singleValue, sketchBuild, sketchMerge, stringAny, stringFirst, stringFirstFold, stringLast, stringLastFold, thetaSketch] (for POJO property 'aggregations') at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 1, column: 141] (through reference chain: org.apache.druid.query.timeseries.TimeseriesQuery["aggregations"]->java.util.ArrayList[0])
    
    com.fasterxml.jackson.databind.exc.InvalidTypeIdException
    I noticed that the error mentions kllDoublesSketchMerge while the known type ids have a capital K at the front, so maybe I did something wrong during ingestion? My ingestion command was:
    Copy code
    REPLACE INTO "sketch-kll-test"
    OVERWRITE ALL
    WITH "ext" AS (
      SELECT * FROM TABLE (
        EXTERN(
          '{"type":"local","files":["/sketch-test.csv"]}',
          '{"type":"csv","findColumnsFromHeader":true}'
        )
      ) EXTEND (
        "Customer ID" BIGINT,
        "Customer Name" VARCHAR,
        "Library ID" BIGINT,
        "Library Name" VARCHAR,
        "Sketch" VARCHAR
      )
    )
    SELECT
      TIME_PARSE('2025-06-25') AS "__time",
      "Customer ID",
      "Customer Name",
      "Library ID",
      "Library Name",
      DECODE_BASE64_COMPLEX('KllDoublesSketch', "Sketch") AS "Sketch"
    
    FROM "ext"
    PARTITIONED BY DAY
    I've included a screenshot showing all the plugins and I think I have those loaded correctly. Everything seems to work with the normal DoublesSketch but I can't get the KLL sketch to work. Thank You
  • PHP Dev (07/09/2025, 1:49 PM)
    Hi team, I'm trying to migrate from local to S3 deep storage. I changed the storage type and copied the segments to S3 manually, then turned off the local mounts and restarted the cluster. After the restart, ingestion and querying work fine, but, for example, a compaction task failed because of missing segments. As I understand it, my metadata storage contains the wrong type and path for the older segments. How can it be fixed?
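    For anyone in the same situation: the segment location lives in the loadSpec inside each row's payload in druid_segments, so rows written before the switch will still point at the local path. A read-only inspection sketch, assuming a PostgreSQL metadata store where payload is a bytea containing JSON (adjust for MySQL):
    ```sql
    -- Count used segments per loadSpec type to see how many rows still
    -- reference the old "local" deep storage.
    SELECT convert_from(payload, 'UTF8')::json -> 'loadSpec' ->> 'type' AS load_spec_type,
           COUNT(*) AS segments
    FROM druid_segments
    WHERE used = true
    GROUP BY 1;
    ```
    Actually rewriting those loadSpec entries is a separate, riskier step; take a metadata-store backup before changing any payloads.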
  • Przemek (07/09/2025, 4:17 PM)
    Hello! I am trying to upgrade my Druid version from 30.0.1 to 33.0.0 but have issues with index_kafka tasks (the cluster is deployed on Kubernetes without the druid-kubernetes-extensions and druid-kubernetes-overlord-extensions extensions). All Kafka ingestion tasks end with failures like Task [index_kafka_XYZ] failed to return start time, killing task. When I test my ingestion spec in the data loader, it is able to connect to the Kafka topic without any problems. I checked the logs and found warnings on the coordinator like:
    Copy code
    2025-07-08T15:35:06,163 INFO [KafkaSupervisor-Golf_Lakehouse_Commentary_Feed_v2-Worker-0] org.apache.druid.rpc.ServiceClientImpl - Service [index_kafka_XYZ] request [GET <http://10.5.176.64:8101/druid/worker/v1/chat/index_kafka_XYZ/time/start>] encountered exception on attempt #1; retrying in 2,000 ms (org.jboss.netty.channel.ChannelException: Faulty channel in resource pool)
    ...
    up to 8 tries and then
    
    2025-07-08T15:36:10,172 WARN [KafkaSupervisor-Golf_Lakehouse_Commentary_Feed_v2] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Task [index_kafka_XYZ] failed to return start time, killing task (org.apache.druid.rpc.RpcException: Service [index_kafka_XYZ] request [GET http://.../druid/worker/v1/chat/index_kafka_XYZ/time/start] encountered exception on attempt #9)
    I also saw some RejectedExecutionException errors, not sure if they are correlated:
    Copy code
    java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@71ade902[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@4726a5f5[Wrapped task = CallbackListener{org.apache.druid.server.coordination.ChangeRequestHttpSyncer$1@38bb1199}]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@2e9e06a1[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 109941]
    I found that peons are not even created, although all 3 of my middle_managers show that only ~4 of 11 slots are used. When I redeployed the coordinator, it started processing 2 of my index_kafka tasks (and 2 peons were visible), but the rest of them were still failing. When those finished, no new peons were visible and no new tasks ended with success. What can be the reason for that? I tried version 31.0.2 as well and the result was the same. Did something change with version 31+?
  • Shanmugaraja (07/10/2025, 6:21 AM)
    Hello, I'm struggling with a Kafka ingestion spec in Apache Druid to flatten a nested JSON array. My Kafka messages have a structure like { "pda": { "fields": ["customer", "name", "age", "dob"], "Data": [ ["xyz1", "John", "27", "02/15/1992"], ["abc1", "William", "56", "05/02/1988"], ... ] } }. I want each row in pda.Data to be a separate row in the datasource with columns company, customer, name, age, and dob. I'm using Druid 31. Any advice on a flattenSpec or transformSpec to unnest pda.Data correctly? Thanks!
  • John Kowtko (07/10/2025, 2:17 PM)
    I haven't tried this, but my thought is to ingest Data as an array field and then unnest it. I don't think this can be done via a streaming supervisor, but you may be able to do it in one step using MSQ batch ingestion.
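    A rough sketch of that idea (untested; written as an MSQ query over an already-ingested raw datasource rather than the Kafka stream, and assuming "pda" was ingested as a nested JSON column; the datasource and output column names are illustrative):
    ```sql
    -- Explode pda.Data (an array of arrays) into one output row per inner
    -- array, then pull out the positional fields listed in pda.fields.
    REPLACE INTO "pda_flat" OVERWRITE ALL
    SELECT
        "__time",
        JSON_VALUE(data_row, '$[0]') AS "customer",
        JSON_VALUE(data_row, '$[1]') AS "name",
        JSON_VALUE(data_row, '$[2]') AS "age",
        JSON_VALUE(data_row, '$[3]') AS "dob"
    FROM "pda_raw"
    CROSS JOIN UNNEST(JSON_QUERY_ARRAY("pda", '$.Data')) AS t (data_row)
    PARTITIONED BY DAY
    ```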
  • Cristian Daniel Gelvis Bermudez (07/10/2025, 6:31 PM)
    Hello everyone, I'm trying to extract data from deep storage with a query to the /druid/v2/sql/statements/ endpoint. The task runs fine, but at the end, the following error occurs, preventing me from extracting the query response. { "error": "druidException", "errorCode": "notFound", "persona": "USER", "category": "NOT_FOUND", "errorMessage": "Query [query-9578562a-94f0-452d-998a-e66e0f7d0ff5] was not found. The query details are no longer present or might not be of the type [query_controller]. Verify that the id is correct.", "context": {} } Does anyone know why this happens?
  • Victoria (07/11/2025, 3:24 AM)
    Hey everyone. Maybe someone can help me figure out the numConnections value on the broker. According to the Apache Druid documentation, this value across all brokers should be slightly lower than the HTTP threads across all historicals and tasks, and also slightly lower than the httpThreads configured on the same broker. We are setting up a tiered cluster; for example, consider this topology: 1 master node, 1 query node, 3 hot data nodes (only historicals), 2 warm data nodes (only historicals), 1 default data node (historical + middleManager). Now, if each data node has 2 CPUs (just for the example), the httpThreads would be 40 on each node (the formula is max(10, CPU*17/16 + 2) + 30). Should we then consider all the current nodes and the httpThreads configured on historicals and the middleManager to calculate the broker numConnections? If yes, then numConnections would be something like: 40*3 nodes + 40*2 nodes + 40*2 (historical + middleManager) = 280. Let's assume we take 270. Then the httpThreads for the broker should be slightly higher than 270, say 280. With that many threads we would need 234 CPUs! That's a huge machine just to handle the data nodes. I understand that something is off, either in the docs or in my understanding of how those connections fan out to the data nodes, so this is where I'm struggling to size the broker machine correctly.
  • sandy k (07/11/2025, 1:24 PM)
    What causes segments to be published to the metadata store without being loaded on historicals? We have 4 segments where is_active is true but is_available is false. How can we recover without re-ingestion? What causes this condition? The segments are marked as used, are present in the metadata store, and the S3 deep storage path exists with index.zip. Segment id: FCT_REQ_2025-07-10T040000.000Z_2025-07-10T050000.000Z_2025-07-10T063511.787Z_91. In deep storage I can see in the S3 bucket: 560001781 0/index.zip, 509947770 1/index.zip, 27781 10/0374285f-64c7-4e20-9ced-286e0117ac36/index.zip, 27738 100/0061ad35-a3a3-46ce-9abc-3c46cbecbe46/index.zip, 25359 101/f0a7c48a-5dd3-4571-acb1-863a47f8314e/index.zip. But this segment is not found on any historical server using a sys.server_segments query. SegmentLoadingException: Failed to load segment in all locations. HttpLoadQueuePeon - Server request[LOAD] with cause [SegmentLoadingException: Exception loading segment].
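    A first diagnostic step that may help here (a sketch; the datasource name is inferred from the truncated segment id, so adjust the filters): compare what the coordinator believes about the segment with what the historicals report, then check the historical logs for the underlying SegmentLoadingException cause (local segment-cache disk space, S3 permissions, corrupt index.zip, etc.).
    ```sql
    -- Coordinator's view of the segment: published/available/replica counts.
    SELECT segment_id, is_published, is_available, is_active, is_overshadowed,
           num_replicas, replication_factor, "size"
    FROM sys.segments
    WHERE datasource = 'FCT_REQ'
      AND segment_id LIKE '%2025-07-10T04%';

    -- Which servers (if any) actually hold it.
    SELECT server, segment_id
    FROM sys.server_segments
    WHERE segment_id LIKE '%2025-07-10T04%';
    ```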