This message was deleted Apache Druid #troubleshooting

# troubleshooting

Slackbot

06/27/2023, 8:03 AM

This message was deleted.

Sergio Ferragut

06/27/2023, 5:52 PM

Could you also share the Overlord log from around that time? Segment allocation is done by the Overlord, so that might tell us something. Something I noticed is that the timestamp it is loading is from 20 days ago... that might or might not be relevant. The task failures in the MiddleManager logs (code 137) point to the tasks running out of memory.

Anant Sharma

06/28/2023, 7:25 AM

I've fixed a few things, like the segment granularity and the resources, which were causing the 137 errors on the MiddleManager. Now most of the tasks are in a running state, but I can still see this error in the task status:

{
  "id": "index_kafka_eber_gateways_sensors_data_5a0ebe22f44a3f5_cepalohc",
  "groupId": "index_kafka_eber_gateways_sensors_data",
  "type": "index_kafka",
  "createdTime": "2023-06-28T06:53:01.032Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": -1,
  "location": {
    "host": "10.101.60.160",
    "port": 8100,
    "tlsPort": -1
  },
  "dataSource": "eber_gateways_sensors_data",
  "errorMsg": "The worker that this task was assigned disappeared and did not report cleanup within timeout[PT15M]...."
}

And I can see errors on my Coordinator node as well:

2023-06-28T04:05:45,454 ERROR [qtp1286172885-118] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T11:15:20.599Z/2023-06-27T11:15:20.600Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-28T04:48:58,796 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Tier[_default_tier] has no servers! Check your cluster configuration!: {class=org.apache.druid.server.coordinator.rules.LoadRule}

Amatya Avadhanula

06/28/2023, 7:32 AM

https://apachedruidworkspace.slack.com/archives/C0303FDCZEZ/p1687937064821759?thread_ts=1687788310.124809&cid=C0303FDCZEZ - Is this thread on #C0303FDCZEZ for a different issue?

Anant Sharma

06/28/2023, 7:44 AM

It's for the same issue.

Anant Sharma

06/28/2023, 7:45 AM

"skipOffsetFromLatest": "PT1H",

Anant Sharma

06/28/2023, 7:46 AM

this is what i have changed

Anant Sharma

06/28/2023, 8:17 AM

@Amatya Avadhanula

Amatya Avadhanula

06/28/2023, 9:25 AM

2023-06-28T04:48:58,796 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Tier[_default_tier] has no servers! Check your cluster configuration!: {class=org.apache.druid.server.coordinator.rules.LoadRule}

There is a Historical now, so I don't think this error should be present anymore. Could you confirm whether the Historical recently restarted?

Amatya Avadhanula

06/28/2023, 9:26 AM

"skipOffsetFromLatest": "PT1H",

This is not the segment granularity

Amatya Avadhanula

06/28/2023, 9:29 AM

Please refer to: https://druid.apache.org/docs/latest/ingestion/ingestion-spec.html#granularityspec

Anant Sharma

06/28/2023, 11:28 AM

I made a few changes to the resources; that's why it got restarted 25 hours ago.

I also recently set up compaction, which is where I put the above-mentioned "skipOffsetFromLatest": "PT1H".

Before, it was a week; I've changed it to an hour. Below is the supervisor config, where we have defined the segment granularity as MONTH. Are you sure this is because of the granularity? The exact same setup is running in a different environment and works fine. What could be causing the issue here?

{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "eber_vehicle_components_status",
      "timestampSpec": {
        "column": "timestamp",
        "format": "millis",
        "missingValue": null
      },
      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "string",
            "name": "gateway_id",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "value",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "float",
            "name": "id",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": false
          }
        ],
        "dimensionExclusions": [
          "__time",
          "timestamp"
        ],
        "includeAllDimensions": false
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MONTH",
        "queryGranularity": {
          "type": "none"
        },
        "rollup": true,
        "intervals": []
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "topic": "eber.vehicle.components.status.qc",
      "inputFormat": {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": []
        },
        "keepNullColumns": true,
        "assumeNewlineDelimited": false,
        "useJsonNodeReader": false
      },
      "replicas": 1,
      "taskCount": 1,
      "taskDuration": "PT86400S",
      "consumerProperties": {
        "bootstrap.servers": "qc-kafka-kafka-bootstrap.kafka.svc.cluster.local:9092,",
        "security.protocol": "SASL_PLAINTEXT",
        "sasl.mechanism": "SCRAM-SHA-512",
        "sasl.jaas.config": "org.apache.kafka.common.security.scram.ScramLoginModule required username='admin-etm-qc' password='xxx';",
        "auto.offset.reset": "earliest"
      },
      "autoScalerConfig": null,
      "pollTimeout": 100,
      "startDelay": "PT5S",
      "period": "PT30S",
      "useEarliestOffset": true,
      "completionTimeout": "PT1800S",
      "lateMessageRejectionPeriod": null,
      "earlyMessageRejectionPeriod": null,
      "lateMessageRejectionStartDateTime": null,
      "configOverrides": null,
      "idleConfig": null,
      "stream": "eber.vehicle.components.status.qc",
      "useEarliestSequenceNumber": true,
      "type": "kafka"
    },
    "tuningConfig": {
      "type": "kafka",
      "appendableIndexSpec": {
        "type": "onheap",
        "preserveExistingMetrics": false
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "skipBytesInMemoryOverheadCheck": false,
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": null,
      "intermediatePersistPeriod": "PT10M",
      "maxPendingPersists": 0,
      "indexSpec": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": {
          "type": "utf8"
        },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": {
          "type": "utf8"
        },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "reportParseExceptions": false,
      "handoffConditionTimeout": 0,
      "resetOffsetAutomatically": false,
      "segmentWriteOutMediumFactory": null,
      "workerThreads": null,
      "chatThreads": null,
      "chatRetries": 8,
      "httpTimeout": "PT10S",
      "shutdownTimeout": "PT80S",
      "offsetFetchPeriod": "PT30S",
      "intermediateHandoffPeriod": "P2147483647D",
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0,
      "skipSequenceNumberAvailabilityCheck": false,
      "repartitionTransitionDuration": "PT120S"
    }
  },
  "context": null
}

Amatya Avadhanula

06/28/2023, 11:33 AM

Could you share the compaction spec as well?

Anant Sharma

06/28/2023, 11:35 AM

sure

Anant Sharma

06/28/2023, 11:35 AM

{
  "dataSource": "eber_vehicle_components_status",
  "taskPriority": 80,
  "inputSegmentSizeBytes": 100000000000000,
  "maxRowsPerSegment": null,
  "skipOffsetFromLatest": "PT1H",
  "tuningConfig": {
    "maxRowsInMemory": null,
    "appendableIndexSpec": null,
    "maxBytesInMemory": null,
    "maxTotalRows": null,
    "splitHintSpec": null,
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": null
    },
    "indexSpec": null,
    "indexSpecForIntermediatePersists": null,
    "maxPendingPersists": null,
    "pushTimeout": null,
    "segmentWriteOutMediumFactory": null,
    "maxNumConcurrentSubTasks": null,
    "maxRetry": null,
    "taskStatusCheckPeriodMs": null,
    "chatHandlerTimeout": null,
    "chatHandlerNumRetries": null,
    "maxNumSegmentsToMerge": null,
    "totalNumMergeTasks": null,
    "maxColumnsToMerge": null,
    "type": "index_parallel",
    "forceGuaranteedRollup": false
  },
  "granularitySpec": null,
  "dimensionsSpec": null,
  "metricsSpec": null,
  "transformSpec": null,
  "ioConfig": null,
  "taskContext": null
}

Amatya Avadhanula

06/28/2023, 11:41 AM

Strange, my suspicion was that you were using a granularity of WEEK in the compaction spec. Sorry to repeat, but could you please confirm from the segments tab if there are segments with week granularity that already exist for 2023-06-26/2023-07-03?

Anant Sharma

06/28/2023, 2:31 PM

@Amatya Avadhanula can you tell me how to do that? Below is a screenshot of the segments, which might help answer your question.

Amatya Avadhanula

06/28/2023, 5:12 PM

The first few (4) rows have a weekly granularity (refer to the start and end columns of segments). These belong to the datasource eber_v...
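
For anyone who prefers to check this outside the segments tab, here is a minimal Python sketch, assuming the start and end values of each segment have been copied out as ISO 8601 strings. The function name `classify_interval` is hypothetical, and the check only distinguishes Monday-aligned weeks and calendar months; anything else is reported as OTHER.

```python
from datetime import datetime, timedelta


def _parse(ts: str) -> datetime:
    # Druid prints UTC timestamps with a trailing "Z"; older Python
    # versions of fromisoformat() need an explicit offset instead.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def classify_interval(start: str, end: str) -> str:
    """Return "WEEK", "MONTH", or "OTHER" for a segment interval."""
    s, e = _parse(start), _parse(end)
    midnight = s.replace(hour=0, minute=0, second=0, microsecond=0)

    # Week: exactly 7 days long, starting on a Monday at midnight.
    if s == midnight and s.weekday() == 0 and e - s == timedelta(days=7):
        return "WEEK"

    # Month: starts on the 1st at midnight, ends on the 1st of the next month.
    if s == midnight and s.day == 1:
        next_month = (s + timedelta(days=32)).replace(day=1)
        if e == next_month:
            return "MONTH"

    return "OTHER"


if __name__ == "__main__":
    # The segmentInterval from the allocation error earlier in the thread.
    print(classify_interval("2023-06-26T00:00:00.000Z", "2023-07-03T00:00:00.000Z"))
```

Running this over the rows of the segments table should show quickly whether week-sized intervals are mixed in with the MONTH granularity that the supervisor spec asks for.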

Amatya Avadhanula

06/28/2023, 5:13 PM

You could stop ingestion for this datasource, reindex the data for the problematic intervals to monthly granularity and then resume ingestion

Anant Sharma

06/28/2023, 5:19 PM

Any suggested ways to do so?

Anant Sharma

06/28/2023, 5:21 PM

And do you think this is the only issue, or could it be something else too?

Amatya Avadhanula

06/28/2023, 5:23 PM

And do you think this is the only issue, or could it be something else too?

Do you observe any other issues with ingestion for datasources having similar volume + task count? If not, I think this could be the only issue

Anant Sharma

06/28/2023, 5:31 PM

No, I haven't.

Amatya Avadhanula

06/28/2023, 5:32 PM

1. Suspend the supervisor for the datasource. (Three dots next to the magnifying glass icon > suspend)
2. Submit a reindexing task using the native batch ingestion wizard for the interval 2023-06-01/2023-08-01 for the datasource (eber_v...) and follow the wizard. Just change the segment granularity from week to month
3. Resume the supervisor for the datasource
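
The same three steps can also be driven through the Overlord HTTP API instead of the web console. The following is a rough Python sketch, not a tested procedure: the helper names `supervisor_path` and `reindex_task` are hypothetical, the supervisor id is assumed to match the datasource name, and the dimension list is copied from the supervisor spec posted earlier in the thread. The sketch only builds the request paths and the task payload; sending them to your Overlord (with whatever authentication your cluster needs) is left out.

```python
import json

TASK_PATH = "/druid/indexer/v1/task"  # POST a task spec here to submit it


def supervisor_path(supervisor_id: str, action: str) -> str:
    """Path to POST to in order to suspend or resume a supervisor."""
    if action not in ("suspend", "resume"):
        raise ValueError(f"unsupported action: {action}")
    return f"/druid/indexer/v1/supervisor/{supervisor_id}/{action}"


def reindex_task(datasource: str, interval: str, dimensions: list) -> dict:
    """Build a native batch task that rereads a datasource from Druid
    and rewrites the given interval with MONTH segment granularity."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Reindexing reads the existing __time column.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": dimensions},
                "metricsSpec": [],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "MONTH",  # the only intended change
                    "queryGranularity": {"type": "none"},  # left as it was
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "druid",
                    "dataSource": datasource,
                    "interval": interval,
                },
                "appendToExisting": False,  # replace, do not append
            },
            "tuningConfig": {
                "type": "index_parallel",
                "partitionsSpec": {"type": "dynamic"},
            },
        },
    }


if __name__ == "__main__":
    ds = "eber_vehicle_components_status"
    dims = ["gateway_id", "value", {"type": "float", "name": "id"}]
    print("1. POST", supervisor_path(ds, "suspend"))
    print("2. POST", TASK_PATH)
    print(json.dumps(reindex_task(ds, "2023-06-01/2023-08-01", dims), indent=2))
    print("3. POST", supervisor_path(ds, "resume"), "(after the task succeeds)")
```

As in the console flow, the supervisor should only be resumed once the reindexing task has finished successfully.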

Anant Sharma

06/28/2023, 5:33 PM

let me try right now

Anant Sharma

06/28/2023, 5:39 PM

@Amatya Avadhanula am i at the right place ?

Anant Sharma

06/28/2023, 5:40 PM

I can't see the dates that you suggested!

Amatya Avadhanula

06/28/2023, 5:41 PM

The supervisor cannot be used to reindex

Amatya Avadhanula

06/28/2023, 5:42 PM

Load data > Start a new batch spec > Reindex from druid

Anant Sharma

06/28/2023, 5:47 PM

Do I have to add month here?

Anant Sharma

06/28/2023, 5:48 PM

After entering the dates, I clicked Next twice.

Amatya Avadhanula

06/28/2023, 5:48 PM

No, this is query granularity. Please don't modify it

Anant Sharma

06/28/2023, 5:50 PM

OK, I didn't 😅 Is this the right place?

Amatya Avadhanula

06/28/2023, 5:50 PM

Yes!

Anant Sharma

06/28/2023, 5:51 PM

What should the Partitioning type and Time intervals be?

Anant Sharma

06/28/2023, 5:51 PM

It doesn't allow me to move forward.

Amatya Avadhanula

06/28/2023, 5:51 PM

Partitioning type -> dynamic (for now). intervals can be left blank

Anant Sharma

06/28/2023, 5:54 PM

For Max rows per segment and Max total rows, can I use the old values, or should we use new ones?

Amatya Avadhanula

06/28/2023, 5:54 PM

defaults should be fine

Anant Sharma

06/28/2023, 5:54 PM

okay

Anant Sharma

06/28/2023, 5:57 PM

Should this be the default as well? I'm sorry to trouble you, I just want to be confident about it.

Amatya Avadhanula

06/28/2023, 5:57 PM

Yes, the defaults would work since there isn't a lot of data yet

Anant Sharma

06/28/2023, 5:59 PM

I think this is the last one. Should this be the default too?

Amatya Avadhanula

06/28/2023, 5:59 PM

Yes

Anant Sharma

06/28/2023, 6:03 PM

I've submitted it, and it went back to the same page with the same spec, and the supervisor is still suspended. Have we missed anything?

Amatya Avadhanula

06/28/2023, 6:04 PM

No, please check the task's status

Amatya Avadhanula

06/28/2023, 6:05 PM

After the task succeeds, you can resume the supervisor

Anant Sharma

06/28/2023, 6:06 PM

It failed.

Amatya Avadhanula

06/28/2023, 6:07 PM

Error?

Anant Sharma

06/28/2023, 6:09 PM

In the MiddleManager logs:

2023-06-28T18:01:36,648 ERROR [forking-task-runner-13] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited with code[137] for task: index_parallel_eber_vehicle_components_status_beemdiaf_2023-06-28T18:01:17.205Z

But there are only 3 pods running and I've set the maximum replicas to 5, so it doesn't seem to be an issue with resources.

Anant Sharma

06/28/2023, 6:10 PM

{
  "id": "index_parallel_eber_vehicle_components_status_beemdiaf_2023-06-28T18:01:17.205Z",
  "groupId": "index_parallel_eber_vehicle_components_status_beemdiaf_2023-06-28T18:01:17.205Z",
  "type": "index_parallel",
  "createdTime": "2023-06-28T18:01:17.206Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": 19295,
  "location": {
    "host": "10.101.42.175",
    "port": 8101,
    "tlsPort": -1
  },
  "dataSource": "eber_vehicle_components_status",
  "errorMsg": "Task execution process exited unsuccessfully with code[137]. See middleManager logs for more details..."
}

Amatya Avadhanula

06/28/2023, 6:10 PM

Code 137 is generally due to memory limits
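
For background on why 137 points at memory: on Linux, an exit code above 128 conventionally means the process was terminated by signal number (code - 128), and 137 - 128 = 9 is SIGKILL, which is what the kernel OOM killer or a container runtime enforcing a memory limit sends. A small Python sketch of that decoding, assuming a POSIX system (the helper name `describe_exit_code` is hypothetical):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a process exit code using the 128 + signal-number convention."""
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {code - 128}"
        return f"terminated by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

SIGKILL can come from other sources too (for example, a manual `kill -9`), which is why the code is only a strong hint of a memory limit rather than proof.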

Anant Sharma

06/28/2023, 6:16 PM

let me check

Anant Sharma

06/28/2023, 6:19 PM

It's really low.

Anant Sharma

06/28/2023, 6:26 PM

I just increased the resources to verify that this is not the issue!

Anant Sharma

06/28/2023, 6:27 PM

Do we have to rerun this, or will it run by itself?

Anant Sharma

06/28/2023, 6:38 PM

It went to running and succeeded. I resumed the supervisor, and after running for a few minutes it became unhealthy again:

{
  "id": "index_kafka_eber_vehicle_components_status_7825e7f874a89fb_eddnapjj",
  "groupId": "index_kafka_eber_vehicle_components_status",
  "type": "index_kafka",
  "createdTime": "2023-06-28T18:35:37.814Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": 12809,
  "location": {
    "host": "10.101.34.114",
    "port": 8100,
    "tlsPort": -1
  },
  "dataSource": "eber_vehicle_components_status",
  "errorMsg": "org.apache.druid.java.util.common.ISE: Could not allocate segment for row with timestamp[2023-06-28T..."
}
