# troubleshooting
s
Could you also share the overlord log from around that time? Segment allocation is done by the overlord, so that might tell us something. Something I noticed is that the timestamp it is loading is from 20 days ago; that might or might not be relevant. The task failures in the middle manager logs (code 137) indicate that the tasks ran out of memory.
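For reference, a minimal sketch of how an exit code like 137 can be read. It assumes the common Unix convention that a process killed by a signal reports 128 plus the signal number; `describe_exit_code` is a hypothetical helper, not part of Druid. Code 137 maps to SIGKILL, which in containers is often (though not always) the kernel or the orchestrator enforcing a memory limit.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```

SIGKILL cannot be caught by the task process, which would explain why the task log itself may show no error and the middle manager log is the place to look.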
a
I've fixed a few things, like segment granularity and resources, because of which I was seeing the 137 error on the MM. Now most of the tasks are in running state, but I can still see an error in the task status:
{
  "id": "index_kafka_eber_gateways_sensors_data_5a0ebe22f44a3f5_cepalohc",
  "groupId": "index_kafka_eber_gateways_sensors_data",
  "type": "index_kafka",
  "createdTime": "2023-06-28T06:53:01.032Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": -1,
  "location": {
    "host": "10.101.60.160",
    "port": 8100,
    "tlsPort": -1
  },
  "dataSource": "eber_gateways_sensors_data",
  "errorMsg": "The worker that this task was assigned disappeared and did not report cleanup within timeout[PT15M]...."
}
and I can see errors on my coordinator node as well:
2023-06-28T04:05:45,454 ERROR [qtp1286172885-118] org.apache.druid.indexing.common.actions.SegmentAllocateAction - Could not allocate pending segment for rowInterval[2023-06-27T11:15:20.599Z/2023-06-27T11:15:20.600Z], segmentInterval[2023-06-26T00:00:00.000Z/2023-07-03T00:00:00.000Z].
2023-06-28T04:48:58,796 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Tier[_default_tier] has no servers! Check your cluster configuration!: {class=org.apache.druid.server.coordinator.rules.LoadRule}
a
a
it's for the same
"skipOffsetFromLatest": "PT1H",
this is what i have changed
@Amatya Avadhanula
a
2023-06-28T04:48:58,796 ERROR [Coordinator-Exec--0] org.apache.druid.server.coordinator.rules.LoadRule - Tier[_default_tier] has no servers! Check your cluster configuration!: {class=org.apache.druid.server.coordinator.rules.LoadRule}
There is a historical now, so I don't think this should be present anymore. Could you confirm if the historical recently restarted?
"skipOffsetFromLatest": "PT1H",
This is not the segment granularity
a
I've made a few changes to the resources, that's why it got restarted 25 hours back.
So I recently set up the compaction, where I put the above-mentioned "skipOffsetFromLatest": "PT1H".
Before it was for a week, I've changed it to an hour. Below is the supervisor conf, and there we have defined the granularity as MONTH. Are you sure this is because of the granularity? The exact same thing is running in my other environment and it's working fine. What could possibly be causing the issue here?
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "eber_vehicle_components_status",
      "timestampSpec": {
        "column": "timestamp",
        "format": "millis",
        "missingValue": null
      },
      "dimensionsSpec": {
        "dimensions": [
          {
            "type": "string",
            "name": "gateway_id",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "string",
            "name": "value",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": true
          },
          {
            "type": "float",
            "name": "id",
            "multiValueHandling": "SORTED_ARRAY",
            "createBitmapIndex": false
          }
        ],
        "dimensionExclusions": [
          "__time",
          "timestamp"
        ],
        "includeAllDimensions": false
      },
      "metricsSpec": [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "MONTH",
        "queryGranularity": {
          "type": "none"
        },
        "rollup": true,
        "intervals": []
      },
      "transformSpec": {
        "filter": null,
        "transforms": []
      }
    },
    "ioConfig": {
      "topic": "eber.vehicle.components.status.qc",
      "inputFormat": {
        "type": "json",
        "flattenSpec": {
          "useFieldDiscovery": true,
          "fields": []
        },
        "keepNullColumns": true,
        "assumeNewlineDelimited": false,
        "useJsonNodeReader": false
      },
      "replicas": 1,
      "taskCount": 1,
      "taskDuration": "PT86400S",
      "consumerProperties": {
        "bootstrap.servers": "qc-kafka-kafka-bootstrap.kafka.svc.cluster.local:9092,",
        "security.protocol": "SASL_PLAINTEXT",
        "sasl.mechanism": "SCRAM-SHA-512",
        "sasl.jaas.config": "org.apache.kafka.common.security.scram.ScramLoginModule required username='admin-etm-qc' password='xxx';",
        "auto.offset.reset": "earliest"
      },
      "autoScalerConfig": null,
      "pollTimeout": 100,
      "startDelay": "PT5S",
      "period": "PT30S",
      "useEarliestOffset": true,
      "completionTimeout": "PT1800S",
      "lateMessageRejectionPeriod": null,
      "earlyMessageRejectionPeriod": null,
      "lateMessageRejectionStartDateTime": null,
      "configOverrides": null,
      "idleConfig": null,
      "stream": "eber.vehicle.components.status.qc",
      "useEarliestSequenceNumber": true,
      "type": "kafka"
    },
    "tuningConfig": {
      "type": "kafka",
      "appendableIndexSpec": {
        "type": "onheap",
        "preserveExistingMetrics": false
      },
      "maxRowsInMemory": 1000000,
      "maxBytesInMemory": 0,
      "skipBytesInMemoryOverheadCheck": false,
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": null,
      "intermediatePersistPeriod": "PT10M",
      "maxPendingPersists": 0,
      "indexSpec": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": {
          "type": "utf8"
        },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "indexSpecForIntermediatePersists": {
        "bitmap": {
          "type": "roaring",
          "compressRunOnSerialization": true
        },
        "dimensionCompression": "lz4",
        "stringDictionaryEncoding": {
          "type": "utf8"
        },
        "metricCompression": "lz4",
        "longEncoding": "longs"
      },
      "reportParseExceptions": false,
      "handoffConditionTimeout": 0,
      "resetOffsetAutomatically": false,
      "segmentWriteOutMediumFactory": null,
      "workerThreads": null,
      "chatThreads": null,
      "chatRetries": 8,
      "httpTimeout": "PT10S",
      "shutdownTimeout": "PT80S",
      "offsetFetchPeriod": "PT30S",
      "intermediateHandoffPeriod": "P2147483647D",
      "logParseExceptions": false,
      "maxParseExceptions": 2147483647,
      "maxSavedParseExceptions": 0,
      "skipSequenceNumberAvailabilityCheck": false,
      "repartitionTransitionDuration": "PT120S"
    }
  },
  "context": null
}
a
Could you share the compaction spec as well?
a
sure
{
  "dataSource": "eber_vehicle_components_status",
  "taskPriority": 80,
  "inputSegmentSizeBytes": 100000000000000,
  "maxRowsPerSegment": null,
  "skipOffsetFromLatest": "PT1H",
  "tuningConfig": {
    "maxRowsInMemory": null,
    "appendableIndexSpec": null,
    "maxBytesInMemory": null,
    "maxTotalRows": null,
    "splitHintSpec": null,
    "partitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000,
      "maxTotalRows": null
    },
    "indexSpec": null,
    "indexSpecForIntermediatePersists": null,
    "maxPendingPersists": null,
    "pushTimeout": null,
    "segmentWriteOutMediumFactory": null,
    "maxNumConcurrentSubTasks": null,
    "maxRetry": null,
    "taskStatusCheckPeriodMs": null,
    "chatHandlerTimeout": null,
    "chatHandlerNumRetries": null,
    "maxNumSegmentsToMerge": null,
    "totalNumMergeTasks": null,
    "maxColumnsToMerge": null,
    "type": "index_parallel",
    "forceGuaranteedRollup": false
  },
  "granularitySpec": null,
  "dimensionsSpec": null,
  "metricsSpec": null,
  "transformSpec": null,
  "ioConfig": null,
  "taskContext": null
}
a
Strange, my suspicion was that you were using a granularity of WEEK in the compaction spec. Sorry to repeat, but could you please confirm from the segments tab if there are segments with week granularity that already exist for 2023-06-26/2023-07-03?
a
@Amatya Avadhanula can you tell me how I can do so? Below is the screenshot of the segments, which might help with your question.
a
The first few (4) rows have a weekly granularity (refer to the start and end columns of segments). These belong to the datasource eber_v...
You could stop ingestion for this datasource, reindex the data for the problematic intervals to monthly granularity and then resume ingestion
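To illustrate why the weekly segments get in the way, here is a small sketch that computes the week and month buckets for the row timestamp from the allocation error. It assumes Monday-aligned weeks in UTC, which matches the segmentInterval printed in that error; the helper names are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def week_bucket(ts):
    """Monday-aligned 7-day interval containing ts (assumes UTC)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    start = midnight - timedelta(days=midnight.weekday())
    return start, start + timedelta(days=7)


def month_bucket(ts):
    """Calendar-month interval containing ts."""
    start = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if start.month == 12:
        end = start.replace(year=start.year + 1, month=1)
    else:
        end = start.replace(month=start.month + 1)
    return start, end


def overlaps(a, b):
    """True if the half-open intervals a and b share any instant."""
    return a[0] < b[1] and b[0] < a[1]


# Row timestamp taken from the "Could not allocate pending segment" error.
row = datetime(2023, 6, 27, 11, 15, 20, tzinfo=timezone.utc)
existing_week = week_bucket(row)
first_month = month_bucket(row)
# Month containing the last instant of the existing weekly interval.
last_month = month_bucket(existing_week[1] - timedelta(microseconds=1))

print(existing_week[0].date(), existing_week[1].date())
print(first_month[0].date(), last_month[1].date())
```

The weekly interval 2023-06-26/2023-07-03 straddles both June and July, so a reindex that replaces it with monthly segments has to cover both months.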
a
Any suggested ways to do so?
And do you think this is the only issue, or could it be something else too?
a
And do you think this is the only issue, or could it be something else too?
Do you observe any other issues with ingestion for datasources having similar volume + task count? If not, I think this could be the only issue
a
no i couldn’t
a
1. Suspend the supervisor for the datasource. (Three dots next to the magnifying glass icon > suspend)
2. Submit a reindexing task using the native batch ingestion wizard for the interval 2023-06-01/2023-08-01 for the datasource (eber_v...) and follow the wizard. Just change the segment granularity from week to month
3. Resume the supervisor for the datasource
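If you prefer the API over the console, the same three steps can be sketched roughly as below. This only builds the requests and sends nothing. The paths are the Overlord supervisor and task endpoints; the spec is a hand-written approximation of what the wizard generates, and the function names, the dimension list, and the assumption that the supervisor ID equals the datasource name are illustrative, so verify them against your cluster.

```python
import json


def reindex_spec(data_source, interval, dimensions):
    """Native batch spec that re-reads a datasource and rewrites it with MONTH segments."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": data_source,
                # Rows read back from Druid carry their time in __time.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "MONTH",
                    "queryGranularity": "none",  # left unchanged, as advised
                    "rollup": True,
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {
                    "type": "druid",
                    "dataSource": data_source,
                    "interval": interval,
                },
                "appendToExisting": False,  # overwrite the interval
            },
            "tuningConfig": {
                "type": "index_parallel",
                "partitionsSpec": {"type": "dynamic"},
            },
        },
    }


def reindex_plan(supervisor_id, data_source, interval, dimensions):
    """Ordered (method, path, body) requests for suspend, reindex, resume."""
    base = "/druid/indexer/v1"
    return [
        ("POST", f"{base}/supervisor/{supervisor_id}/suspend", None),
        ("POST", f"{base}/task", reindex_spec(data_source, interval, dimensions)),
        # Send this one only after the reindex task reports SUCCESS.
        ("POST", f"{base}/supervisor/{supervisor_id}/resume", None),
    ]


plan = reindex_plan(
    "eber_vehicle_components_status",  # assumed supervisor ID
    "eber_vehicle_components_status",
    "2023-06-01/2023-08-01",
    ["gateway_id", "value", {"type": "float", "name": "id"}],
)
for method, path, body in plan:
    print(method, path, json.dumps(body)[:60] if body else "")
```

Keeping the resume call as a separate, last step makes it harder to restart ingestion while the old weekly segments are still in place.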
a
let me try right now
@Amatya Avadhanula am I in the right place?
I can't see the dates that you suggested!
a
The supervisor cannot be used to reindex
Load data > Start a new batch spec > Reindex from druid
a
Do I have to add month here?
After putting in the dates I clicked next, next.
a
No, this is query granularity. Please don't modify it
a
OK, I didn't 😅 Is it the right place?
a
Yes!
a
What should the Partitioning type and Time intervals be?
It doesn't allow me to move forward.
a
Partitioning type -> dynamic (for now). intervals can be left blank
a
For Max rows per segment and Max total rows, can I use the older values or should we use new ones?
a
defaults should be fine
a
okay
This should be default as well? I'm sorry to trouble you, I just want to be confident about it.
a
Yes, the defaults would work since there isn't a lot of data yet
a
I think this is the last one. This should be default too?
a
Yes
a
I've submitted it and it's back on the same page with the same specs, and the supervisor is still suspended. Have we missed anything?
a
No, please check the task's status
After the task succeeds, you can resume the supervisor
a
It failed.
a
Error?
a
In the middlemanager logs:
2023-06-28T18:01:36,648 ERROR [forking-task-runner-13] org.apache.druid.indexing.overlord.ForkingTaskRunner - Process exited with code[137] for task: index_parallel_eber_vehicle_components_status_beemdiaf_2023-06-28T18:01:17.205Z
But there are only 3 pods running and I've set maximumreplicas to 5, so it doesn't seem to be an issue with resources.
{
  "id": "index_parallel_eber_vehicle_components_status_beemdiaf_2023-06-28T18:01:17.205Z",
  "groupId": "index_parallel_eber_vehicle_components_status_beemdiaf_2023-06-28T18:01:17.205Z",
  "type": "index_parallel",
  "createdTime": "2023-06-28T18:01:17.206Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": 19295,
  "location": {
    "host": "10.101.42.175",
    "port": 8101,
    "tlsPort": -1
  },
  "dataSource": "eber_vehicle_components_status",
  "errorMsg": "Task execution process exited unsuccessfully with code[137]. See middleManager logs for more details..."
}
a
Code 137 is generally due to memory limits
a
let me check
It's really low.
I just increased the resources to verify that this is not the issue!
Do we have to rerun this, or will it run by itself?
It went to running and succeeded. I resumed the supervisor, and after running for a few minutes it got unhealthy again:
{
  "id": "index_kafka_eber_vehicle_components_status_7825e7f874a89fb_eddnapjj",
  "groupId": "index_kafka_eber_vehicle_components_status",
  "type": "index_kafka",
  "createdTime": "2023-06-28T18:35:37.814Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "FAILED",
  "status": "FAILED",
  "runnerStatusCode": "WAITING",
  "duration": 12809,
  "location": {
    "host": "10.101.34.114",
    "port": 8100,
    "tlsPort": -1
  },
  "dataSource": "eber_vehicle_components_status",
  "errorMsg": "org.apache.druid.java.util.common.ISE: Could not allocate segment for row with timestamp[2023-06-28T..."