# troubleshooting
p
I've noticed that if I delete segments from the UI, it only removes them from ZK but not from deep storage. The next time I run an ingestion job for the table, unrelated to the deleted segments, it re-adds them to the table. Is this expected? Am I missing something?
k
I don't think the delete call deletes it from the deep store. We delete it from the deep store only when the retention manager kicks in (which is based on the retention set in the table config).
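For context, that retention comes from the segmentsConfig section of the table config, along the lines of the sketch below (values are illustrative, not taken from your actual config):

```json
{
  "tableName": "mm",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "365",
    "replication": "1"
  }
}
```

Segments whose time range falls outside that window get cleaned up when the retention manager runs.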
p
Is it expected that the ingestion job adds the deleted segments back? The deleted segments even have a different prefix than the one the job is running with.
k
> The next time I run an ingestion job for the table, unrelated to the deleted segments, it re-adds them to the table. Is this expected? Am I missing something?
This is not expected. Can you show the segments list and ingestion job spec?
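(For reference, the current segment list can be pulled from the controller REST API; this assumes the controller is reachable at pinot-controller:9000, as in the job spec below:)

```bash
# List the segments currently registered for the table "mm"
curl -s "http://pinot-controller:9000/segments/mm"
```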
p
The ingestion job spec:

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
jobType: SegmentCreationAndUriPush
inputDirURI: 's3://pinot-io/dotnet'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://pinot-io/segments'
segmentCreationJobParallelism: 4
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
      endpoint: 'http://pinot-minio:9090'
      accessKey: 'pinot'
      secretKey: 'pinot!!!'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'mm'
  schemaURI: 'http://pinot-controller:9000/tables/mm/schema'
  tableConfigURI: 'http://pinot-controller:9000/tables/mm'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
segmentNameGeneratorSpec:
  type: normalizedDate
  configs:
    segment.name.prefix: 'mm_batch_test'
```

The table's ideal state before running the job:

```json
{
  "id": "mm_OFFLINE",
  "simpleFields": {
    "BATCH_MESSAGE_MODE": "false",
    "IDEAL_STATE_MODE": "CUSTOMIZED",
    "INSTANCE_GROUP_TAG": "mm_OFFLINE",
    "MAX_PARTITIONS_PER_INSTANCE": "1",
    "NUM_PARTITIONS": "3",
    "REBALANCE_MODE": "CUSTOMIZED",
    "REPLICAS": "1",
    "STATE_MODEL_DEF_REF": "SegmentOnlineOfflineStateModel",
    "STATE_MODEL_FACTORY_NAME": "DEFAULT"
  },
  "mapFields": {
    "mm_batch_test_2020-11-19_2020-11-19_0": {
      "Server_172.20.0.6_8098": "ONLINE"
    },
    "mm_batch_test_2020-11-19_2020-11-19_1": {
      "Server_172.20.0.6_8098": "ONLINE"
    },
    "mm_batch_test_2020-11-19_2020-11-19_2": {
      "Server_172.20.0.6_8098": "ONLINE"
    }
  },
  "listFields": {}
}
```

After running the job:

```json
{
  "id": "mm_OFFLINE",
  "simpleFields": {
    "BATCH_MESSAGE_MODE": "false",
    "IDEAL_STATE_MODE": "CUSTOMIZED",
    "INSTANCE_GROUP_TAG": "mm_OFFLINE",
    "MAX_PARTITIONS_PER_INSTANCE": "1",
    "NUM_PARTITIONS": "7",
    "REBALANCE_MODE": "CUSTOMIZED",
    "REPLICAS": "1",
    "STATE_MODEL_DEF_REF": "SegmentOnlineOfflineStateModel",
    "STATE_MODEL_FACTORY_NAME": "DEFAULT"
  },
  "mapFields": {
    "mm_batch1_test_2020-11-19_2020-11-19_0": {
      "Server_172.20.0.6_8098": "ONLINE"
    },
    "mm_batch1_test_2020-11-19_2020-11-19_1": {
      "Server_172.20.0.6_8098": "ONLINE"
    },
    "mm_batch2_test_2020-11-19_2020-11-19_0": {
      "Server_172.20.0.6_8098": "ONLINE"
    },
    "mm_batch2_test_2020-11-19_2020-11-19_1": {
      "Server_172.20.0.6_8098": "ONLINE"
    },
    "mm_batch_test_2020-11-19_2020-11-19_0": {
      "Server_172.20.0.6_8098": "ONLINE"
    },
    "mm_batch_test_2020-11-19_2020-11-19_1": {
      "Server_172.20.0.6_8098": "ONLINE"
    },
    "mm_batch_test_2020-11-19_2020-11-19_2": {
      "Server_172.20.0.6_8098": "ONLINE"
    }
  },
  "listFields": {}
}
```
It picked up the old deleted segments from the other batch jobs.
I was expecting it to just replace the existing mm_batch_test segments.
k
Maybe it's because the old segments are still there in the output folder of the ingestion job?
p
yes I see that
What's the purpose of the ingestion job output, and why are the output files left there?
k
I don’t see any reason
We should delete it. Also, in Spark mode the task directory gets deleted automatically after the task runs; that's probably why we don't delete it explicitly.
Mind filing an issue?
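In the meantime, clearing the job's outputDirURI before re-running should keep the old tarballs from being pushed again. A rough sketch, assuming the MinIO endpoint, bucket, and credentials from the job spec above (adjust to your setup):

```bash
# Export credentials matching the accessKey/secretKey from the job spec
export AWS_ACCESS_KEY_ID=pinot
export AWS_SECRET_ACCESS_KEY='pinot!!!'

# Remove previously generated segment tarballs from the job's output location
# so the next SegmentCreationAndUriPush run only pushes freshly built segments
aws s3 rm s3://pinot-io/segments --recursive \
  --endpoint-url http://pinot-minio:9090
```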
p
Sure
thank you
s
The delete API should delete it right away, not just when the retention manager kicks in. It will move the segment to the deleted folder inside your deep store (I forget whether this is based on config), where it will reside for some number of days and then be removed. It is a bug if the segments still show up on a new table. It is possible if the new table is added within a very short time of the deletion, because it takes a few seconds for the segments to be deleted and for the Helix ExternalView to stabilize. So we always advise creating tables with a different name.
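To make that concrete, a per-segment delete goes through the controller, which then moves the tarball into the deleted-segments area of deep store. A sketch; the exact REST path and the retention property name are from memory, so verify them against your Pinot version:

```bash
# Ask the controller to delete one segment; the controller removes it from the
# ideal state and moves its tarball to the Deleted_Segments/<table> area in deep store
curl -X DELETE "http://pinot-controller:9000/segments/mm/mm_batch_test_2020-11-19_2020-11-19_0"

# controller.conf -- how many days deleted segments linger before final cleanup
# (property name from memory; treat it as an assumption to verify)
# controller.deleted.segments.retentionInDays=7
```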
p
I was deleting individual segments of the table and never saw them removed from deep storage. Are the servers responsible for deleting them?
s
Nope, they should get deleted when you delete the segments. Like I said, they are moved into a folder called `Deleted_Segments/tableName`.
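If you want to double-check, the moved tarballs should be visible in deep store; the exact prefix depends on where the controller's data directory points, so the path below is only illustrative:

```bash
# Spot-check the deleted-segments area in deep store
# (<controller-data-dir> is a placeholder; it is NOT the job's outputDirURI)
aws s3 ls "s3://pinot-io/<controller-data-dir>/Deleted_Segments/mm/" --recursive \
  --endpoint-url http://pinot-minio:9090
```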