Is it possible to manually add segments to a table...
# general
b
Is it possible to manually add segments to a table that have already been uploaded to the deep store without uploading each segment to the controller? Basically I want to send just the metadata to the controller to add the segments to the table
b
thanks!
s
@Neha Pawar on a related note: let's say that in a disaster recovery situation we lost an upsert-enabled realtime table. What backup strategy (if any) can we employ so that we could restore that table?
n
an upsert table's segments follow the same deep store persistence as any regular table's. when you start uploading segments to your new table, the upsert metadata will be rebuilt
@Kartik Khare anything to watch out for when uploading to a realtime table with an upsert config in such a DR scenario?
k
No, it works like any other table. The only thing is it might take a bit more time, since it has to build the upsert metadata for each segment before loading the next.
n
Would one have to make sure to re-upload the segments in the same sequence-number order, I assume? Or does the metadata manager take care of it? (Say I uploaded table_1_19_ before table_1_18_.)
k
The metadata manager will take care of out-of-order uploads. Only the record with the highest comparison-column value will remain in the end.
The intermediate state can be inconsistent for out-of-order uploads though, so you will have to wait until all segments get reloaded. This is what we do for partial upsert as well inside the code.
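For example, to check that, you can list what the controller has registered for the table after the re-upload (just a sketch, assuming the standard controller REST API; substitute your own controller host and table name):

```bash
# Sketch: list the segments the controller knows about for the realtime table,
# then compare the count against the number of segment files you re-uploaded.
curl -s "http://pinot-controller:9000/segments/mutable_events?type=REALTIME"
```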
a
@Neha Pawar So I'm trying to run SegmentMetadataPush, but it appears that the code is expecting the segment files to have a .tar.gz extension while our files have no extension at all. Maybe I'm missing a parameter in the job config? Here's the command I'm running:
```bash
pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /tmp/segment_metadata_push_config.yaml
```
Here's the output:
```
SegmentGenerationJobSpec:
!!org.apache.pinot.spi.ingestion.batch.spec.SegmentGenerationJobSpec
authToken: null
cleanUpOutputDir: false
excludeFileNamePattern: null
executionFrameworkSpec: {extraConfigs: null, name: standalone, segmentGenerationJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner,
  segmentMetadataPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner,
  segmentTarPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner,
  segmentUriPushJobRunnerClassName: org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner}
failOnEmptySegment: false
includeFileNamePattern: null
inputDirURI: null
jobType: SegmentMetadataPush
outputDirURI: gs://ica-pinot-sedcasb-feature-pinot/data/immutable_events
overwriteOutput: false
pinotClusterSpecs:
- {controllerURI: 'http://pinot-controller:9000'}
pinotFSSpecs:
- {className: org.apache.pinot.plugin.filesystem.GcsPinotFS, configs: null, scheme: gs}
pushJobSpec: {pushAttempts: 2, pushFileNamePattern: null, pushParallelism: 1, pushRetryIntervalMillis: 1000,
  segmentUriPrefix: null, segmentUriSuffix: null}
recordReaderSpec: null
segmentCreationJobParallelism: 0
segmentNameGeneratorSpec: null
tableSpec: {schemaURI: 'http://pinot-controller:9000/tables/immutable_events/schema',
  tableConfigURI: null, tableName: immutable_events}
tlsSpec: null

Trying to create instance for class org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner
Initializing PinotFS for scheme gs, classname org.apache.pinot.plugin.filesystem.GcsPinotFS
Configs using default credential
Listed 1162 files from URI: gs://ica-pinot-sedcasb-feature-pinot/data/immutable_events, is recursive: true
Start pushing segment metadata: {} to locations: [org.apache.pinot.spi.ingestion.batch.spec.PinotClusterSpec@3afae281] for table immutable_events
```
Here's our config:
```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'

# Recommended to set jobType to SegmentCreationAndMetadataPush for production environment where Pinot Deep Store is configured  
jobType: SegmentMetadataPush

outputDirURI: 'gs://ica-pinot-sedcasb-feature-pinot/data/immutable_events'
pinotFSSpecs:
  - scheme: gs
    className: 'org.apache.pinot.plugin.filesystem.GcsPinotFS'
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
tableSpec:
  tableName: 'immutable_events'
  schemaURI: 'http://pinot-controller:9000/tables/immutable_events/schema'
```
I saw this line in SegmentPushUtils.sendSegmentUriAndMetadata that seems to expect .tar.gz extensions:
```java
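// The check below rejects any segment file whose name lacks the .tar.gz extension: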
Preconditions.checkArgument(fileName.endsWith(Constants.TAR_GZ_FILE_EXT));
```
CC @Kartik Khare
s
Also, for some clarification: we lost our entire namespace and had no backup of ZooKeeper, so we're thinking we lost our segment metadata and won't be able to restore at all with just the segments. This note makes me feel like we're not going to be able to restore:
Note: Deep Store by itself is not sufficient for restore operations. Pinot stores metadata such as table config, schema, segment metadata in Zookeeper. For restore operations, both Deep Store as well as Zookeeper metadata are required.
Which is fine, we just need confirmation. This was only a dev environment and we are practicing DR stuff. CC @Neha Pawar
n
what's the SegmentGenerationJobRunner you're using, @Aaron Weiss? I see at least SegmentGenerationJobRunner and SparkSegmentGenerationJobRunner will create files with .tar.gz
b
These are segments that were created by realtime ingestion
s
and offline segments created w/ the offline rollup job
We renamed one to add .tar.gz to the end and it loaded fine.
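For the rest of the files, a bulk rename along these lines should work (untested sketch, assuming the gsutil CLI and the bucket path from the job spec above):

```bash
# Sketch: append .tar.gz to every segment file in the output dir that lacks it.
for uri in $(gsutil ls gs://ica-pinot-sedcasb-feature-pinot/data/immutable_events); do
  case "$uri" in
    */) ;;                        # skip subdirectories
    *.tar.gz) ;;                  # already has the extension
    *) gsutil mv "$uri" "${uri}.tar.gz" ;;
  esac
done
```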
b
ah yes
n
I see, then it might be a bug in the SegmentGenerator method used by the rollup job. Is this RealtimeToOffline?
a
yep, RealtimeToOfflineSegmentsTask
n
so your R2O job pushed segments to the offline table just fine, but when you tried to do a re-upload for DR, it failed, is that right?
a
correct, because the files in GCS have no file extension (even though they are valid .tar.gz archives)
and the re-upload expects the .tar.gz extension to be on the segment files
n
Adding @Haitao Zhang, who I think was going to add metadata push support natively to R2O. I'm guessing as part of that we'll encounter this issue and will have to fix it. Would you mind filing an issue?
And regarding the DR: yes, as of now, if you lose ZK data (so mainly table configs wiped out), you won't be able to restore. But if you had the table configs, you could still potentially upload everything. cc @Rong R, who had an idea to implement a tool to save the whole snapshot to a deep store (beyond the tar.gz segments)
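Until such a tool exists, one possible stopgap (just a sketch, assuming the standard controller REST endpoints) is to periodically export the ZK-side metadata yourself, so the table config and schema survive a lost namespace:

```bash
# Sketch: back up table config and schema via the controller REST API.
curl -s http://pinot-controller:9000/tables/immutable_events > immutable_events_table_config.json
curl -s http://pinot-controller:9000/schemas/immutable_events > immutable_events_schema.json
```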
👀 1
s
Well I think with this rename of the file we are just fine
a
We recreated the table and schema prior to attempting to reload
s
That was our issue: the job required the files to have that extension; once we added it, our segments loaded fine.
a
@Neha Pawar / @Haitao Zhang We created https://github.com/apache/pinot/issues/9307 for this issue. Thanks for the help!
👍 1
@Neha Pawar / @Kartik Khare The SegmentMetadataPush job seems to only be available for OFFLINE tables. We mentioned the realtime upsert-enabled table we have earlier in this thread. When I attempted to restore segments for that table, it failed with:
```
Caught temporary exception while pushing table: mutable_events segment: mutable_events__0__3__20220820T0921Z to http://localhost:9000, will retry
org.apache.pinot.common.exception.HttpErrorStatusException: Got error status code: 500 (Internal Server Error) with reason: "Exception while uploading segment: Table config is not available for table 'mutable_events_OFFLINE'" while sending request
```
I found another recent thread that suggests the REALTIME capability was added in 0.11; we're on 0.10 right now. https://apache-pinot.slack.com/archives/C011C9JHN7R/p1661952406323549?thread_ts=1661948120.349419&cid=C011C9JHN7R
It looks like this commit added the capability to 0.11: https://github.com/apache/pinot/commit/a9cf7a8b2bf36b3454bb27384aee83bc45bf1531
We're able to work around it in 0.10 by uploading one file at a time using:
```bash
curl -X POST -F segment=@"mutable_events__0__3__20220820T0921Z.tar.gz" http://localhost:9000/v2/segments\?tableName\=mutable_events\&tableType\=REALTIME
{"status":"Successfully uploaded segment: mutable_events__0__3__20220820T0921Z of table: mutable_events_REALTIME"}
h
Fixed the segment name check in metadata push: https://github.com/apache/pinot/pull/9359