# troubleshooting
g
Hi folks, I’m trying to figure out what would be the best option for us to backfill some data into an offline table. Standalone is not an option as it involves a lot of data, which leaves minions or Spark. Do minions generate 1 segment per input file? The reason I ask is that the offline data is currently stored in files with 100K documents max, and it would be better to increase that number. The data is also not completely in order, so there would be potential for data loss (I’m assuming). In Spark, how are segments generated, and how is the size determined?
m
In OSS, 1 input file becomes 1 segment, so you’ll have to use the merge rollup task (MergeRollupTask) after ingestion. Or, if you can use a Spark job to merge the files before ingestion, that’s another option.
g
Is it also possible to ingest from Hive and generate the segments?
k
We don’t have a Hive connector.. would be great to add a Hive reader implementation..
g
I guess I can use a SegmentGenerationAndPushTask + MergeRollupTask. Is it possible for a minion to read a custom file format (like during realtime ingestion) as part of a SegmentGenerationAndPushTask? And during the SegmentGenerationAndPushTask, can it read from filesystem A but write the output to filesystem B?
m
That’s what Kishore is hinting at. You need to write a record reader for Hive. That’s all that’s needed to read a format and generate the index, and it can then be hooked up to the existing minion framework.
g
I’m not talking about Hive here; I’m debating between 2 different input sources. Hive might be for the future, but the file-based method matches our current offline path best. The idea in my mind is: blob store with input data -> minion -> S3 deep store, but I’m not sure this is possible with just plugin dev.
m
What’s the format of the data in the blob store? A data push to Pinot is essentially writing to the deep store (metadata push job).
g
It’s a binary format containing protobufs, 100K protobufs per file.
m
Yeah, there’s a protobuf reader already iirc
g
I’m aware. Unfortunately these protobufs are not flat, so I created our own reader that flattens the fields. I’m just wondering how the file reading would be handled.
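Roughly, the flattening does something like this (a simplified sketch, not our actual code): walk the message recursively and emit dotted column names, with proper repeated-field handling left out.

```java
import java.util.Map;

import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.Message;
import org.apache.pinot.spi.data.readers.GenericRow;

// Sketch only: flattens a nested protobuf Message into a Pinot GenericRow
// using dotted column names, e.g. "user.address.city".
public final class ProtoFlattener {
  private ProtoFlattener() {
  }

  public static void flattenInto(Message message, String prefix, GenericRow row) {
    for (Map.Entry<FieldDescriptor, Object> entry : message.getAllFields().entrySet()) {
      FieldDescriptor fd = entry.getKey();
      String name = prefix.isEmpty() ? fd.getName() : prefix + "." + fd.getName();
      Object value = entry.getValue();
      if (fd.getJavaType() == FieldDescriptor.JavaType.MESSAGE && !fd.isRepeated()) {
        // Recurse into singular nested messages.
        flattenInto((Message) value, name, row);
      } else {
        // Scalars and repeated fields go in directly.
        row.putValue(name, value);
      }
    }
  }
}
```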
m
You could flatten using ingestion transforms as well (most of the time). I’m unclear on what the issue with file reading is. Is it that the blob store doesn’t give a file system / directory structure, or something else?
g
Ooh wait, this requires a RecordReader for this format, which implements the parsing of the file. Sorry, I forgot about that. So it would be a combination of BlockStorePinotFS (implemented) + custom RecordReader (todo) + custom proto extractor (implemented).
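For the todo part, I’m picturing something along these lines. Just a sketch against the pinot-spi RecordReader interface, assuming the protobufs are length-delimited within each file; MyEvent is a placeholder for the generated proto class and ProtoFlattener is the sketch above standing in for our extractor.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;

import org.apache.pinot.spi.data.readers.GenericRow;
import org.apache.pinot.spi.data.readers.RecordReader;
import org.apache.pinot.spi.data.readers.RecordReaderConfig;

// Sketch only: reads length-delimited protobufs from the local file handed
// over by the ingestion framework and flattens each one into a row.
public class ProtoFileRecordReader implements RecordReader {
  private File _dataFile;
  private InputStream _in;
  private MyEvent _next; // placeholder for the generated protobuf class

  @Override
  public void init(File dataFile, Set<String> fieldsToRead, RecordReaderConfig config)
      throws IOException {
    // fieldsToRead is ignored in this sketch; a real reader could prune columns here.
    _dataFile = dataFile;
    _in = new BufferedInputStream(new FileInputStream(dataFile));
    _next = MyEvent.parseDelimitedFrom(_in); // returns null at EOF
  }

  @Override
  public boolean hasNext() {
    return _next != null;
  }

  @Override
  public GenericRow next() throws IOException {
    return next(new GenericRow());
  }

  @Override
  public GenericRow next(GenericRow reuse) throws IOException {
    reuse.clear();
    ProtoFlattener.flattenInto(_next, "", reuse); // flatten nested fields into columns
    _next = MyEvent.parseDelimitedFrom(_in);      // read ahead so hasNext() stays cheap
    return reuse;
  }

  @Override
  public void rewind() throws IOException {
    _in.close();
    _in = new BufferedInputStream(new FileInputStream(_dataFile));
    _next = MyEvent.parseDelimitedFrom(_in);
  }

  @Override
  public void close() throws IOException {
    _in.close();
  }
}
```

The parseDelimitedFrom framing is only an assumption on my part; the read loop would have to match however the 100K protobufs are actually laid out in our files.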