# troubleshooting
g
Hi folks, I’m trying to figure out what would be the best option for us to backfill some data into an offline table. Standalone is not an option as it involves a lot of data, which leaves minions or Spark. Do minions generate 1 segment per input file? The reason I ask is that the offline data is currently stored in files with 100K documents max, and it would be better to increase that number. The data is also not completely in order, so there would be potential for data loss (I’m assuming). In Spark, how are segments generated, and how is the size determined?
m
In OSS, 1 input file becomes 1 segment, so you’ll have to use the merge rollup task (MergeRollupTask) after ingestion. Or, if you can use a Spark job to merge the files before ingestion, that’s another option.
g
Is it also possible to ingest from Hive and generate the segments?
k
We don’t have a Hive connector.. would be great to add a Hive reader implementation..
g
I guess I can use a SegmentGenerationAndPushTask + MergeRollupTask. Is it possible for a minion to read a custom file format (like during realtime ingestion) as part of a SegmentGenerationAndPushTask? And during the SegmentGenerationAndPushTask, can it read from filesystem A but write the output to filesystem B?
m
That’s what Kishore is hinting at. You need to write a record reader for Hive. That’s all that’s needed to read a format and generate the index, and it can then be hooked up to the existing minion framework.
g
I’m not talking about Hive here; I’m debating between 2 different input sources. Hive might be for the future, but the file-based method matches our current offline path best. The idea in my mind is: blob store with input data -> minion -> S3 deep store, but I’m not sure this is possible with just plugin dev.
m
What’s the format of the data in the blob store? A data push to Pinot is essentially writing to the deep store (metadata push job).
g
It’s a binary format containing protobufs, 100K protobufs per file.
m
Yeah, there’s a protobuf reader already iirc
g
I’m aware. Unfortunately these protobufs are not flat, so I created our own reader that flattens the fields. I’m just wondering how the file reading would be handled.
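Roughly, the flattening does something like this (a simplified sketch, not our actual code): walk the message recursively and emit dotted column names, with proper repeated-field handling left out.

```java
import java.util.Map;

import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.Message;
import org.apache.pinot.spi.data.readers.GenericRow;

// Sketch only: flattens a nested protobuf Message into a Pinot GenericRow
// using dotted column names, e.g. "user.address.city".
public final class ProtoFlattener {
  private ProtoFlattener() {
  }

  public static void flattenInto(Message message, String prefix, GenericRow row) {
    for (Map.Entry<FieldDescriptor, Object> entry : message.getAllFields().entrySet()) {
      FieldDescriptor fd = entry.getKey();
      String name = prefix.isEmpty() ? fd.getName() : prefix + "." + fd.getName();
      Object value = entry.getValue();
      if (fd.getJavaType() == FieldDescriptor.JavaType.MESSAGE && !fd.isRepeated()) {
        // Recurse into singular nested messages.
        flattenInto((Message) value, name, row);
      } else {
        // Scalars and repeated fields go in directly.
        row.putValue(name, value);
      }
    }
  }
}
```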
m
You could flatten using ingestion transforms as well (most of the time). I’m unclear on what the issue with file reading is. Is it that the blob store doesn’t give a file system / directory structure, or something else?
g
Ooh wait, this requires a RecordReader for this format, which implements the parsing of the file. Sorry, I forgot about that. So it would be a combination of BlockStorePinotFS (implemented) + custom RecordReader (todo) + custom proto extractor (implemented).
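For the todo part, I’m picturing something along these lines. Just a sketch against the pinot-spi RecordReader interface, assuming the protobufs are length-delimited within each file; MyEvent is a placeholder for the generated proto class and ProtoFlattener is the sketch above standing in for our extractor.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;

import org.apache.pinot.spi.data.readers.GenericRow;
import org.apache.pinot.spi.data.readers.RecordReader;
import org.apache.pinot.spi.data.readers.RecordReaderConfig;

// Sketch only: reads length-delimited protobufs from the local file handed
// over by the ingestion framework and flattens each one into a row.
public class ProtoFileRecordReader implements RecordReader {
  private File _dataFile;
  private InputStream _in;
  private MyEvent _next; // placeholder for the generated protobuf class

  @Override
  public void init(File dataFile, Set<String> fieldsToRead, RecordReaderConfig config)
      throws IOException {
    // fieldsToRead is ignored in this sketch; a real reader could prune columns here.
    _dataFile = dataFile;
    _in = new BufferedInputStream(new FileInputStream(dataFile));
    _next = MyEvent.parseDelimitedFrom(_in); // returns null at EOF
  }

  @Override
  public boolean hasNext() {
    return _next != null;
  }

  @Override
  public GenericRow next() throws IOException {
    return next(new GenericRow());
  }

  @Override
  public GenericRow next(GenericRow reuse) throws IOException {
    reuse.clear();
    ProtoFlattener.flattenInto(_next, "", reuse); // flatten nested fields into columns
    _next = MyEvent.parseDelimitedFrom(_in);      // read ahead so hasNext() stays cheap
    return reuse;
  }

  @Override
  public void rewind() throws IOException {
    _in.close();
    _in = new BufferedInputStream(new FileInputStream(_dataFile));
    _next = MyEvent.parseDelimitedFrom(_in);
  }

  @Override
  public void close() throws IOException {
    _in.close();
  }
}
```

The parseDelimitedFrom framing is only an assumption on my part; the read loop would have to match however the 100K protobufs are actually laid out in our files.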