# getting-started
l
has anyone tried to move the segments folder to BigQuery, if you use Google Cloud? to do big data computations that may not be possible in Pinot?
m
The segment files are written in a format that's specific to Pinot, I don't think BigQuery would know what to do with them? (I think that's what you're asking but correct me if not)
l
yea, I guess my question then is: is there any easy way to offload data in Pinot to a big data store such as BigQuery?
m
that is a good question. A lot of Pinot users are ingesting data from one of the streaming tools (mostly Kafka), in which case the source of truth of the data would be Kafka rather than Pinot. I'm assuming you have a scenario where Pinot is the source of truth for the data?
Gonna tag in @User in case he has some insight on data export from Pinot
m
Thanks @User, you are right. Typical usage pattern is to maintain a source of truth of original data outside of Pinot.
l
I see, is it correct to assume that anything that gets into a Kafka topic makes it to Pinot then? My initial thought was to treat Kafka as the source of truth as well, but I was not sure if there was data drop or something like that between Kafka and Pinot. If it's pretty much equal, then yes, I think this is pretty much solvable with a sink from the topic to BigQuery.
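For reference, a rough sketch of what that sink could look like with Kafka Connect and a BigQuery sink connector. The topic, project, dataset, key file, and Connect URL are placeholders, and the exact property names vary by connector version, so check the connector docs:
```python
import requests  # assumes the Kafka Connect REST API is reachable

# Hypothetical values -- replace with your own topic, GCP project, dataset, and key file.
connector = {
    "name": "events-to-bigquery",
    "config": {
        # WePay/Confluent BigQuery sink connector; property names differ between
        # connector versions, so double-check against the connector's documentation.
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "tasks.max": "1",
        "topics": "events",                # the same topic Pinot consumes from
        "project": "my-gcp-project",
        "defaultDataset": "pinot_mirror",
        "keyfile": "/secrets/bq-key.json",
    },
}

# Standard Kafka Connect REST call to create the connector.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```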
m
Pinot embeds a Kafka client that takes messages off a topic and builds segments from the content of those messages, so yeah, I think you can assume the data on the Kafka topic == the data in Pinot [Mayank's answer is better, read that!]
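To make that concrete, the Kafka consumption is driven by the streamConfigs section of the REALTIME table config, something roughly like this. The topic, broker list, and table name are placeholders, and the config is abbreviated (a real table config has more required sections):
```python
import json

# Abbreviated sketch of a REALTIME table config -- only the stream-related part is shown.
table_config = {
    "tableName": "events",
    "tableType": "REALTIME",
    "tableIndexConfig": {
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "events",
            "stream.kafka.broker.list": "kafka:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
        }
    },
}

print(json.dumps(table_config, indent=2))
```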
m
Typically the stream data is ETL'd into a data lake and kept as the source of truth. Pinot does not enforce a 1-1 schema match with the Kafka topic; in fact, it allows for transformations, reading a subset of the schema, etc., in which case it won't hold the exact Kafka data.
The offline ETL pipeline is supposed to take care of correcting any issues in the stream data
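As an illustration of that point, the table config can carry ingestion transformations, and columns that aren't declared in the Pinot schema are simply not stored, so the rows in Pinot can legitimately differ from the raw Kafka messages. A rough sketch (the column and source field names are made up):
```python
import json

# Hypothetical ingestion config: derive a column at ingestion time; any field in the
# Kafka message that isn't in the Pinot schema is dropped.
ingestion_config = {
    "ingestionConfig": {
        "transformConfigs": [
            {
                "columnName": "event_date",
                # Transform functions are evaluated per message at ingestion time.
                "transformFunction": "toDateTime(event_ts_millis, 'yyyy-MM-dd')",
            }
        ]
    }
}

print(json.dumps(ingestion_config, indent=2))
```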
l
so basically we would treat the topic as the source of truth and not Pinot, is that correct?
and Pinot itself has assurances that data is not lost between Kafka and itself?
m
Yes, correct.
l
so in my case I'm trying to validate that the data we are ingesting is equal to what our current system has. I got stuck for a second because I cannot just query Pinot abruptly, so I was thinking that maybe if I had access to the segments I could upload them to BigQuery and then do the checks there
so that's why in this case I was weirded out by treating the topic that Pinot consumes from as the source of truth instead of Pinot
I guess the main issue is that we cannot issue a crazy query in Pinot, or something like that, that would take several records and sync them to BigQuery
m
Yes, that's not the right usage for Pinot.
You can still achieve the validation if you independently put the data into BQ as well as Pinot. If you ingest into and download from Pinot, "technically" that approach is not ideal, because the ingestion/download path of the system being tested might itself have an issue (not saying that is the case for Pinot)
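If it helps, a minimal sketch of that kind of independent cross-check, assuming the data is loaded into both systems from the same topic. The hosts, table names, and the use of the pinotdb and google-cloud-bigquery client libraries are assumptions:
```python
from pinotdb import connect        # pip install pinotdb
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Assumed endpoints and table names -- adjust to your environment.
PINOT_BROKER_HOST = "localhost"
PINOT_BROKER_PORT = 8099
PINOT_TABLE = "events"
BQ_TABLE = "my-gcp-project.pinot_mirror.events"

# Count rows in Pinot via the broker's SQL endpoint.
# Note: counts on a live realtime table keep moving as Pinot continues consuming.
pinot_conn = connect(host=PINOT_BROKER_HOST, port=PINOT_BROKER_PORT,
                     path="/query/sql", scheme="http")
pinot_cur = pinot_conn.cursor()
pinot_cur.execute(f"SELECT COUNT(*) FROM {PINOT_TABLE}")
pinot_count = pinot_cur.fetchone()[0]

# Count rows in BigQuery for the same data set.
bq_client = bigquery.Client()
bq_count = next(iter(bq_client.query(f"SELECT COUNT(*) AS c FROM `{BQ_TABLE}`").result())).c

print(f"Pinot: {pinot_count}, BigQuery: {bq_count}, match: {pinot_count == bq_count}")
```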