# getting-started
l
has anyone tried to move the segments folder to BigQuery, if you use Google Cloud? to do big data computations that may not be possible in Pinot?
m
The segment files are written in a format that's specific to Pinot, I don't think BigQuery would know what to do with them? (I think that's what you're asking but correct me if not)
l
yea, I guess my question then is: is there any easy way to offload data in Pinot to a big data store such as BigQuery?
m
that is a good question. A lot of Pinot users are ingesting data from one of the streaming tools (mostly Kafka), in which case the source of truth of the data would be Kafka rather than Pinot. I'm assuming you have a scenario where Pinot is the source of truth for the data?
Gonna tag in @User in case he has some insight on data export from Pinot
m
Thanks @User, you are right. Typical usage pattern is to maintain a source of truth of original data outside of Pinot.
l
I see, is it correct to assume that anything that gets into a Kafka topic makes it to Pinot then? My initial thought was to treat Kafka as the source of truth as well, but I was not sure if there was data drop or something like that between Kafka and Pinot. If it's pretty much equal, then yes, I think this is pretty much solvable with a sink from the topic to BigQuery.
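For reference, a rough sketch of what that sink could look like with Kafka Connect and a BigQuery sink connector. The topic, project, dataset, key file, and Connect URL are placeholders, and the exact property names vary by connector version, so check the connector docs:
```python
import requests  # assumes the Kafka Connect REST API is reachable

# Hypothetical values -- replace with your own topic, GCP project, dataset, and key file.
connector = {
    "name": "events-to-bigquery",
    "config": {
        # WePay/Confluent BigQuery sink connector; property names differ between
        # connector versions, so double-check against the connector's documentation.
        "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
        "tasks.max": "1",
        "topics": "events",                # the same topic Pinot consumes from
        "project": "my-gcp-project",
        "defaultDataset": "pinot_mirror",
        "keyfile": "/secrets/bq-key.json",
    },
}

# Standard Kafka Connect REST call to create the connector.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```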
m
Pinot embeds a Kafka client that takes messages off a topic and builds segments from the content of those messages, so yeah, I think you can assume the data on the Kafka topic == the data in Pinot [Mayank's answer is better, read that!]
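To make that concrete, the Kafka consumption is driven by the streamConfigs section of the REALTIME table config, something roughly like this. The topic, broker list, and table name are placeholders, and the config is abbreviated (a real table config has more required sections):
```python
import json

# Abbreviated sketch of a REALTIME table config -- only the stream-related part is shown.
table_config = {
    "tableName": "events",
    "tableType": "REALTIME",
    "tableIndexConfig": {
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "events",
            "stream.kafka.broker.list": "kafka:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.consumer.factory.class.name":
                "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder",
        }
    },
}

print(json.dumps(table_config, indent=2))
```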
m
Typically the stream data is ETL'd into a data lake and kept as the source of truth. Pinot does not enforce a 1-1 schema match with the Kafka topic; in fact, it allows for transformations, reading a subset of the schema, etc., in which case it won't hold the exact Kafka data.
The offline ETL pipeline is supposed to take care of correcting any issues in the stream data
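As an illustration of that point, the table config can carry ingestion transformations, and columns that aren't declared in the Pinot schema are simply not stored, so the rows in Pinot can legitimately differ from the raw Kafka messages. A rough sketch (the column and source field names are made up):
```python
import json

# Hypothetical ingestion config: derive a column at ingestion time; any field in the
# Kafka message that isn't in the Pinot schema is dropped.
ingestion_config = {
    "ingestionConfig": {
        "transformConfigs": [
            {
                "columnName": "event_date",
                # Transform functions are evaluated per message at ingestion time.
                "transformFunction": "toDateTime(event_ts_millis, 'yyyy-MM-dd')",
            }
        ]
    }
}

print(json.dumps(ingestion_config, indent=2))
```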
l
so basically we would treat the topic as the source of truth and not Pinot, is that correct?
and Pinot itself has assurances that data is not lost between Kafka and itself?
m
Yes, correct.
l
so in my case I'm trying to validate that the data we are ingesting is equal to what our current system has. I got stuck for a second because I cannot just query Pinot abruptly, so I was thinking that maybe if I had access to the segments I could upload them to BigQuery and then do the checks there
so that's why in this case I was weirded out by treating the topic that Pinot consumes from as the source of truth instead of Pinot
I guess the main issue is that we cannot issue a crazy query in Pinot, or something like that, that would take several records and sync them to BigQuery
m
Yes, that's not the right usage for Pinot.
You can still achieve the validation if you independently put the data into BQ as well as Pinot. If you ingest into and download from Pinot, "technically" that approach is not ideal, because the ingestion/download path of the system being tested might itself have an issue (not saying that is the case for Pinot)
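If it helps, a minimal sketch of that kind of independent cross-check, assuming the data is loaded into both systems from the same topic. The hosts, table names, and the use of the pinotdb and google-cloud-bigquery client libraries are assumptions:
```python
from pinotdb import connect        # pip install pinotdb
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Assumed endpoints and table names -- adjust to your environment.
PINOT_BROKER_HOST = "localhost"
PINOT_BROKER_PORT = 8099
PINOT_TABLE = "events"
BQ_TABLE = "my-gcp-project.pinot_mirror.events"

# Count rows in Pinot via the broker's SQL endpoint.
# Note: counts on a live realtime table keep moving as Pinot continues consuming.
pinot_conn = connect(host=PINOT_BROKER_HOST, port=PINOT_BROKER_PORT,
                     path="/query/sql", scheme="http")
pinot_cur = pinot_conn.cursor()
pinot_cur.execute(f"SELECT COUNT(*) FROM {PINOT_TABLE}")
pinot_count = pinot_cur.fetchone()[0]

# Count rows in BigQuery for the same data set.
bq_client = bigquery.Client()
bq_count = next(iter(bq_client.query(f"SELECT COUNT(*) AS c FROM `{BQ_TABLE}`").result())).c

print(f"Pinot: {pinot_count}, BigQuery: {bq_count}, match: {pinot_count == bq_count}")
```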