# general
s
Hi Everyone, We are evaluating Apache Pinot for our analytical use case. We have encountered some scenarios for which we haven't found a proper justification yet. Please help us understand the reasoning behind them and how to address these scenarios:
1. Why is inserting into a Pinot table via the Presto connector not supported, when almost all other SQL commands are supported?
2. Why is updating records with an UPDATE query not allowed on a Pinot table via Presto?
3. If we want to replicate the same set of data values in a Pinot table, how can we do it at present without Kafka ingestion? Example: we have 1M existing records and want to multiply them via insert into TableA ( select * from TableA ). Since the Presto connector does not allow inserting into a table and Pinot itself doesn't support subqueries, those two options are ruled out.
4. If we make a mistake with a column name during schema creation and later update the schema, will the previously ingested data values for that column be picked up automatically? Example: a realtime table has a column called "NAME" which was supposed to be "name". The Kafka stream data ingested so far has values for the "name" attribute, so after the schema change will Pinot automatically update the values for all rows, or do we need to retrofit the "name" values again? If we need to retrofit, what is the best possible way?
5. Can a single query read from both REALTIME & OFFLINE tables? Since subqueries and joins are not supported directly by Pinot, is there any way we can achieve that?
m
1. Ingestion in Pinot has traditionally been via offline batch jobs and realtime streams. Could you elaborate on the use case that requires inserting rows via Presto?
2. Upsert in Pinot is a newer feature and requires a primary key to identify the row to be updated. While we may definitely explore updates via Presto, they would likely still be primary-key based (as opposed to any generic condition).
3. If you don't want to use Kafka ingestion, you can push the data via the offline pipeline.
4. Schema changes have to be backward compatible, which your example isn't.
5. Offline and realtime tables are internal to Pinot. The client side only sees a single hybrid table, and Pinot answers queries using both the offline and realtime data (see the sketch below).
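To make point 5 concrete, here is a minimal sketch; the table name "orders" and its columns are hypothetical. The client only ever queries the logical table; the Pinot broker decides internally which portion of the time range is served by the OFFLINE table and which by the REALTIME table, then merges the results:
```sql
-- Query the single logical (hybrid) table; there is no need to
-- reference orders_OFFLINE or orders_REALTIME explicitly.
SELECT status, COUNT(*) AS cnt
FROM orders
GROUP BY status
LIMIT 10
```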
x
For the Presto-Pinot integration, we only connect the query path. There is no support for table and data ops.
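To illustrate the query-only path, here is a quick sketch; the catalog name "pinot", schema "default", and table "transcript" are assumptions and may differ in your deployment. Reads go through the connector, while table and data operations such as INSERT are expected to be rejected:
```sql
-- Supported: the read path through the Presto Pinot connector.
SELECT city, COUNT(*)
FROM pinot.default.transcript
GROUP BY city;

-- Not supported: the write path. The integration only wires up queries,
-- so an INSERT like this is rejected rather than executed.
INSERT INTO pinot.default.transcript
SELECT * FROM pinot.default.transcript;
```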
m
@User Based on your questions, I am really curious about what your use case is and how you are trying to use Pinot. Could you please share some details about that?
s
So, regarding inserting records into Pinot: we primarily want to generate load by replicating the same records. For testing we are pushing data to Kafka over the open network, so the process was a bit slow. As a workaround we thought: why can't we replicate the same set of records in the same table by running an insert command via Presto?
m
Why is pushing data via Kafka slow?
Also, are you saying all the points you mentioned above are in the context of testing? If so, what are you trying to test?
s
Okay, let me explain. We have a 30GB JSON file (consisting of 10M records) on our system. As per the Pinot documentation, we set up an EKS cluster on AWS, and we are providing input to Kafka from our local machine. So Kafka is streaming from the 30GB JSON file. The 10M records are taking almost 2 hours to load into Pinot because we are uploading to Kafka over the open network, so I believe internet bandwidth plays a role. Now, after a few iterations, once we have 50M records, we want to make it 150M directly by replicating the same 50M records twice. Hence we are looking for an insert command to push existing records into the same table.
m
You can also ingest data via batch ingestion (i.e., not Kafka): https://docs.pinot.apache.org/basics/data-import/batch-ingestion
10M records should take a few seconds; if it is taking 2 hours, you have a bottleneck elsewhere (I am not sure it is the network either, unless you have severely limited bandwidth).
s
Batch ingestion is for offline tables, right? Is there any other way, apart from Kafka, to ingest into a realtime table?
m
You have a JSON file, which is offline data, correct? If so, why do you want to pump it through Kafka? Why does it matter whether it goes to an offline or realtime table? Could you please elaborate on your end goal with this exercise?