# general
s
Hi Everyone, We are evaluating Apache Pinot for our analytical use case. We have encountered some scenarios for which we haven't found a proper justification yet. Please help us understand the reasoning behind them and how to address these scenarios:
1. Why is inserting into a Pinot table via the Presto connector not supported, when almost all other SQL commands are supported?
2. Why is updating records with an UPDATE query not allowed on a Pinot table via Presto?
3. If we want to replicate the same set of data values in a Pinot table, how can we do it at present without Kafka ingestion? Example: we have 1M existing records and want to multiply them via insert into TableA ( select * from TableA ). Since the Presto connector does not allow inserting into a table and Pinot itself doesn't support subqueries, those two options are ruled out.
4. If we make a mistake with a column name during schema creation and later update the schema, will the previously ingested data values for that column be picked up automatically? Example: a realtime table has a column called "NAME" which was supposed to be "name". The Kafka stream data ingested so far has values for the "name" attribute, so after the schema change will Pinot automatically update the values for all rows, or do we need to retrofit the "name" values again? If we need to retrofit, what is the best possible way?
5. Can a single query read from both REALTIME & OFFLINE tables? Since subqueries and joins are not supported directly by Pinot, is there any way we can achieve that?
m
1. Ingestion in Pinot has traditionally been via offline batch jobs and realtime streams. Could you elaborate on the use case that requires inserting rows via Presto?
2. Upsert in Pinot is a newer feature and requires a primary key to identify the row to be updated. While we may definitely explore updates via Presto, they would likely still be primary-key based (as opposed to any generic condition).
3. If you don't want to use Kafka ingestion, you can push the data via the offline pipeline.
4. Schema changes have to be backward compatible, which your example isn't.
5. Offline and realtime tables are internal to Pinot. The client side only sees a single hybrid table, and Pinot answers queries using both the offline and realtime data (see the sketch below).
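To make point 5 concrete, here is a minimal sketch; the table name "orders" and its columns are hypothetical. The client only ever queries the logical table; the Pinot broker decides internally which portion of the time range is served by the OFFLINE table and which by the REALTIME table, then merges the results:
```sql
-- Query the single logical (hybrid) table; there is no need to
-- reference orders_OFFLINE or orders_REALTIME explicitly.
SELECT status, COUNT(*) AS cnt
FROM orders
GROUP BY status
LIMIT 10
```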
x
For the Presto-Pinot integration, we only connect the query path. There is no support for table and data ops.
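To illustrate the query-only path, here is a quick sketch; the catalog name "pinot", schema "default", and table "transcript" are assumptions and may differ in your deployment. Reads go through the connector, while table and data operations such as INSERT are expected to be rejected:
```sql
-- Supported: the read path through the Presto Pinot connector.
SELECT city, COUNT(*)
FROM pinot.default.transcript
GROUP BY city;

-- Not supported: the write path. The integration only wires up queries,
-- so an INSERT like this is rejected rather than executed.
INSERT INTO pinot.default.transcript
SELECT * FROM pinot.default.transcript;
```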
m
@User Based on your questions, I am really curious about what your use case is and how you are trying to use Pinot. Could you please share some details about that?
s
So, regarding inserting records into Pinot: we primarily want to generate load by replicating the same records. For testing we are pushing data to Kafka over the open network, so the process was a bit slow. As a workaround we thought: why can't we replicate the same set of records in the same table by running an insert command via Presto?
m
Why is pushing data via Kafka slow?
Also, are you saying all the points you mentioned above are in the context of testing? If so, what are you trying to test?
s
Okay, let me explain. We have a 30GB JSON file (consisting of 10M records) on our system. As per the Pinot documentation, we set up an EKS cluster on AWS, and we are providing input to Kafka from our local machine. So Kafka is streaming from the 30GB JSON file. The 10M records are taking almost 2 hours to load into Pinot because we are uploading to Kafka over the open network, so I believe internet bandwidth plays a role. Now, after a few iterations, once we have 50M records, we want to make it 150M directly by replicating the same 50M records twice. Hence we are looking for an insert command to push existing records into the same table.
m
You can also ingest data via batch ingestion (i.e., not Kafka): https://docs.pinot.apache.org/basics/data-import/batch-ingestion
10M records should take a few seconds; if it is taking 2 hours, you have a bottleneck elsewhere (I am not sure it is the network either, unless you have severely limited bandwidth).
s
Batch ingestion is for offline tables, right? Is there any other way, apart from Kafka, to ingest into a realtime table?
m
You have a JSON file, which is offline data, correct? If so, why do you want to pump it through Kafka? Why does it matter whether it goes to an offline or realtime table? Could you please elaborate on your end goal with this exercise?