Hey everyone, I have data in Postgres that I need to send to a Pinot dimension table. Can this be done using an Apache Spark batch job? What would be other/better ways of doing it?
01/22/2022, 1:27 AM
Yeah, you can write a batch job to pull data from Postgres into a format like CSV, Avro, or Parquet. Once you have that, you can use the ingestion job to import the data into Pinot
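For the export half, a minimal PySpark sketch might look like this (the table name `users_dim`, the credentials, and the output path are all placeholder assumptions; the Postgres JDBC driver jar is assumed to be on Spark's classpath):

```python
# Step 1 sketch: export a Postgres table to Parquet so a Pinot ingestion
# job can pick it up. All connection details below are hypothetical.

def jdbc_options(jdbc_url, table, user, password):
    """Build the option map for Spark's JDBC reader."""
    return {
        "url": jdbc_url,                    # e.g. jdbc:postgresql://host:5432/mydb
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # driver jar must be on Spark's classpath
    }

def export_to_parquet(spark, jdbc_url, table, user, password, out_path):
    """Read one table over JDBC and land it as Parquet for the ingestion job."""
    df = (
        spark.read.format("jdbc")
        .options(**jdbc_options(jdbc_url, table, user, password))
        .load()
    )
    df.write.mode("overwrite").parquet(out_path)
```

You'd pass in an existing SparkSession; writing to a locked-down staging location and deleting it after the push is one way to mitigate PII concerns about the intermediate files.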
01/22/2022, 1:45 AM
Hi @User .. thanks for the reply. From the documentation I see that the following sources are supported for ingesting into Pinot: Amazon S3, Google Cloud Storage, HDFS, and Azure Data Lake Storage
I wanted to read data from Postgres and write it directly to Pinot using Spark (without storing it in HDFS first). But it seems that is not possible?
Since the data I want to populate in the Pinot dimension table is PII, storing it in HDFS is not the most desirable solution.
01/22/2022, 1:47 AM
Yeah, right now you need two steps: 1) export data from the source, 2) ingest it into Pinot
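For step 2, the standalone batch ingestion job is driven by a YAML spec; a rough sketch might look like the following (all paths, the table name, and the controller URI are placeholders, and the exact runner/reader class names should be checked against your Pinot version's docs):

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/secure/staging/users_dim/'        # where the Spark export landed the Parquet files
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: '/secure/staging/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'users_dim'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```

You'd launch it with `bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile job-spec.yaml` from the Pinot distribution.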
01/22/2022, 1:48 AM
The reason we suggest writing to deep store (HDFS etc.) before pushing to Pinot is to handle the case where Spark tasks fail.
The entire job might get restarted, resulting in duplicate data
01/22/2022, 3:14 AM
Hi Kishore.. thanks for the comment. It makes sense.
01/22/2022, 3:28 AM
I’m assuming your data size is small given you’re using a dimension table. One option could be to use something like Debezium to fetch the CDC log stream from Postgres and ingest that into Pinot
Of course, that means you’ll be reading the entire changelog, which can get expensive, but it should be fine for small datasets. The beauty of doing this is you get future updates for free
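As a sketch of that approach, a Debezium Postgres connector registered with Kafka Connect might be configured like this (all hostnames, credentials, and names are hypothetical placeholders, and property names vary somewhat across Debezium versions):

```json
{
  "name": "pg-dim-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-host",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "change-me",
    "database.dbname": "mydb",
    "table.include.list": "public.users_dim",
    "topic.prefix": "pgcdc"
  }
}
```

Pinot would then consume the resulting Kafka topic via a real-time table, so subsequent updates in Postgres keep flowing in automatically.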
02/16/2022, 5:25 AM
Hi @User… sorry for a (very) late response. But thanks a lot for your reply!