Hey everyone, I have data in Postgres that I need to send to a Pinot dimension table. Can this be done using an Apache Spark batch job? What would be other/better ways of doing it?
01/22/2022, 1:27 AM
Yeah, you can write a batch job to pull data from Postgres into a format like CSV, Avro, or Parquet. Once you have that, you can use the ingestion job to import the data into Pinot
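For the export half, a minimal PySpark sketch might look like this (the table name `users_dim`, the credentials, and the output path are all placeholder assumptions; the Postgres JDBC driver jar is assumed to be on Spark's classpath):

```python
# Step 1 sketch: export a Postgres table to Parquet so a Pinot ingestion
# job can pick it up. All connection details below are hypothetical.

def jdbc_options(jdbc_url, table, user, password):
    """Build the option map for Spark's JDBC reader."""
    return {
        "url": jdbc_url,                    # e.g. jdbc:postgresql://host:5432/mydb
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # driver jar must be on Spark's classpath
    }

def export_to_parquet(spark, jdbc_url, table, user, password, out_path):
    """Read one table over JDBC and land it as Parquet for the ingestion job."""
    df = (
        spark.read.format("jdbc")
        .options(**jdbc_options(jdbc_url, table, user, password))
        .load()
    )
    df.write.mode("overwrite").parquet(out_path)
```

You'd pass in an existing SparkSession; writing to a locked-down staging location and deleting it after the push is one way to mitigate PII concerns about the intermediate files.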
01/22/2022, 1:45 AM
Hi @User .. thanks for the reply. From the documentation I see that the following sources are supported for ingesting into Pinot: Amazon S3, Google Cloud Storage, HDFS, and Azure Data Lake Storage
I wanted to read data from Postgres and write it directly to Pinot using Spark (without storing it in HDFS first). But it seems that is not possible?
Since the data I want to populate in the Pinot dimension table is PII, storing it in HDFS is not the most desirable solution.
01/22/2022, 1:47 AM
Yeah, right now you need two steps: 1) export data from the source, 2) ingest it into Pinot
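For step 2, the standalone batch ingestion job is driven by a YAML spec; a rough sketch might look like the following (all paths, the table name, and the controller URI are placeholders, and the exact runner/reader class names should be checked against your Pinot version's docs):

```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: '/secure/staging/users_dim/'        # where the Spark export landed the Parquet files
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: '/secure/staging/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'users_dim'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```

You'd launch it with `bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile job-spec.yaml` from the Pinot distribution.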
01/22/2022, 1:48 AM
The reason we suggest writing to deep store (HDFS etc.) before pushing to Pinot is to handle the case where Spark tasks fail.
The entire job might get restarted, resulting in duplicate data
01/22/2022, 3:14 AM
Hi Kishore.. thanks for the comment. It makes sense.
01/22/2022, 3:28 AM
I’m assuming your data size is small given you’re using a dimension table. One option could be to use something like Debezium to fetch the CDC log stream from Postgres and ingest that into Pinot
Of course, that means you’ll be reading the entire changelog, which can get expensive, but it should be fine for small datasets. The beauty of doing this is you get future updates for free
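As a sketch of that approach, a Debezium Postgres connector registered with Kafka Connect might be configured like this (all hostnames, credentials, and names are hypothetical placeholders, and property names vary somewhat across Debezium versions):

```json
{
  "name": "pg-dim-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres-host",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "change-me",
    "database.dbname": "mydb",
    "table.include.list": "public.users_dim",
    "topic.prefix": "pgcdc"
  }
}
```

Pinot would then consume the resulting Kafka topic via a real-time table, so subsequent updates in Postgres keep flowing in automatically.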
02/16/2022, 5:25 AM
Hi @User… sorry for a (very) late response. But thanks a lot for your reply!