# general
n
Hi team, I am trying to bootstrap a realtime upsert-enabled table. I have around 2-3 years of data that I want to upload to this realtime table. I was trying to use the Spark segment generation job to create segments and then upload those segments to the realtime table. But the initial segment creation job itself fails because it searches for an OFFLINE table in the table config. I couldn't find any better guide/documentation for this. I was just going through the changes in this PR https://github.com/apache/pinot/pull/6567 and trying to follow them.
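For reference, the target here is an upsert-enabled REALTIME table; a minimal sketch of such a table config is below (table name, time column, topic, and broker are placeholders, not the actual config from this thread):

```json
{
  "tableName": "myUpsertTable",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "schemaName": "myUpsertTable",
    "timeColumnName": "eventTime",
    "replicasPerPartition": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "myTopic",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder"
    }
  },
  "routing": {
    "instanceSelectorType": "strictReplicaGroup"
  },
  "upsertConfig": {
    "mode": "FULL"
  },
  "metadata": {}
}
```

Upsert also requires a `primaryKeyColumns` list in the schema and the `strictReplicaGroup` routing shown above, and it is only supported for the REALTIME table type, which is why there is no OFFLINE counterpart for the batch job to find.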
m
@User @User ^^
j
@User Could you please share the steps you use to generate segments for the upsert table? Do you use the Pinot Spark job or some custom job?
n
I am using the Spark Pinot job. I have some Parquet files in S3 which I am trying to load. I have attached the ingestion YAML file.
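(The attached YAML isn't reproduced in this thread; below is a sketch of what such a Spark segment-generation spec for Parquet files in S3 typically looks like — bucket paths, region, table name, and controller URI are all placeholders:)

```yaml
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/bootstrap-data/'
includeFileNamePattern: 'glob:**/*.parquet'
outputDirURI: 's3://my-bucket/pinot-segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: org.apache.pinot.plugin.filesystem.S3PinotFS
    configs:
      region: 'us-east-1'
recordReaderSpec:
  dataFormat: 'parquet'
  className: 'org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReader'
tableSpec:
  tableName: 'myUpsertTable'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
```

The job resolves `tableSpec.tableName` against the controller, which is where the OFFLINE-table lookup described above fails for an upsert table.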
t
cc @User I think Yupeng used the Pinot Flink connector?
y
yes, you need flink for this
take a look at this guide
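(For anyone landing here later: a rough Java sketch of the Flink-connector approach, based on the pinot-flink-connector quickstart. `PinotSinkFunction` and `FlinkRowGenericRowConverter` are the connector's classes; verify exact signatures against your Pinot version. The toy in-memory source and the CLI-arg config passing are placeholders, not part of the connector.)

```java
import java.util.Arrays;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;
import org.apache.pinot.connector.flink.common.FlinkRowGenericRowConverter;
import org.apache.pinot.connector.flink.sink.PinotSinkFunction;
import org.apache.pinot.spi.config.table.TableConfig;
import org.apache.pinot.spi.data.Schema;
import org.apache.pinot.spi.utils.JsonUtils;

public class UpsertBootstrapJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Row layout of the records being pushed; must line up with the Pinot schema.
    RowTypeInfo typeInfo = new RowTypeInfo(
        new TypeInformation[] {Types.STRING, Types.LONG},
        new String[] {"id", "eventTime"});

    // Table config and schema passed in as JSON strings here (placeholder:
    // a real job would fetch them from the controller REST API).
    TableConfig tableConfig = JsonUtils.stringToObject(args[0], TableConfig.class);
    Schema schema = Schema.fromString(args[1]);

    // Toy bounded source standing in for the real S3 read. Keying by the
    // primary key keeps each key's rows in one subtask, hence one segment.
    env.fromCollection(Arrays.asList(Row.of("k1", 1000L), Row.of("k2", 2000L)))
        .returns(typeInfo)
        .keyBy(r -> r.getField(0))
        // PinotSinkFunction buffers rows, builds segments, and uploads them
        // to the (upsert-enabled) REALTIME table via the controller.
        .addSink(new PinotSinkFunction<>(
            new FlinkRowGenericRowConverter(typeInfo), tableConfig, schema));

    env.execute("pinot-upsert-bootstrap");
  }
}
```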
m
@User @User If this is a supported flow (as per https://github.com/apache/pinot/pull/6567), could we also support a Spark-based push?
n
Is there any way to create and load segments into realtime tables without using a Flink cluster and without pushing the complete bootstrap data to Kafka?
h
We have a similar use case where we want to bootstrap an upsert-enabled Pinot realtime table with 100 days of data and then perform realtime upserts via Kafka (a Flink stream pushes to Kafka). I tried to create and load segments into the realtime table using Spark but it failed. Also, this quickstart shows an example using a Flink stream. Is there a way to do this using Flink batch, or some other approach to load segments? @Nisheet did you find any solution for this? @Jackie @Ting Chen @Mayank @Yupeng Fu any suggestions on this?
y
if you want to use Spark, then you'll need to implement logic similar to this Flink/Pinot connector
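(The "similar logic" is roughly: buffer rows, cut a segment with a SegmentWriter, and push it with a SegmentUploader. A hedged sketch of that per-partition loop follows; the interface names come from Pinot's SegmentWriter/SegmentUploader SPI, which the Flink connector builds on, but the exact method signatures and the concrete implementations to plug in should be verified against your Pinot version — this is not a drop-in Spark job.)

```java
import java.net.URI;

import org.apache.pinot.spi.config.table.TableConfig;
import org.apache.pinot.spi.data.Schema;
import org.apache.pinot.spi.data.readers.GenericRow;
import org.apache.pinot.spi.ingestion.segment.uploader.SegmentUploader;
import org.apache.pinot.spi.ingestion.segment.writer.SegmentWriter;

public final class SparkPartitionSink {

  /**
   * Sketch of what a Spark equivalent of the Flink connector would run per
   * partition: buffer rows into a SegmentWriter, flush a segment, and upload
   * it to the REALTIME table. Concrete writer/uploader implementations and
   * the row source are left to the caller.
   */
  static void writePartition(Iterable<GenericRow> rows, SegmentWriter writer,
      SegmentUploader uploader, TableConfig tableConfig, Schema schema) throws Exception {
    writer.init(tableConfig, schema);
    for (GenericRow row : rows) {
      writer.collect(row);           // buffer one row
    }
    URI segmentTar = writer.flush(); // build + tar a segment from buffered rows
    writer.close();

    uploader.init(tableConfig);
    uploader.uploadSegment(segmentTar, null); // push to the controller
  }
}
```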
h
I just want to load 100 days of data, which is present in S3, into an upsert-enabled realtime table. If it can be done with Flink batch, that also works for me. The code mentioned in the thread is for Flink streaming.