Is it possible to batch ingest via Apache Flink to...
# general
r
Is it possible to batch ingest via Apache Flink into HDFS as the deep store in Apache Pinot?
m
Could you elaborate a bit? Is the data available in HDFS already, or is it in some streaming system like Kafka/Kinesis?
p
data available in HDFS
m
If the data is already available in HDFS, then you can use an ingestion job to ingest the data into Pinot.
p
But we need to run some transformations on the HDFS data. We can use a Flink job to do that and then push the result back to HDFS (transformedData). We can then use transformedData as the input directory and the Pinot deep store as the output directory in the ingestion job.
Can we configure the deep store as the output directory in the ingestion job YAML?
m
You can do a metadata push, where you send only the URI of the segment on HDFS to the controller, along with the segment metadata.
k
@User - yes, you can set up a Pinot segment generation job (the YAML file) to read input data from HDFS and write the resulting segment files to HDFS. Note that currently Pinot does NOT support using Flink for segment generation; it has to be either standalone (running as a regular Java process), Hadoop MapReduce, or Spark. We use Flink to generate CSV files, then run MapReduce to create the segments (reading CSV from HDFS and writing segments to HDFS). Then we do what @User suggests and use “metadata push” to tell the Pinot cluster about the segments.
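For reference, a minimal job spec along those lines might look roughly like the sketch below (using the standalone runner for brevity; the Hadoop/Spark runners take the same spec with different runner class names). The table name, HDFS paths, and controller address are placeholders, and the exact plugin class names can vary by Pinot version:
```yaml
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
jobType: SegmentCreationAndMetadataPush      # build segments, then metadata-push their URIs
inputDirURI: 'hdfs://namenode:8020/data/transformedData/'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: 'hdfs://namenode:8020/pinot/segments/myTable/'   # point this at your deep store location
overwriteOutput: true
pinotFSSpecs:
  - scheme: hdfs
    className: 'org.apache.pinot.plugin.filesystem.HadoopPinotFS'
    configs:
      hadoop.conf.path: '/etc/hadoop/conf/'
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
tableSpec:
  tableName: 'myTable'
  schemaURI: 'http://controller-host:9000/tables/myTable/schema'
  tableConfigURI: 'http://controller-host:9000/tables/myTable'
pinotClusterSpecs:
  - controllerURI: 'http://controller-host:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
```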
m
Thanks @User for filling in the details, much appreciated!
k
@User can help with generating segments directly from Flink and pushing them to Pinot.
I think they do this at css
e
This was our first approach: directly generating the segment (you can use either the Avro record reader or the generic row record reader), uploading it to the deep store, and using the FileUploadDownloadClient to give the controller the download URL. The current approach we use is “insert into” Pinot, but the PR is not merged upstream yet. @User, does that help?
lmk if you need any more info
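Roughly, that first (direct segment generation) approach looks like the sketch below. This is only a sketch, not our exact code: the class and package names are from recent Pinot releases and have moved between versions, and the table config, schema, input file, and working directory are placeholders you'd wire up yourself.
```java
import java.io.File;

import org.apache.pinot.segment.local.segment.creator.impl.SegmentIndexCreationDriverImpl;
import org.apache.pinot.segment.spi.creator.SegmentGeneratorConfig;
import org.apache.pinot.spi.config.table.TableConfig;
import org.apache.pinot.spi.data.Schema;
import org.apache.pinot.spi.data.readers.FileFormat;

public class SegmentBuildSketch {

  /** Builds one Pinot segment from an Avro file and returns the built segment directory. */
  public static File buildSegment(TableConfig tableConfig, Schema schema,
      File avroInputFile, File workingDir) throws Exception {
    SegmentGeneratorConfig config = new SegmentGeneratorConfig(tableConfig, schema);
    config.setInputFilePath(avroInputFile.getAbsolutePath());
    config.setFormat(FileFormat.AVRO);       // or plug in a generic-row record reader instead
    config.setOutDir(workingDir.getAbsolutePath());
    config.setSegmentNamePostfix("0");       // keep segment names unique per partition/batch

    SegmentIndexCreationDriverImpl driver = new SegmentIndexCreationDriverImpl();
    driver.init(config);
    driver.build();

    // Next steps (not shown): tar the segment directory, copy the tarball to the deep store
    // (an hdfs:// location), and hand the controller the download URI, e.g. with
    // org.apache.pinot.common.utils.FileUploadDownloadClient.
    return new File(workingDir, driver.getSegmentName());
  }
}
```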
k
Hi @User - thanks for the details, though it’s @User who’s asking about how to use Flink to build/push segments. But I’m curious: you wrote Flink operators that wrap Pinot code to generate segments as part of a Flink batch job, right? Also, I'm not sure what you mean by your current approach “…use insert into pinot”.
e
We found that using Flink was a lot of overhead for data engineers, so we implemented "insert into <pinot table> select ...". For our use case the data is all hooked up to our data warehouse, so this is much easier.
e
@User definitely curious to learn more if you’re willing to share 🙂 Do you have a link to the PR you mentioned that’s yet to be merged? And what are you using as a warehouse if you don’t mind me asking?