# troubleshooting
p
I asked in #general but didn't have much luck, so I'm trying here. What is the recommended approach for batch ingestion of data into Pinot from, say, S3 or Hive: minion-based ingestion vs. standalone ingestion vs. ingestion via a Spark job? Are there any pros/cons among the three?
d
I'm still a newbie with Pinot, but I'll risk answering (I recommend waiting for others' opinions too): it's possible to run ingestion jobs anywhere, provided the Pinot CLI is available and your Pinot cluster is reachable from wherever the ingestion job runs. So it could run on a Minion, on another server, outside the cloud, on your own computer, or even on the Pinot Controller itself (not recommended). So I guess it all comes down to what's most convenient for you and which setup best suits your needs.
👍 1
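If it helps, here's a rough sketch of what the standalone flavor can look like in practice: you write a job spec YAML and hand it to the Pinot CLI with `bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile jobspec.yaml`. Everything below (bucket, paths, table name, input format) is made up for illustration, so double-check against the docs for your version:

```yaml
# Hypothetical job spec for a standalone batch ingestion run (values are placeholders).
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
inputDirURI: 's3://my-bucket/raw/events/'        # assumed input location
includeFileNamePattern: 'glob:**/*.orc'
outputDirURI: 's3://my-bucket/pinot-segments/'   # assumed deep store location
overwriteOutput: true
pinotFSSpecs:
  - scheme: s3
    className: 'org.apache.pinot.plugin.filesystem.S3PinotFS'
recordReaderSpec:
  dataFormat: 'orc'
  className: 'org.apache.pinot.plugin.inputformat.orc.ORCRecordReader'
tableSpec:
  tableName: 'events'                            # assumed table name
pinotClusterSpecs:
  - controllerURI: 'http://pinot-controller:9000'
```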
p
Thank you
d
I'm developing a project where I'll use Argo Workflows; each Workflow (a job in execution) will spin up a new pod in Kubernetes and run the ingestion job inside that pod to generate the segments, which will then be pushed to the Pinot cluster.
No problem 🙂
s
We are doing batch ingestion from Hive tables built atop ORC files on S3, using Spark jobs running on spark-k8s-operator and scheduled via Airflow.
p
Thanks Satyam. Was there any specific reason for choosing Spark jobs vs. ingestion via the Minion framework? Did you have to run the Spark jobs on the same k8s cluster as Pinot for the networking to work?
k
@Priyank Bagrecha - you don’t have to run the job (Spark, Hadoop, or stand-alone) on the same servers as where Pinot is running (and you wouldn’t want to either). One of the big advantages of Spark/Hadoop (and now Flink) workflows is that they can scale independently of your Pinot cluster. e.g. we can run our Flink job on a 16 server cluster, using all available memory/CPU, and build 1200 big segments in a few hours. Another advantage I wanted to mention in using Spark/Flink/Hadoop is that your workflow can do much more complex data preparation before building the Pinot segments. E.g. we do a lot of text extraction/analysis that would be hard/impossible inside of Pinot. Finally, if you’re building segments outside of Pinot, then you can offload the CPU cycles needed to build indexes, by pre-building them into the segments. The segments get a bit bigger, but you won’t get CPU spikes when loading segments on your Pinot cluster.
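If you go the Spark route, the job spec is essentially the same as the standalone one; you swap the execution framework and submit the job via spark-submit. I believe the runner classes below come from the Spark batch ingestion plugin, but treat the rest as a sketch to adapt rather than a copy-paste recipe:

```yaml
# Rough sketch: same job spec shape as the standalone case, but with the Spark execution
# framework, so segment generation fans out across the Spark cluster instead of one process.
# The job itself is submitted with spark-submit, using
# org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand as the main class
# and the Pinot distribution jar plus plugins on the classpath.
executionFrameworkSpec:
  name: 'spark'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.spark.SparkSegmentTarPushJobRunner'
jobType: SegmentCreationAndTarPush
```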
d
Wait, what's that about pre-building indexes into segments? I didn't know about that feature; is there any doc on it?
k
In Table Index Config, see `createInvertedIndexDuringSegmentGeneration`.
There’s also recent work on support for pre-building bloom filters
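For reference, it sits under `tableIndexConfig` in the table config; a minimal sketch (column names are made up):

```json
{
  "tableIndexConfig": {
    "invertedIndexColumns": ["country", "deviceType"],
    "createInvertedIndexDuringSegmentGeneration": true
  }
}
```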
d
Very interesting! Thanks for the hint, man, I'll definitely use that. I'm building a system where I generate the segments outside of the Pinot cluster.
p
@Ken Krugler I am sorry if I implied running the job (Spark, Hadoop, or stand-alone) on the same servers as Pinot. I am deploying Pinot on a k8s cluster and discovered that Presto / Trino has to be deployed on the same k8s cluster as well for the integration to work, so the Pinot brokers can be discovered by Presto. So my question was along the same lines: would I need to run the job on the same k8s cluster, or can it be run from a completely independent k8s cluster? I hope this clears it up. Thank you for sharing the details about your setup.
k
I normally run my batch jobs to only generate (not push) segments, so in that case your k8s cluster only needs to be able to write to wherever your deep store is located. If you also want to push segments, then your k8s cluster needs to be able to talk to the Pinot cluster via the REST API, so appropriate network routing and open ports are required.
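Concretely, that's controlled by the jobType in the ingestion spec; a sketch (output path assumed):

```yaml
# SegmentCreation: only build segments and write them to outputDirURI (deep store);
# the job never needs to reach the Pinot cluster.
# SegmentCreationAndTarPush: also push the segment tarballs to the controller's REST API,
# which is when the job needs network access to the Pinot controller.
jobType: SegmentCreation
outputDirURI: 's3://my-bucket/pinot-segments/'   # assumed deep store location
```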
p
Good point. Thank you guys, this was very helpful.
k
@Diogo Baeder - looking at SegmentGeneratorConfig.java, I think that columns with a JSON index (or, in the next release, range indexes and bloom filters) will also have their indexes pre-built if they are listed in the indexing config. The documentation currently states that this is only done for inverted indexes. @saurabh dubey and @Kishore G should confirm. Also, I don't see any PR/tag for updating the docs for the above recent change in May…
s
@Diogo Baeder @Ken Krugler Yes, with #8601 (Add support for indexes during offline segment creation), range and bloom indexes will be prebuilt during offline segment creation by default (in addition to dictionary, forward encoding, inverted, JSON, text, etc. already being built offline). As @Ken Krugler mentioned, this is being done to improve segment load time and offload CPU-intensive index building. I'll get the documentation updated to reflect this change 👍
👍 2
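To make that concrete, a sketch of the kind of `tableIndexConfig` those prebuilt indexes would apply to (column names are made up):

```json
{
  "tableIndexConfig": {
    "rangeIndexColumns": ["timestampMillis"],
    "bloomFilterColumns": ["userId"],
    "jsonIndexColumns": ["payload"]
  }
}
```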
d
Thanks @saurabh dubey, that's nice to hear!