[Question] Hi, I’ve configured Apache Pinot with d...
# general
y
[Question] Hi, I’ve configured Apache Pinot with a deep store connected to Google Cloud Storage. Does this mean that cold (less frequently used) segments are persisted in GCS while hot segments are served from the Pinot servers as a sort of “cache”? I’m curious whether:
• All the segments are distributed across the Pinot servers, OR
• Only frequently used segments are cached on the Pinot servers, while unused segments are stored in the deep store (such as GCS / S3 / Azure Data Lake Storage / HDFS).
I’m asking because I’m concerned that, as we add more and more data, the number of nodes would always have to increase.
p
If you are not using tiered storage, all the segments are stored both on the Pinot servers and in the deep store (e.g. S3). That is done for reliability and fault tolerance: if a server node goes down, the node that comes up to replace it can download its segments from the deep store.
https://www.startree.ai/blog/introducing-tiered-storage gives a good introduction to tiered storage.
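To make the node-count concern concrete, here is a rough back-of-the-envelope sketch. It assumes every segment replica has to fit on local server disk (the no-tiered-storage case described above). The function name, the replication parameter, and the numbers are illustrative assumptions, not anything taken from Pinot itself.

```python
import math


def servers_needed(total_segment_gb: float,
                   replication: int,
                   usable_disk_gb_per_server: float) -> int:
    """Rough lower bound on the server count when every replica of every
    segment lives on local server disk (hypothetical helper)."""
    if usable_disk_gb_per_server <= 0:
        raise ValueError("usable_disk_gb_per_server must be positive")
    if replication < 1:
        raise ValueError("replication must be at least 1")
    # Total bytes on disk across the cluster = data size x number of replicas.
    total_on_disk_gb = total_segment_gb * replication
    return math.ceil(total_on_disk_gb / usable_disk_gb_per_server)


# Doubling the data doubles the disk-bound minimum, regardless of the deep store.
small = servers_needed(1000, replication=3, usable_disk_gb_per_server=500)
large = servers_needed(2000, replication=3, usable_disk_gb_per_server=500)
```

This only bounds disk capacity; CPU, memory, and query load would also factor into real sizing.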
y
I see. So the deep store is mainly for reliability and fault tolerance of the nodes, rather than for storing cold data.
Even with tiered storage, all the segments are persisted in the persistent disks such as SSD / HDD configured with the nodes (although could be using different persistent disk types), is this correct?
p
You can also use it for cold data. For example, if you have a hybrid table, which combines an offline table and a realtime table, you can push the offline table's segments to the deep store.
“Even with tiered storage, all the segments are persisted in the persistent disks such as SSD / HDD configured with the nodes (although could be using different persistent disk types), is this correct?”
Not exactly. I think right now you can have one tier where segments are stored on the Pinot servers and backed up in an S3 bucket, and another tier where segments are stored only in the S3 bucket. In that case, queries run directly against the S3 bucket for segments in the lower tier, and against the Pinot servers for segments in the higher tier. Does this help? I would highly recommend reading the link I shared above; it should clarify things better.
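For reference, here is a minimal sketch of what a tier definition can look like in a table config. It builds the `tierConfigs` fragment as I understand it from the open-source Apache Pinot docs, where a tier moves segments older than a given age onto a differently tagged set of servers (`storageType` of `pinot_server`). The tier where segments live only in object storage, as described in the linked blog post, is configured differently and is not shown here. The tier name, the 30-day age, and the `cold_tier_OFFLINE` tag are made-up values; please check the field names against the docs for your Pinot version.

```python
import json
from typing import Dict, List


def build_tier_configs(hot_days: int, cold_server_tag: str) -> List[Dict[str, str]]:
    """Build a tierConfigs list that moves segments older than `hot_days`
    to servers carrying `cold_server_tag` (hypothetical helper)."""
    return [
        {
            "name": "coldTier",              # assumed tier name
            "segmentSelectorType": "time",   # pick segments by age
            "segmentAge": f"{hot_days}d",    # segments older than this move
            "storageType": "pinot_server",   # tier is another set of servers
            "serverTag": cold_server_tag,    # tag of the cold-tier servers
        }
    ]


# Fragment to merge into the table config JSON.
table_config_fragment = {"tierConfigs": build_tier_configs(30, "cold_tier_OFFLINE")}
rendered = json.dumps(table_config_fragment, indent=2)
print(rendered)
```

Segments younger than the configured age stay on the table's regular servers; only the older ones are relocated to the tagged tier.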
y
Oh, I think that clarifies it further. So tiered storage supports SSD/HDD but also blob storage solutions like GCS/S3.
👍 1
Thank you so much for providing such a clear answer!
👍 1
p
One thing that I missed initially: if you don't configure a deep store, segments are saved to the controller's disk. I kept wondering why I kept running into disk-full errors 😂
😲 1
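In case it helps, here is a small sketch that renders the controller properties for pointing the deep store at GCS instead of the controller's local disk. The property keys are the ones I recall from the Pinot GCS documentation, so treat them as an assumption and verify them against your version. The bucket path, project ID, key path, and temp directory are placeholders.

```python
def gcs_deep_store_properties(bucket_path: str, project_id: str, key_path: str) -> str:
    """Render controller properties that use GCS as the deep store.
    Keys are recalled from the Pinot GCS docs; verify before use."""
    props = {
        # Where the controller persists segments (the deep store location).
        "controller.data.dir": f"gs://{bucket_path}",
        # Local scratch space only; placeholder path.
        "controller.local.temp.dir": "/tmp/pinot-controller-tmp",
        "pinot.controller.storage.factory.class.gs":
            "org.apache.pinot.plugin.filesystem.GcsPinotFS",
        "pinot.controller.storage.factory.gs.projectId": project_id,
        "pinot.controller.storage.factory.gs.gcpKey": key_path,
        "pinot.controller.segment.fetcher.protocols": "file,http,gs",
        "pinot.controller.segment.fetcher.gs.class":
            "org.apache.pinot.common.utils.fetcher.PinotFSSegmentFetcher",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Placeholder values for illustration only.
controller_conf = gcs_deep_store_properties(
    "my-bucket/pinot-data", "my-gcp-project", "/path/to/gcp-key.json"
)
print(controller_conf)
```

The servers need the matching `pinot.server.*` storage and fetcher settings as well so they can download segments from the bucket.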
y
I see. We are currently only using the batch ingestion option with Spark, with the deep store configured to GCS. In this case, the segments live on the Pinot servers, right?
p
If you don't use tiered storage, the segments will live both on the Pinot server disks and in the GCS bucket.
👍 1
y
Got it. Now I understand that I should use the tiered storage feature (which is independent of the deep store) to treat hot and cold data separately.