How does Pinot scale with offline tables? I get th...
# general
n
How does Pinot scale with offline tables? I get the impression that every offline segment is loaded onto an active offline server, which implies all of your offline data is loaded on some server. This seems very expensive, especially for something like 2-year-old data. Does Pinot lazily load old segments based on query demand? And how do indexes scale for offline tables?
m
We mmap the indexes, so they get paged in as needed. Depending on your SLA requirements, you can use SSDs or regular HDDs on the server nodes.
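To illustrate what "paged in as needed" means, here is a minimal Python sketch of memory-mapped reads. It is not Pinot code: the helper names `make_demo_file` and `read_slice` and the file layout are made up for the example. The point is only that mapping a file does not read it all; the OS loads the pages a lookup actually touches.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # OS page size, typically 4096 bytes


def make_demo_file(num_pages=1024):
    """Write a throwaway file where every page is filled with its own page number."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_pages):
            f.write(i.to_bytes(4, "little") * (PAGE // 4))
    return path


def read_slice(path, offset, length):
    """Map the file read-only and return one slice.

    Only the pages covering [offset, offset + length) need to be
    brought into memory; the rest of the file stays on disk.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


path = make_demo_file()
chunk = read_slice(path, 500 * PAGE, 4)   # touch a single page out of 1024
value = int.from_bytes(chunk, "little")
os.remove(path)
print(value)  # 500
```

The data still has to live on a disk attached to the server; mmap only keeps the resident memory footprint proportional to what queries touch.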
n
I'm talking more about something like using S3 for offline access.
I'm looking for something that can hit low-latency SLAs but retire old data to S3 daily or weekly. This will be large volumes of data (200k+ messages/sec), so we can't really keep all of it in normal storage.
ClickHouse + Parquet files in S3 + Presto is a workable solution, but doesn't really give you any indexing in offline mode. Pinot looked interesting in that it might bridge that gap between historical and real-time querying.
k
You can use a mounted EBS volume.
n
Right, but price-wise aren't EBS volumes much more expensive than S3 storage?
k
Yes, they are.
We don’t have native S3 support as of now.
y
You can think of Pinot as an indexing engine, so you can index the fields that you will query. If you want on-demand caching, there is no such thing in Pinot yet. However, you could explore a file system caching service like Alluxio, with S3 as the underlying storage for Pinot.
Btw, I have not tried this Alluxio setup with Pinot. In theory it works, but you would have to investigate.