Hey everyone, what size data do you store on Pinot...
# general
m
Hey everyone, what size of data do you store on Pinot, how many machines do you use, and what are the machine configurations like? Our current business data is about PB scale, but we store it differently from Pinot. We use HBase to store the fields' inverted indexes and write the row positions in another HBase table. Then we fetch the filtered records from HDFS. We use some techniques to reduce random IO, such as compression, encoding, storing data in batches, caching, etc. Because our data is stored in a row format, performance is really bad when a query hits a large number of records. As far as I know, a query may need to read a large number of segments (if it can't prune data by partition, star-tree, ...). Is that painful for Pinot, since Pinot may need to download lots of segments from the segment store and rebuild each segment's index in the servers' memory?
m
Hi, Pinot powers a wide variety of use cases across varying data and cluster sizes. Happy to help understand your use case better and provide suggestions.
Today, Pinot serving nodes maintain a local copy of the segments for serving. So there is no download involved in the query path.
m
Currently, we use Presto to execute queries on our storage: first we get the query's record IDs after filtering with our inverted index, then we fetch all of the matching records into Presto and do the rest of the execution there. Most of our use cases look like
SELECT * FROM table WHERE day BETWEEN 2017 AND 2022 AND fieldA LIKE '%AAA%'
There are also some aggregation queries, but it's painful when the number of matching records is huge, because of random IO. As far as I know, Pinot uses techniques like rich indexes and segment assignment. I'm wondering: if our queries match lots of segments, do we need to hold lots of segments in server memory? And after a query finishes, what is kept in memory for a segment on the server side, for example its different indexes? I guess scanning lots of data will hurt caches too.
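For readers following along, here is a minimal sketch of the two-step lookup described above: resolve row positions from inverted indexes first, then fetch the full rows. Everything in it is a hypothetical stand-in (the in-memory `rows` list plays the role of the HDFS row store, and the dictionaries play the role of the HBase index tables); it is not the actual implementation.

```python
from collections import defaultdict

# Hypothetical stand-in for the row store (HDFS in the setup described above).
rows = [
    {"day": 2016, "fieldA": "xxAAAxx"},
    {"day": 2018, "fieldA": "AAA"},
    {"day": 2019, "fieldA": "BBB"},
    {"day": 2021, "fieldA": "zzAAA"},
]


def build_index(field):
    """Build an inverted index: field value -> set of row positions."""
    index = defaultdict(set)
    for pos, row in enumerate(rows):
        index[row[field]].add(pos)
    return index


# Hypothetical stand-ins for the index tables (HBase in the setup described above).
day_index = build_index("day")
field_a_index = build_index("fieldA")


def matching_positions(substring, day_lo, day_hi):
    """Step 1: resolve row positions using only the indexes."""
    by_day = set()
    for day, positions in day_index.items():
        if day_lo <= day <= day_hi:
            by_day |= positions
    by_field = set()
    for value, positions in field_a_index.items():
        if substring in value:
            by_field |= positions
    return sorted(by_day & by_field)


def fetch_rows(positions):
    """Step 2: fetch full rows. In a real row store each position
    can cost one random read, which is what hurts on large result sets."""
    return [rows[p] for p in positions]


positions = matching_positions("AAA", 2017, 2022)
result = fetch_rows(positions)
print(positions, result)
```

The cost of step 2 grows with the number of matching positions, which is the random-IO problem mentioned above.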
m
Servers do have a copy of segments on the local disk, and memory map them. So during query execution, whatever is needed is pulled into memory from local disk (not deep-store).
Pinot’s indexing techniques avoid pulling/reading any data that is not relevant for executing queries.
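As a rough illustration of what memory-mapping a local file means in general (not Pinot's actual code, which is Java), here is a small Python sketch. The file name `segment.bin`, the fixed 8-byte record layout, and the `read_record` helper are all assumptions made up for this example. The point is that mapping the file does not load it; only the byte ranges that are actually accessed get paged in from local disk by the OS.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 8      # hypothetical fixed-width record, in bytes
NUM_RECORDS = 1000   # hypothetical number of records in the file

# Write a local file of fixed-width records (a stand-in for a segment file).
path = os.path.join(tempfile.mkdtemp(), "segment.bin")
with open(path, "wb") as f:
    for i in range(NUM_RECORDS):
        f.write(i.to_bytes(RECORD_SIZE, "big"))


def read_record(mapped, index):
    """Read one record by offset; only the touched pages are brought into memory."""
    start = index * RECORD_SIZE
    return int.from_bytes(mapped[start:start + RECORD_SIZE], "big")


with open(path, "rb") as f:
    # Map the whole file read-only. This reserves address space
    # but does not read the file contents up front.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        wanted = [3, 500, 999]  # e.g. positions selected by an index
        values = [read_record(mapped, i) for i in wanted]
    finally:
        mapped.close()

print(values)
```

If an index narrows a query down to a few positions, only the pages holding those positions need to be resident; the rest of the file stays on disk.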
m
I see, thanks a lot :)