# troubleshooting
l
hey, does anyone have any recommendations for a production env around capacity for the pinot-server? I’m working on a proof of concept with some real-time data and I already filled up my disk space. I have been reading about this stuff: https://docs.pinot.apache.org/operators/operating-pinot/tiered-storage https://docs.pinot.apache.org/operators/operating-pinot/pinot-managed-offline-flows For now I’m gonna increase my disk size, but chances are I’m gonna fill it up eventually lol. What are some of the things we can do to mitigate disk growth or save disk space?
m
What's the per-day data size you expect to be stored on Pinot servers? And how much are they storing now?
s
@Luis Fernandez back-of-the-envelope storage requirement for a realtime-only table:
numReplicas * dataPerDay * retentionDays
Hopefully you can get an approximate value for data per day via the size API we have on the controller, by dividing it by the number of days your table has been in place. Unless your ingestion rate changes heavily, this should hold at a high level.
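A rough sketch of that estimate in Python, assuming a controller at localhost:9000, a hypothetical table name, and the controller's GET /tables/{tableName}/size endpoint; the reportedSizeInBytes field is from memory of that API's response, and depending on your Pinot version it may already sum across replicas, so verify against your controller's Swagger UI before trusting the replica multiplier:

```python
import requests

CONTROLLER = "http://localhost:9000"   # assumed controller address
TABLE = "events"                       # hypothetical table name
DAYS_IN_PLACE = 7                      # how long the table has been ingesting
NUM_REPLICAS = 2
RETENTION_DAYS = 30

# Ask the controller how much data the table currently holds.
resp = requests.get(f"{CONTROLLER}/tables/{TABLE}/size")
resp.raise_for_status()
reported_bytes = resp.json().get("reportedSizeInBytes", 0)

# Back-of-the-envelope formula: numReplicas * dataPerDay * retentionDays
# (skip the replica multiplier if the reported size already covers all replicas)
data_per_day = reported_bytes / DAYS_IN_PLACE
estimated_bytes = NUM_REPLICAS * data_per_day * RETENTION_DAYS
print(f"~{data_per_day / 1e9:.1f} GB/day, ~{estimated_bytes / 1e9:.1f} GB needed across replicas")
```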
l
thank you, so my servers are really small, they have 4GB, and they consumed data for 7 hours till the disk filled up, so at this rate I may be looking at ~14GB per day of data? and I have 2 servers going.
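A quick check of that extrapolation (4 GB consumed in 7 hours, scaled to a full day):

```python
# 4 GB of segments in 7 hours, extrapolated to 24 hours
gb_per_day = 4 / 7 * 24
print(f"~{gb_per_day:.1f} GB/day")  # ~13.7 GB/day, i.e. roughly 14 GB
```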
if I configure a retention policy, does it mean that after that retention is done I can no longer access that data through pinot?
s
pinot will remove data that is older than retention time, and free up space. There is a background job in the controller (retention manager) that you need to make sure is enabled (it is enabled by default). Each time it runs, it removes the segments that are completely outside the retention window (looks at the time column value, not really the "age" of data).
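For reference, retention lives in the table config's segmentsConfig; a minimal sketch with illustrative values and a hypothetical time column name (check your own table config for the exact shape):

```python
# Relevant slice of a realtime table config (illustrative values only).
# The retention manager drops segments whose time-column values fall
# entirely outside this window and frees the disk space.
segments_config_snippet = {
    "segmentsConfig": {
        "timeColumnName": "eventTime",     # hypothetical time column
        "retentionTimeUnit": "DAYS",
        "retentionTimeValue": "30",
    }
}
```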
l
that makes a lot of sense @Subbu Subramaniam, thank you very much. one last question: what are some of the strategies to save disk space? do we compress any of the data in pinot or anything of the sort? thank you very much
s
pinot data is highly compressed (except for the currently consuming segment). It is always a space/performance trade-off: you can remove some indexes if you are willing to take additional query latency. We have a controller API that suggests what may be the most effective way of setting up your indexes, you should explore that. https://docs.pinot.apache.org/operators/configuration-recommendation-engine
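As an illustration of the space/latency trade-off, per-column indexes are declared in the table config's tableIndexConfig; a sketch with hypothetical column names (removing a column from these lists saves disk at the cost of slower queries that filter on it):

```python
# Illustrative slice of tableIndexConfig: each listed column carries an
# extra on-disk index. Dropping a rarely-filtered column trades space
# for query latency on that column.
table_index_config_snippet = {
    "tableIndexConfig": {
        "invertedIndexColumns": ["userId"],   # keep: filtered in most queries
        "rangeIndexColumns": [],              # dropped to save space
    }
}
```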
m
@Luis Fernandez Just to be sure: even if you use tiered storage or managed offline flows, you still need Pinot servers with attached storage (network-attached is fine) so the servers can host the data for your retention and replication.
l
I’m not sure I followed that last message, Mayank
but it did bring up a new question: https://docs.pinot.apache.org/basics/data-import/pinot-file-system this only refers to the segment store, which could be a system external to pinot, yea? like GCS for example. but this is only used when new servers are added to the cluster so that they can download any data they need from this central repo, or when they need to download a new segment in the case of an offline server. it’s not like servers can actually use this store to query data that isn’t currently in their local memory/disk?
s
The segment store is the actual copy of pinot data. Consider the data in servers as "cached". The servers download all the segments they are supposed to "host". As of now, they cache all the segments they host until the segments are removed (or retained out), which effectively means a second copy is stored in the servers. In fact, if you have N replicas, then N additional copies are stored. There is some very early work in progress to age out segments from servers (but still keep them in deep store) and fetch them on demand, but this is in very early stages, so, for now, you should account for N additional copies in the servers.
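A minimal sketch of that accounting under the behavior described above (one copy in the segment store plus one cached copy per replica on the servers); the numbers are illustrative:

```python
# Deep store holds one copy; each of the N replicas caches another copy.
data_per_day_gb = 14
retention_days = 30
num_replicas = 2

deep_store_gb = data_per_day_gb * retention_days
server_cache_gb = num_replicas * data_per_day_gb * retention_days
print(f"segment store: ~{deep_store_gb} GB, server disks (all replicas): ~{server_cache_gb} GB")
```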