# troubleshooting
Hi Kai, you don't necessarily want to reduce these:
• When data is first ingested it is placed on-heap in a "row buffer", which stores the data in simple row format. This data is unindexed, and any queries against real-time data will scan these row buffers in their entirety. This is known to not be the most performant scan.
• An intermediate persist takes the contents of the row buffer, converts it to a column-based, indexed, and encoded "mini-segment", and stores it on local disk. This persisted file is much quicker for queries to access than the row store because of the format and indexing.
• Before a segment is built there can be many (dozens or even over a hundred) of these intermediate persist files. When a query runs against these, the performance is again fairly good. The properties `maxBytesInMemory` and `intermediatePersistPeriod` control when persists occur; I've put a sketch of where they live in the spec after this message.
• Eventually the size or time threshold is reached and a segment is built and published. At that point the persist files are all collected into one regular-sized segment and sent to deep storage.
I would suggest reading through this section of the ingestion doc for an explanation of the various parameters involved: https://druid.apache.org/docs/latest/development/extensions-core/kafka-supervisor-reference.html#kafkasupervisortuningconfig
Because the persisted files serve queries faster than the row buffer, we generally see people lean towards more persisting and less time in the row buffer, balanced with appropriate timing for building segments. There are a lot of moving parts in this area of the product, so feel free to post more detailed questions.