hi Druid code gurus, I would like to ask one question about Druid's Kafka ingestion semantics. This is important for us to reason about whether data is lost or duplicated, so I would appreciate some clarity here. In `StreamAppenderator`, the default queue size for the `persistExecutor` is 0, which effectively makes its work queue a `SynchronousQueue`. The ingestion path in `SeekableStreamIndexTaskRunner.runInternal()` is async, with the push path running on the `persistExecutor` thread pool, when the queue size is greater than 0. When the queue size is 0, it is synchronous: when a persist happens, the ingestion path is blocked (see the sketch below).
The question is this: if the queue size is 0, does the restore logic in `SeekableStreamIndexTaskRunner.runInternal()` keep the exactly-once semantics of Kafka record ingestion, while if the queue size is greater than 0, the restore logic could potentially ingest some Kafka records more than once? If my understanding is right, tuning the queue size of the `persistExecutor` thread pool makes a very important semantic difference in terms of ingestion duplication, and it would be great to note this in the config docs. The current config doc states the following for the `maxPendingPersists` parameter, which is the queue size of the thread pool:
no (default = 0, meaning one persist can be running concurrently with ingestion, and none can be queued up)
This does not seem accurate. In fact, I think it should be stated as:
no (default = 0, meaning a persist blocks ingestion; this ensures that ingestion task recovery introduces no duplicate Kafka records)
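To spell out why I think the queue size matters on restore, here is a deliberately simplified, hypothetical model, not Druid code: it assumes the task restores from the offset recorded by the last completed persist, that persists still queued at crash time never complete, and a made-up persist trigger every three records. Everything here (`RestoreReplayModel`, the offsets) is invented for illustration:

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Hypothetical, simplified model -- not Druid code. Assumes: (1) the task
 * restores from the offset recorded by the last *completed* persist, and
 * (2) persists still sitting in the queue at crash time never complete.
 */
public class RestoreReplayModel
{
  public static void main(String[] args)
  {
    final int maxPendingPersists = 2;      // queue capacity under test (model assumes > 0)
    final Queue<Long> pendingPersists = new ArrayDeque<>();
    long lastCompletedPersistOffset = -1;  // the durable restore point
    long readOffset = -1;                  // how far ingestion has read

    for (long offset = 0; offset < 10; offset++) {
      readOffset = offset;
      if (offset % 3 == 2) {               // pretend a persist triggers here
        if (pendingPersists.size() == maxPendingPersists) {
          // Queue is full: ingestion waits for the oldest persist to finish,
          // which moves the durable restore point forward.
          lastCompletedPersistOffset = pendingPersists.poll();
        }
        pendingPersists.add(offset);       // queued, but not yet durable
      }
    }

    // Crash here: everything still in pendingPersists is lost.
    System.out.println("read up to offset:          " + readOffset);
    System.out.println("durable restore point:      " + lastCompletedPersistOffset);
    System.out.println("offsets re-read on restore: "
                       + (lastCompletedPersistOffset + 1) + ".." + readOffset);
  }
}
```

Under these assumptions, the replay window after a crash grows with the number of persists allowed to sit in the queue; whether re-reading those offsets actually results in duplicated rows (rather than just redoing lost work) is exactly the part I would like confirmed.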