hi Druid code gurus, I would like to ask one question about Druid's Kafka ingestion semantics. This is important for us to reason about whether data is lost or duplicated, so I would appreciate some clarity here. In `StreamAppenderator`, the default queue size for the `persistExecutor` is 0, which effectively makes its work queue a `SynchronousQueue`. The ingestion path in `SeekableStreamIndexTaskRunner.runInternal()` is async, with the push path running on the `persistExecutor` thread pool, when the queue size is greater than 0. When the queue size is 0, it is synchronous: when a persist happens, the ingestion path is blocked (see the sketch below).
The question is this: if the queue size is 0, does the restore logic in `SeekableStreamIndexTaskRunner.runInternal()` keep the exactly-once semantics of Kafka record ingestion, while if the queue size is greater than 0, the restore logic could potentially ingest some Kafka records more than once? If my understanding is right, tuning the queue size of the `persistExecutor` thread pool makes a very important semantic difference in terms of ingestion duplication, and it would be great to note this in the config docs. The current config doc states the following for the `maxPendingPersists` parameter, which is the queue size of the thread pool:
no (default = 0, meaning one persist can be running concurrently with ingestion, and none can be queued up)
This does not seem accurate. In fact, I think it should be stated as:
no (default = 0, meaning a persist blocks ingestion; this ensures that ingestion task recovery introduces no duplicate Kafka records)
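To spell out why I think the queue size matters on restore, here is a deliberately simplified, hypothetical model, not Druid code: it assumes the task restores from the offset recorded by the last completed persist, that persists still queued at crash time never complete, and a made-up persist trigger every three records. Everything here (`RestoreReplayModel`, the offsets) is invented for illustration:

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Hypothetical, simplified model -- not Druid code. Assumes: (1) the task
 * restores from the offset recorded by the last *completed* persist, and
 * (2) persists still sitting in the queue at crash time never complete.
 */
public class RestoreReplayModel
{
  public static void main(String[] args)
  {
    final int maxPendingPersists = 2;      // queue capacity under test (model assumes > 0)
    final Queue<Long> pendingPersists = new ArrayDeque<>();
    long lastCompletedPersistOffset = -1;  // the durable restore point
    long readOffset = -1;                  // how far ingestion has read

    for (long offset = 0; offset < 10; offset++) {
      readOffset = offset;
      if (offset % 3 == 2) {               // pretend a persist triggers here
        if (pendingPersists.size() == maxPendingPersists) {
          // Queue is full: ingestion waits for the oldest persist to finish,
          // which moves the durable restore point forward.
          lastCompletedPersistOffset = pendingPersists.poll();
        }
        pendingPersists.add(offset);       // queued, but not yet durable
      }
    }

    // Crash here: everything still in pendingPersists is lost.
    System.out.println("read up to offset:          " + readOffset);
    System.out.println("durable restore point:      " + lastCompletedPersistOffset);
    System.out.println("offsets re-read on restore: "
                       + (lastCompletedPersistOffset + 1) + ".." + readOffset);
  }
}
```

Under these assumptions, the replay window after a crash grows with the number of persists allowed to sit in the queue; whether re-reading those offsets actually results in duplicated rows (rather than just redoing lost work) is exactly the part I would like confirmed.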