Weixiang Sun
09/25/2021, 12:10 AM
1. The Kafka stream data has two time columns: processed_at and created_at.
2. The processed_at column is in order within the Kafka stream.
3. The created_at column is out of order within the Kafka stream.
The retention of the realtime Pinot table depends on created_at.
If we use created_at as timeColumnName, a lot of stale segments can be created, since created_at can be very old.
If we use processed_at as timeColumnName, a lot of old orders can live in the realtime table.
Do you guys have any suggestions about which one to choose as timeColumnName?

Subbu Subramaniam
09/25/2021, 5:05 PM
Records are retained while `createdAt > now - R` (where R is the retention). If R is high, then you need those old segments, so why are you worried about a lot of stale segments? Are the values of createdAt so random that all records ever ingested would be retained? As long as newer records in the Kafka topic have reasonable values for `createdAt` (i.e. higher than older ones), I would use createdAt as the time column. If necessary, you can add a filter at the time of creating the table to drop records with createdAt earlier than epoch `tableCreationTime - R`. On the other hand, if the createdAt values are all over the map all the time, then maybe what you need is a REFRESH table with no time column.

Weixiang Sun
09/28/2021, 4:28 PM
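[Editor's note] The table-creation-time filter Subbu suggests can be sketched with Pinot's ingestion `filterConfig`, where records for which `filterFunction` evaluates to true are dropped at ingestion. The table name, column name, and the literal cutoff epoch below are illustrative assumptions; the cutoff would be computed once, at table creation, as `tableCreationTime - R` in epoch millis:

```json
{
  "tableName": "orders",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "created_at",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "30"
  },
  "ingestionConfig": {
    "filterConfig": {
      "filterFunction": "Groovy({created_at < 1629849600000}, created_at)"
    }
  }
}
```

With this in place, records whose created_at falls before the fixed cutoff never produce segments, so the out-of-order created_at column can still serve as timeColumnName without generating a pile of immediately-stale segments.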