How does Pinot's real-time ingestion handle out-of...
# general
j
How does Pinot's real-time ingestion handle out-of-order event timestamps? E.g., if we have event timestamps that may be up to 30s out of order, how does Pinot address that? What about longer time ranges (1min, 10min)? For context: we have publisher-time timestamping that naively allows publishers to specify a "created" timestamp, but we force it to be the current timestamp if it's >30s from the current timestamp. This leads to some out-of-orderness.
m
Pinot does not require ordering of event timestamps. Out-of-order events are still consumed and indexed (there is no time-based partitioning). When you query for a specific time interval, all rows in that interval are processed (regardless of their order of ingestion).
j
So late events will be ingested into potentially a separate segment for the same time interval?
m
Yes, potentially it can go to a separate segment. But as I mentioned, segments are not time partitioned (unlike Druid), so it does not matter.
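To illustrate the point above, here is a minimal toy model, not Pinot's actual internals: segments are plain lists of rows in arrival order, and a time-interval query filters on event time across all of them. The names `Row`, `segment_a`, `segment_b`, and `query_interval` are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Row:
    event_ts: int  # event time, epoch seconds (hypothetical values)
    payload: str


# Each segment holds rows in arrival order, so late events sit next to
# newer ones. Nothing restricts a segment to a single time range.
segment_a = [Row(100, "a"), Row(130, "b"), Row(110, "late-1")]
segment_b = [Row(160, "c"), Row(105, "late-2"), Row(170, "d")]


def query_interval(segments, start_ts, end_ts):
    """Return every row whose event time is in [start_ts, end_ts],
    regardless of which segment it landed in."""
    return [
        row
        for segment in segments
        for row in segment
        if start_ts <= row.event_ts <= end_ts
    ]


result = query_interval([segment_a, segment_b], 100, 120)
payloads = [row.payload for row in result]
```

In this sketch the late event that landed in `segment_b` is returned alongside the rows from `segment_a`, because the filter is applied to the rows rather than to the segments.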
j
Just to clarify, when you say not time partitioned, do you mean segments do not need to be time partitioned, but if a time dimension is specified, then segments will be restricted to containing records for a given time interval of that time dimension?
My assumption is that they are, provided a time dimension is specified; otherwise it wouldn't be possible to do things like upsert offline segments, which take precedence over online segments.
m
No, a segment can have records from any time. Pinot does not enforce that a segment has data from a specific time partition.
In a pathological case, if you have a 2-day-old event come in 'now', it will still be stored in the segment that is open for consumption 'now'.
For your upsert case, the time boundary is computed as max(OfflineTime) - 1 day (assuming days granularity). This means an event up to a day old can arrive in realtime and still be processed (even though it is not available in offline).
k
When you push something to offline, the time boundary gets updated. Once the time boundary moves, events coming into real-time that are prior to the time boundary will not be processed during query. Typically, offline segments are pushed daily, so as long as late-arriving events come within 24 hours of the actual event time, you are good.
j
Ah, that's helpful to know that it happens at the query level rather than the segment level. So the boundary will always be max(OfflineTime) - 1 unit of granularity?
k
yes
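The boundary rule discussed above can be sketched as follows. This is only an illustration under stated assumptions: the function names `time_boundary` and `served_by` are hypothetical, the dates are made up, and the exact inclusivity of the comparison (offline serving times at or before the boundary, real-time serving times after it) is an assumption rather than something confirmed in this thread.

```python
from datetime import datetime, timedelta


def time_boundary(max_offline_time, granularity=timedelta(days=1)):
    """Boundary = max(OfflineTime) - 1 unit of granularity."""
    return max_offline_time - granularity


def served_by(event_time, boundary, table):
    """Decide at query time whether a row is read from the given table.

    Assumption: offline answers for times <= boundary and real-time
    answers for times > boundary.
    """
    if table == "offline":
        return event_time <= boundary
    return event_time > boundary


# Hypothetical example: the newest offline data is for 2020-06-10.
boundary = time_boundary(datetime(2020, 6, 10))

# A late event less than a day behind the offline max is still read
# from real-time.
late_ok = served_by(datetime(2020, 6, 9, 12), boundary, "realtime")

# An event older than the boundary that arrives in real-time is stored
# but skipped at query time.
too_late = served_by(datetime(2020, 6, 8), boundary, "realtime")
```

Under these assumptions `late_ok` is true and `too_late` is false, which matches the point that the cut-off is applied when the query runs, not when the row is written to a segment.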