We noticed that select from <table> and <...
# general
e
We noticed that select from <table> and <table>_offline started to diverge after a few hours. What is the cause of diverging counts?
c
I'm guessing you have a real-time counterpart of this table?
e
We just added it, but we had only offline until about 20 mins ago, and the counts started to diverge about 1 hour ago (initial deploy was ~6 hours ago).
Is there deduping or something like that?
If the query is done via the broker, it returns about 45k fewer rows for <table> than for <table>_OFFLINE.
So now the numbers diverge even more - does the broker only query offline tables?
c
Oh, the count for <table> is < the count for <table>_OFFLINE, interesting.
Are you sure it's not > ?
e
I'm going to blow everything away and restart - this time I will wait to add the realtime table so I can isolate the issue
m
Ok, let us know what you find
👍 1
e
Will do, this might take 20mins - 1 hour
m
So once you have a hybrid table, a time filter is added to the query before it is sent to the internal offline and real-time tables.
Pinot maintains a time boundary, which is the max time in the offline table minus 1.
This assumes that your real-time pipeline is working correctly for time >= time boundary.
My guess is that if you didn’t have real-time consumption, your result contains only the offline data with time < time boundary.
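A rough way to picture that boundary logic is a toy Python model. This is only a sketch, not Pinot's actual code: the function names, the integer time values, and the exact inclusive/exclusive comparisons on each side of the boundary are assumptions made for illustration.

```python
def time_boundary(offline_times, unit=1):
    """Boundary as described in the thread: max offline time minus one unit."""
    return max(offline_times) - unit


def hybrid_count(offline_times, realtime_times, unit=1):
    """Count rows the way a hybrid query would in this simplified model."""
    boundary = time_boundary(offline_times, unit)
    # Assumption: the offline side only answers for rows at or before the boundary.
    offline_part = sum(1 for t in offline_times if t <= boundary)
    # Assumption: the real-time side answers for rows after the boundary.
    realtime_part = sum(1 for t in realtime_times if t > boundary)
    return offline_part + realtime_part


# Hypothetical data: each value is the time bucket (e.g. a day number) of one row.
offline = [1, 1, 2, 2, 3, 3, 3]
realtime_empty = []            # real-time table exists but has consumed nothing
realtime_ok = [3, 3, 3, 4]     # real-time table covers the boundary onward

print(len(offline))                            # rows seen by querying _OFFLINE directly
print(hybrid_count(offline, realtime_empty))   # rows seen through the hybrid table
print(hybrid_count(offline, realtime_ok))
```

In this model, the hybrid count drops below the direct offline count whenever the real-time side has no rows past the boundary, which matches the kind of gap described above.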
e
So I redid everything last night, bulk loaded the data and now the counts match.
m
Do you have both offline and real-time?
In the new setup? And is real-time getting events?
e
I will add the realtime table now :) I wanted to verify whether there was an issue.