#general

Elon

12/17/2019, 1:34 AM
We noticed that the counts from select on <table> and <table>_OFFLINE started to diverge after a few hours. What would cause the counts to diverge?

Chinmay Soman

12/17/2019, 2:07 AM
I'm guessing you have a real-time counterpart of this table?

Elon

12/17/2019, 2:08 AM
We just added it, but we only had offline until about 20 mins ago, and the counts started to diverge about 1 hour ago (the initial deploy was ~6 hours ago).
Is there deduping or something like that?
If the query goes through the broker, it returns about 45k fewer rows for <table> than for <table>_OFFLINE.
So now the numbers diverge even more. Does the broker only query offline tables?

Chinmay Soman

12/17/2019, 2:32 AM
Oh, the count for table is < the count for table_OFFLINE? Interesting.
Are you sure it's not > ?

Elon

12/17/2019, 2:46 AM
I'm going to blow everything away and restart - this time I will wait to add the realtime table so I can isolate the issue

Mayank

12/17/2019, 2:46 AM
Ok, let us know what you find

Elon

12/17/2019, 2:47 AM
Will do, this might take 20 mins to 1 hour.

Mayank

12/17/2019, 3:05 AM
So once you have a hybrid table, a time filter is added to the query before it is sent to the internal offline and real-time tables.
Pinot maintains a time boundary, which is the max time in the offline table minus 1.
This assumes that your real-time pipeline is working correctly for time >= time boundary.
My guess is that if you didn't have real-time consumption, your result is only the offline data < time boundary.
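
To make the time boundary explanation concrete, here is a toy Python model of how a hybrid count could come out lower than the count on the _OFFLINE table alone. It is only a sketch, not Pinot code: the function names are hypothetical, times are plain integers (think day buckets), and the exact inclusivity at the boundary (offline keeps time <= boundary, real-time keeps time > boundary) is an assumption of this model.

```python
def time_boundary(offline_times):
    """Boundary as described in the thread: max time in offline minus 1."""
    return max(offline_times) - 1


def hybrid_count(offline_times, realtime_times):
    """Count rows the way a hybrid query would see them in this toy model."""
    boundary = time_boundary(offline_times)
    # Assumption: the offline side keeps time <= boundary,
    # and the real-time side keeps time > boundary.
    offline_part = sum(1 for t in offline_times if t <= boundary)
    realtime_part = sum(1 for t in realtime_times if t > boundary)
    return offline_part + realtime_part


# Seven offline rows; the latest time bucket is 3, so the boundary is 2.
offline = [1, 1, 2, 2, 3, 3, 3]

direct_offline_count = len(offline)               # querying <table>_OFFLINE directly
no_realtime_count = hybrid_count(offline, [])     # real-time table exists but has no data
with_realtime_count = hybrid_count(offline, [3, 3, 3, 4])

print(direct_offline_count, no_realtime_count, with_realtime_count)
```

In this model the offline rows past the boundary are filtered out of the hybrid query, so if the real-time side has not consumed anything for that range, the hybrid count falls short of the direct _OFFLINE count.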

Elon

12/17/2019, 6:16 PM
So I redid everything last night, bulk loaded the data and now the counts match.

Mayank

12/17/2019, 6:17 PM
Do you have both offline and real-time?
In the new setup? And is real-time getting events?

Elon

12/17/2019, 7:06 PM
I will add the realtime now :) I wanted to verify whether there was an issue.