# troubleshooting
d
I'm hitting an issue where the offline data I loaded does not match the query results from Pinot. I run
select * from metrics limit 20000
and I see
numDocsScanned = 15400
totalDocs = 17118
totalDocs matches my raw data size (what I'd expect). The query results match numDocsScanned (which is wrong). I'm happy to share data and schema in a private message.
m
Are you running the query above as-is, or do you have a filter? If the former, then: a) does the schema of all segments match? b) are all segments ONLINE in the external view?
d
I'm running that as-is.
The schema for all segments should match. There are 2 star tree indices.
When I use the swagger web ui, I see only 1 segment.
m
in idealstate or in external view?
d
It's weird because
numSegmentsQueried=2
numServersResponded=2
numSegmentsProcessed=1
numSegmentsMatched=1
numConsumingSegmentsQueried=1
I don't know whether that's in the idealstate or in the external view.
m
is this a hybrid table?
d
It's a hybrid table
m
if so, perhaps the time column (and unit) is messed up
d
Interesting
m
because of that, the offline segment is getting pruned
d
I'll look into that. Thanks!
m
👍
d
Yea, that seems related. If I query just offline, the counts are correct.
select * from metrics_OFFLINE limit 20000
m
yep
d
This is on my testing instance. This data is older. There isn't recent offline or realtime data.
That's probably it
So the last day of offline data isn't being used
Are there documents describing which offline data is used? I found the following but it'd be nice to have a specific description. https://docs.pinot.apache.org/configuration-reference/table#hybrid-table-config
The retention on the realtime table does not overlap with this.
I'm assuming the offline data is not considered finalized (until later offline data is populated?)
m
I think there's a doc explaining how the time boundary is computed. But in essence, with days as the time unit, the time boundary is max(offlineDay) - 1.
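A rough sketch of that rule in Python, in case it helps to see why the last offline day disappears. This is not Pinot code: the `Segment` class, the helper names, and the `daysSinceEpoch` column are made up for illustration, and the filter split (offline side gets `<= boundary`, realtime side gets `> boundary`) is my understanding of how the broker handles a hybrid table, so treat it as an assumption.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for an offline segment's time metadata."""
    name: str
    start_day: int  # days since epoch
    end_day: int    # days since epoch


def time_boundary(offline_segments):
    # With days as the time unit: boundary = max(offlineDay) - 1.
    return max(s.end_day for s in offline_segments) - 1


def split_hybrid_filters(time_column, boundary):
    # Assumed behavior: the offline table serves data up to the boundary,
    # the realtime table serves everything after it.
    offline_filter = f"{time_column} <= {boundary}"
    realtime_filter = f"{time_column} > {boundary}"
    return offline_filter, realtime_filter


def visible_offline_docs(doc_days, boundary):
    # Count offline docs that survive the "<= boundary" filter.
    return sum(1 for day in doc_days if day <= boundary)


# Made-up data: five offline docs spread over days 100..102.
doc_days = [100, 100, 101, 102, 102]
segments = [Segment("metrics_0", start_day=100, end_day=102)]

boundary = time_boundary(segments)
offline_filter, realtime_filter = split_hybrid_filters("daysSinceEpoch", boundary)
served = visible_offline_docs(doc_days, boundary)
```

With this made-up data the boundary lands on day 101, so the two docs from day 102 are only reachable through the realtime side. If the realtime table has nothing for that day, they drop out of the hybrid query, while querying `metrics_OFFLINE` directly still returns all of them.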