For an offline (batch-generated) table, if I don’t...
# general
k
For an offline (batch-generated) table, if I don’t specify a `segmentIngestionFrequency`, then are `APPEND` and `REFRESH` values for `segmentIngestionType` essentially equivalent?
m
Is this a hybrid table? Not specifying the frequency might mess up time boundary depending on time unit.
k
Just OFFLINE
I guess the meta-question is what happens if I create a new version of an existing segment file for an offline table, and do a metadata push. I’m assuming that’s a refresh, and Pinot will correctly handle that.
m
Another place it is used is for interval check for validation.
Even for an APPEND table, you can refresh any segment at any time
That is how backfill works
k
So if I’ve got an offline table segment that I update on a daily basis, what are the recommended settings? Use `REFRESH` with a `segmentIngestionFrequency` of 1 day?
m
REFRESH is typically used for full refresh of data. These tables typically don't have a time column. If either one is not true for you, you might be ok just with APPEND
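For reference, a minimal sketch of where these two settings sit in a table config, assuming the `ingestionConfig.batchIngestionConfig` layout. The table name `myTable`, the `DAILY` value, and the helper function are illustrative assumptions, and the result is only a fragment, not a complete table config; check the field layout against the Pinot docs for your version.

```python
import json


def offline_table_config(table_name, ingestion_type="APPEND", frequency="DAILY"):
    """Build a partial OFFLINE table config (hypothetical helper, illustrative only)."""
    return {
        "tableName": table_name,
        "tableType": "OFFLINE",
        "ingestionConfig": {
            "batchIngestionConfig": {
                # APPEND: new segments are added over time (refreshing one is still allowed).
                # REFRESH: the data set is typically replaced in full.
                "segmentIngestionType": ingestion_type,
                # How often a push is expected; assumed value for a daily pipeline.
                "segmentIngestionFrequency": frequency,
            }
        },
    }


config = offline_table_config("myTable")
print(json.dumps(config, indent=2))
```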
k
And what guarantees does Pinot provide (if any) for what happens to queries that are executing when an updated segment is being reloaded?
m
Single segment update is atomic. As in a query will either see old or new segment, not a partially updated segment.
If you are refreshing a bunch of segments, then you can have a situation where some segments are refreshed and others are not
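The difference between the two cases can be pictured with a toy model. This is not Pinot's implementation, just a sketch that assumes each segment is an immutable object swapped in by a single reference replacement; `SegmentRegistry` and the segment names are made up for illustration.

```python
import threading


class SegmentRegistry:
    """Toy model: maps a segment name to an immutable tuple of rows."""

    def __init__(self):
        self._segments = {}
        self._lock = threading.Lock()

    def replace(self, name, rows):
        # Build the new segment fully, then swap the reference in one step,
        # so a reader never observes a half-written segment.
        new_segment = tuple(rows)
        with self._lock:
            self._segments[name] = new_segment

    def snapshot(self):
        # A "query" takes references to whole segments as they exist right now.
        with self._lock:
            return dict(self._segments)


registry = SegmentRegistry()
registry.replace("seg_1", ["v1"] * 3)
registry.replace("seg_2", ["v1"] * 3)

# Refresh both segments, with a query landing between the two swaps.
registry.replace("seg_1", ["v2"] * 3)
mid_refresh = registry.snapshot()
registry.replace("seg_2", ["v2"] * 3)

for name, rows in sorted(mid_refresh.items()):
    print(name, set(rows))
```

In this model the mid-refresh query sees `seg_1` entirely at the new version and `seg_2` entirely at the old one: each segment is internally consistent, but the set of segments is not refreshed as a unit.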
k
Thanks, good to know.
Though I’m still curious about the meaning of `segmentIngestionFrequency` for an OFFLINE table. Why does Pinot care if I update every day or every week?
m
It is only used in two places (based on what I see with a quick grep of code):
1. Time boundary (only applies to hybrid table).
2. There are checks that ensure data is pushed as expected (for operational monitoring).
k
OK, guess I need to dig into the operational monitoring stuff more - thanks
m
Yeah, think of this situation - Your user of Pinot thinks data is being pushed to it daily, but their pipeline has been failing (and they didn't notice it). The first thing they would do is ask the question - "Why is Pinot not showing my latest data?" We build some checks to ensure we can automatically detect this situation.
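The kind of check described here can be sketched roughly as follows. This is not Pinot's actual validation code; the function name, the frequency-to-interval mapping, and the six-hour grace period are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed mapping from the configured frequency to an expected push interval.
FREQUENCY_TO_INTERVAL = {
    "HOURLY": timedelta(hours=1),
    "DAILY": timedelta(days=1),
}


def is_push_overdue(latest_segment_end, frequency, now, grace=timedelta(hours=6)):
    """Return True if the newest data is older than one interval plus a grace period."""
    interval = FREQUENCY_TO_INTERVAL[frequency]
    return now - latest_segment_end > interval + grace


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
fresh = datetime(2024, 1, 9, 12, tzinfo=timezone.utc)  # data ends 12 hours ago
stale = datetime(2024, 1, 7, tzinfo=timezone.utc)      # data ends 3 days ago

print(is_push_overdue(fresh, "DAILY", now))
print(is_push_overdue(stale, "DAILY", now))
```

With a daily frequency, the first table would pass and the second would be flagged, which is the "pipeline has been silently failing" case.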