#general

Ken Krugler

06/16/2021, 10:40 PM
For an offline (batch-generated) table, if I don’t specify a `segmentIngestionFrequency`, then are `APPEND` and `REFRESH` values for `segmentIngestionType` essentially equivalent?

Mayank

06/16/2021, 10:45 PM
Is this a hybrid table? Not specifying the frequency might mess up the time boundary, depending on the time unit.

Ken Krugler

06/16/2021, 10:45 PM
Just OFFLINE
I guess the meta-question is what happens if I create a new version of an existing segment file for an offline table, and do a metadata push. I’m assuming that’s a refresh, and Pinot will correctly handle that.

Mayank

06/16/2021, 10:49 PM
Another place it is used is the interval check during validation.
Even for an APPEND table, you can refresh any segment at any time.
That is how backfill works.

Ken Krugler

06/16/2021, 10:50 PM
So if I’ve got an offline table segment that I update on a daily basis, what are the recommended settings? Use `REFRESH` with a `segmentIngestionFrequency` of 1 day?

Mayank

06/16/2021, 10:51 PM
REFRESH is typically used for full refresh of data. These tables typically don't have a time column. If either one is not true for you, you might be ok just with APPEND
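
For reference, the two settings discussed here could be written out roughly as below. This is a minimal sketch, not a verified config: the nesting under `ingestionConfig.batchIngestionConfig` and the `DAILY` value are assumptions about the table config format that should be checked against your Pinot version, and `myTable` and the helper function are hypothetical names.

```python
import json


def batch_ingestion_config(ingestion_type="APPEND", frequency="DAILY"):
    """Build the batch-ingestion part of an OFFLINE table config (assumed layout)."""
    if ingestion_type not in ("APPEND", "REFRESH"):
        raise ValueError(f"unexpected segmentIngestionType: {ingestion_type}")
    return {
        "ingestionConfig": {
            "batchIngestionConfig": {
                "segmentIngestionType": ingestion_type,
                "segmentIngestionFrequency": frequency,
            }
        }
    }


# Hypothetical table with a time column that is updated daily: APPEND is enough.
config = {
    "tableName": "myTable",
    "tableType": "OFFLINE",
    **batch_ingestion_config("APPEND", "DAILY"),
}
print(json.dumps(config, indent=2))
```

Per the advice above, `REFRESH` would only be the better fit for a table that is fully replaced on each push and has no time column.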

Ken Krugler

06/16/2021, 10:52 PM
And what guarantees does Pinot provide (if any) for what happens to queries that are executing when an updated segment is being reloaded?

Mayank

06/16/2021, 10:53 PM
Single segment update is atomic. As in a query will either see old or new segment, not a partially updated segment.
If you are refreshing a bunch of segments, then you can have a situation where some segments are refreshed and others are not
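
The guarantee described here can be pictured with a toy model. This is only an illustration of "atomic per segment, not atomic across segments"; it is not how Pinot is implemented, and `SegmentRegistry` and the segment names are hypothetical.

```python
import threading


class SegmentRegistry:
    """Toy registry: each segment is swapped in as a whole, one at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._segments = {}

    def replace(self, name, rows):
        # Build the immutable replacement first, then swap it in under the lock,
        # so a reader never observes a half-written segment.
        new_segment = tuple(rows)
        with self._lock:
            self._segments[name] = new_segment

    def snapshot(self):
        # A query works against the set of segments visible when it starts.
        with self._lock:
            return dict(self._segments)


registry = SegmentRegistry()
registry.replace("seg_0", [1, 2])
registry.replace("seg_1", [3, 4])

before = registry.snapshot()          # query sees old seg_0 and old seg_1
registry.replace("seg_0", [10, 20])   # refresh of seg_0 lands; seg_1 not yet
during = registry.snapshot()          # query sees new seg_0 with old seg_1
```

In `during`, every individual segment is either fully old or fully new, but the set as a whole is a mix, which is the multi-segment situation mentioned above.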

Ken Krugler

06/16/2021, 11:00 PM
Thanks, good to know.
Though I’m still curious about the meaning of `segmentIngestionFrequency` for an OFFLINE table. Why does Pinot care if I update every day or every week?

Mayank

06/16/2021, 11:02 PM
It is only used in two places (based on what I see with a quick grep of code):
1. Time boundary (only applies to hybrid table).
2. There are checks that ensure data is pushed as expected (for operational monitoring).
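
To make the first item concrete, here is a rough sketch of how a push frequency could feed into a hybrid-table time boundary. It assumes the commonly described rule that the boundary is the latest offline segment end time minus one push interval (one hour for hourly pushes, one day otherwise); that rule and the function name are assumptions, not taken from this thread or from the Pinot source.

```python
from datetime import datetime, timedelta


def time_boundary(max_offline_end_time, frequency="DAILY"):
    """Assumed rule: step back one push interval from the newest offline data."""
    offset = timedelta(hours=1) if frequency == "HOURLY" else timedelta(days=1)
    return max_offline_end_time - offset


# With daily pushes, offline data would serve times up to the boundary,
# and the realtime table would serve everything after it.
boundary = time_boundary(datetime(2021, 6, 16))
print(boundary)
```

Under that assumption, a frequency that does not match the real push cadence would shift the boundary, which is one way the time boundary could get "messed up" for a hybrid table.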

Ken Krugler

06/16/2021, 11:03 PM
OK, guess I need to dig into the operational monitoring stuff more - thanks

Mayank

06/16/2021, 11:05 PM
Yeah, think of this situation: your Pinot user thinks data is being pushed to it daily, but their pipeline has been failing (and they didn't notice it). The first thing they would do is ask the question, "Why is Pinot not showing my latest data?" We built some checks to ensure we can automatically detect this situation.
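
A check of this kind might look something like the sketch below. It is not Pinot's validation code; the function, the frequency-to-interval mapping, and the grace period are all hypothetical and only illustrate the idea of comparing the newest data against the declared push frequency.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a declared push frequency to an expected interval.
FREQUENCY_TO_INTERVAL = {
    "HOURLY": timedelta(hours=1),
    "DAILY": timedelta(days=1),
}


def is_push_overdue(latest_segment_end, frequency, now, grace=timedelta(0)):
    """Return True if no data has arrived within one push interval (plus grace)."""
    interval = FREQUENCY_TO_INTERVAL[frequency]
    return now - latest_segment_end > interval + grace


now = datetime(2021, 6, 16, 23, 0)
on_time = is_push_overdue(datetime(2021, 6, 16, 1, 0), "DAILY", now)   # 22 hours old
stalled = is_push_overdue(datetime(2021, 6, 13, 0, 0), "DAILY", now)   # almost 4 days old
print(on_time, stalled)
```

Without a declared frequency, a check like this has no expected interval to compare against, which is why the setting matters for monitoring even on an OFFLINE table.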