Hi. team~ What is the difference between 'append' ...
# general
c
Hi. team~ What is the difference between 'append' and 'refresh' in segmentPushType for offline table? https://docs.pinot.apache.org/configuration-reference/table#segments-config
m
APPEND means the data push is incremental (say hourly or daily) and each push is appended to the table.
REFRESH implies that all data in this table will be refreshed with each push.
Note that these are a way to specify user intent to Pinot, so that it can take care of background work like time-boundary management, retention, etc.
k
@User - actually doesn’t REFRESH imply that segments will be refreshed (replaced completely) during the push? Asking because we have a table where we push a subset of segments as updates every day, configured as refresh, and it’s working fine. So it’s not “all data to this table will be refreshed”.
@User - I was looking at the page that @User referenced in this question (https://docs.pinot.apache.org/configuration-reference/table#segments-config), and noticed that it said to use `IngestionConfig -> BatchIngestionConfig -> segmentPushType` as of 0.7, but the only link to `IngestionConfig` on that page takes you to Ingestion Transformations (https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations), which doesn’t talk about the configuration settings. I see that `BatchIngestionConfig` is also referenced on https://docs.pinot.apache.org/basics/data-import/batch-ingestion, but that’s more of a tutorial than field documentation.
m
@User I don’t think Pinot does the REFRESH on its own. It relies on the user to overwrite all segments. That is something that needs to be clarified in the docs if it is not already. cc: @User
k
Hi @User - My comment wasn’t about how much Pinot does on its own. It’s about whether “all segments” need to be overwritten or not. Based on my experience, refresh just means that the user is responsible for updating segments (which can be all segments for a table, or some subset), but it’s not a full table operation.
m
You are correct: the user is responsible, and if they only update part of the data, Pinot won’t flag it.
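The point above (a push replaces whichever segments the user uploads, not the whole table) can be pictured with a toy model. This is only a sketch, assuming segments are keyed by name and that pushing a segment with an existing name overwrites it; the `Segment` class, the `push` helper, and the `version` field are hypothetical names for illustration, not Pinot APIs.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Segment:
    name: str      # assumed key: a push with an existing name overwrites it
    start_ms: int
    end_ms: int
    version: int   # hypothetical counter, only to make a replacement visible


def push(table: Dict[str, Segment], segment: Segment) -> Dict[str, Segment]:
    """Return the table state after pushing one segment (toy model)."""
    updated = dict(table)
    updated[segment.name] = segment  # same name: replaced; new name: added
    return updated


table: Dict[str, Segment] = {}
table = push(table, Segment("seg_day1", 0, 10, version=1))
table = push(table, Segment("seg_day2", 11, 20, version=1))
# Re-push only one segment; the other one is left untouched.
table = push(table, Segment("seg_day2", 11, 20, version=2))
```

In this model the push type plays no role in the replacement itself, which matches the experience described above of refreshing only a subset of segments.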
c
@User @User Thank you. I found that the offline table's RetentionManager only deletes segments of type 'append'. https://github.com/apache/pinot/blob/82f7aefe3d0e9e9cee5ba519279a1425259210ce/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/retention/RetentionManager.java#L107
So I was curious about the difference between the two, but I still can't tell the difference between 'refresh' and 'append'. When I tested with sample data (transcript data), both types of tables replaced segments when ingesting data in the same time range. I expected the 'refresh' type to replace the segment and the 'append' type to create one more segment, leaving two copies of the data, but that is not what happened. Can you tell me in more detail how the behavior of these two differs in Pinot?
Also, is it correct to use 'ingestionConfig > batchIngestionConfig > segmentPushType' instead of 'segmentsConfig > segmentPushType'? When I set 'ingestionConfig > batchIngestionConfig > segmentPushType', it is not actually reflected. When I set 'segmentsConfig > segmentPushType', it is reflected, and I tested with this setting.
tableConfig:
{
  "tableName": "transcript_trans",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript_trans",
    "replication": 1,
    "timeColumnName": "timestampInEpoch",
    "timeType": "MILLISECONDS",
    "retentionTimeUnit": "HOURS",
    "retentionTimeValue": 1
  },
  "tenants": {
    "broker": "DefaultTenant",
    "server": "DefaultTenant"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentPushType": "APPEND"
    },
    "filterConfig": {
      "filterFunction": "Groovy({(score1 as float) >= 4 && (score1 as float) < 6}, score1)"
    },
    "transformConfigs": [
      {
        "columnName": "fullName",
        "transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
      },
      {
        "columnName": "scoreSum",
        "transformFunction": "Groovy({score1+score2}, score1, score2)"
      },
      {
        "columnName": "datetime",
        "transformFunction": "toDateTime(timestampInEpoch, 'yyyy-MM-dd HH:mm:ss')"
      }
    ]
  },
  "metadata": {}
}
-- Pinot controller UI, tableConfig
"ingestionConfig": {
  "batchIngestionConfig": {},
  "filterConfig": {
    "filterFunction": "Groovy({(score1 as float) >= 4 && (score1 as float) < 6}, score1)"
  },
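One possible reading of the empty `batchIngestionConfig` above is that the batch ingestion block expects a different key. Later in this thread the key `segmentIngestionType` is used under `batchIngestionConfig`, so the sketch below assumes that key, assumes the legacy `segmentsConfig > segmentPushType` acts as a fallback, and assumes APPEND is the default when neither is set. The helper `effective_push_type` is a hypothetical name, not a Pinot API.

```python
from typing import Any, Dict

DEFAULT_PUSH_TYPE = "APPEND"  # assumed default when nothing is configured


def effective_push_type(table_config: Dict[str, Any]) -> str:
    """Resolve the push type from a table config dict (illustrative only)."""
    ingestion = table_config.get("ingestionConfig") or {}
    batch = ingestion.get("batchIngestionConfig") or {}
    # Assumed key name inside batchIngestionConfig: segmentIngestionType.
    push_type = batch.get("segmentIngestionType")
    if push_type is None:
        # Assumed fallback to the legacy location in segmentsConfig.
        segments = table_config.get("segmentsConfig") or {}
        push_type = segments.get("segmentPushType")
    return (push_type or DEFAULT_PUSH_TYPE).upper()


# A key named segmentPushType under batchIngestionConfig is ignored here,
# so this config resolves to the assumed default.
misplaced = {
    "ingestionConfig": {"batchIngestionConfig": {"segmentPushType": "REFRESH"}}
}
legacy = {"segmentsConfig": {"segmentPushType": "REFRESH"}}
```

Under these assumptions, `effective_push_type(misplaced)` falls back to the default while `effective_push_type(legacy)` picks up REFRESH, which would be consistent with only the `segmentsConfig` setting being reflected.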
m
In terms of Pinot’s behavior you are right: the only difference is that the retention manager will delete segments only for an APPEND table. The time boundary also applies to APPEND. REFRESH is more for snapshot-type use cases where you want to rewrite the entire data in Pinot with each push.
In your case, is it a real-time table, or a hybrid table with a time column and incremental pushes to Pinot? If yes, that indicates you need an APPEND table.
But if it is offline only and you want to overwrite all data in Pinot each time, then it is a REFRESH use case.
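The retention difference described above can be sketched as a small function. This is a simplification under stated assumptions, not the actual RetentionManager code: it assumes retention is skipped entirely unless the push type is APPEND, and that a segment is purged once its end time falls before `now - retention`. The function name and the `(name, end_ms)` tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

HOUR_MS = 60 * 60 * 1000


def segments_to_purge(
    push_type: Optional[str],
    retention_ms: int,
    now_ms: int,
    segments: List[Tuple[str, int]],  # (segment name, end time in ms)
) -> List[str]:
    """Return names of segments a retention pass would delete (toy model)."""
    # Assumption: only APPEND tables are subject to retention.
    if push_type is None or push_type.upper() != "APPEND":
        return []
    cutoff_ms = now_ms - retention_ms
    return [name for name, end_ms in segments if end_ms < cutoff_ms]


now_ms = 10 * HOUR_MS + 1000
segments = [("old", 1000), ("recent", 10 * HOUR_MS)]
purged_append = segments_to_purge("APPEND", HOUR_MS, now_ms, segments)
purged_refresh = segments_to_purge("REFRESH", HOUR_MS, now_ms, segments)
```

With a one-hour retention, as in the table config earlier in the thread, the APPEND case purges only the expired segment, and the REFRESH case purges nothing.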
c
Is this the time boundary you mentioned? https://docs.pinot.apache.org/basics/components/broker https://github.com/apache/pinot/blob/cf8b84e8b0d6ab62374048de586ce7da21132906/pino[…]ache/pinot/broker/routing/timeboundary/TimeBoundaryManager.java
To test this, I created a hybrid table. The offline table is of type REFRESH.
"ingestionConfig": {
  "batchIngestionConfig": {
    "segmentIngestionType": "REFRESH"
  }
},
"segmentsConfig": {
  "schemaName": "transcript",
  "segmentPushType": "REFRESH",
I tested both ingestionConfig and segmentsConfig.
studentID,firstName,lastName,gender,subject,score,timestampInEpoch
101,Bob,King,Female,Maths,3.8,1570863600000
102,Bob,King,Female,English,3.5,1571036400000
103,Bob,King,Male,Maths,3.2,1571900400000
104,Bob,King,Male,Maths,3.2,1571900400000
105,Bob,King,Male,Physics,3.6,1572418800000
I ingested this data into the offline table. The row with studentID 105 is not returned by queries, so the time boundary seems to have worked.
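The missing row 105 is consistent with a time boundary computed from the offline data. The sketch below is a rough model, assuming the boundary is the largest offline end time minus a fixed offset and that the offline side of a hybrid query only sees rows at or before the boundary. The one-day offset and the function names are assumptions for illustration, not taken from TimeBoundaryManager.

```python
from typing import List, Tuple

DAY_MS = 24 * 60 * 60 * 1000  # assumed offset; the real value may differ

# (studentID, timestampInEpoch) taken from the sample rows above
rows: List[Tuple[int, int]] = [
    (101, 1570863600000),
    (102, 1571036400000),
    (103, 1571900400000),
    (104, 1571900400000),
    (105, 1572418800000),
]


def time_boundary_ms(end_times_ms: List[int], offset_ms: int = DAY_MS) -> int:
    """Largest offline end time minus the assumed offset."""
    return max(end_times_ms) - offset_ms


def offline_side(data: List[Tuple[int, int]], boundary_ms: int) -> List[int]:
    """IDs visible to the offline part of a hybrid query (ts <= boundary)."""
    return [student_id for student_id, ts in data if ts <= boundary_ms]


boundary = time_boundary_ms([ts for _, ts in rows])
visible_ids = offline_side(rows, boundary)
```

In this model, rows after the boundary would be expected to come from the real-time table instead, so a row that exists only in the offline data and sits past the boundary does not show up.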
m
Yes
c
The Row with studentID of 105 is not queried, and the time boundary seems to have worked.
m
Just to confirm: are you saying everything works as expected, or do you have a question?
c
From your previous explanation, I thought that time boundaries only apply to append tables. Is that not the case? https://apache-pinot.slack.com/archives/CDRCA57FC/p1650417794266129?thread_ts=1650342514.345859&cid=CDRCA57FC
m
I see. REFRESH is typically meant to be used with offline-only tables, hence no time boundary. In your setup, are you using it as a hybrid table? If so, what does it mean to refresh all of the offline data with every push while real-time is appending? Note that there is no refresh for real-time (it will always only append consumed data).