Hi team~ What is the difference between append and refresh in Apache Pinot #general

coco

04/19/2022, 4:28 AM

Hi team~ What is the difference between 'append' and 'refresh' in segmentPushType for an offline table? https://docs.pinot.apache.org/configuration-reference/table#segments-config

Mayank

04/19/2022, 4:29 AM

Append means data push is incremental (say hourly/daily) and is appended to the table.

Mayank

04/19/2022, 4:29 AM

REFRESH implies that all data to this table will be refreshed with each push.

Mayank

04/19/2022, 4:30 AM

Note, these are a way to specify user intent to Pinot, so it can take care of background tasks like time-boundary management, retention, etc.

Ken Krugler

04/19/2022, 4:15 PM

@User - actually doesn’t REFRESH imply that segments will be refreshed (replaced completely) during the push? Asking because we have a table where we push a subset of segments as updates every day, configured as refresh, and it’s working fine. So it’s not “all data to this table will be refreshed”.

Ken Krugler

04/19/2022, 4:18 PM

@User - I was looking at the page that @User referenced in this question (https://docs.pinot.apache.org/configuration-reference/table#segments-config), and noticed that it said to use IngestionConfig -> BatchIngestionConfig -> segmentPushType as of 0.7, but the only link to IngestionConfig on that page takes you to Ingestion Transformations (https://docs.pinot.apache.org/developers/advanced/ingestion-level-transformations), which doesn’t talk about the configuration settings. I see that BatchIngestionConfig is also referenced on https://docs.pinot.apache.org/basics/data-import/batch-ingestion, but that’s more like a tutorial versus field documentation.

Mayank

04/19/2022, 5:18 PM

@User I don’t think Pinot does the REFRESH on its own. It relies on the user to overwrite all segments. That is something that needs to be clarified in the docs if it is not. cc: @User

Ken Krugler

04/19/2022, 6:09 PM

Hi @User - My comment wasn’t about how much Pinot does on its own. It’s about whether “all segments” need to be overwritten or not. Based on my experience, refresh just means that the user is responsible for updating segments (which can be all segments for a table, or some subset), but it’s not a full table operation.

Mayank

04/19/2022, 6:21 PM

You are correct: the user is responsible, and if they only update part of the data, Pinot won’t flag it.

coco

04/20/2022, 12:50 AM

@User @User Thank you. I found that the offline table's RetentionManager only deletes segments of type 'append'. https://github.com/apache/pinot/blob/82f7aefe3d0e9e9cee5ba519279a1425259210ce/pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/retention/RetentionManager.java#L107 So I was curious about the difference between the two.

But I still can't tell the difference between 'refresh' and 'append'. When I tested with sample data (transcript data), both types of tables replaced segments when injecting data in the same time range. I expected that the 'refresh' type would replace the segment, and that the 'append' type would create one more segment so that there would be 2 copies of the data, but that was not the case. Can you tell me in more detail how the behavior of these two differs in Pinot?

And is it correct to use 'ingestionConfig > batchIngestionConfig > segmentPushType' instead of 'segmentsConfig > segmentPushType'? When I set 'ingestionConfig > batchIngestionConfig > segmentPushType', it is not actually reflected. When I set 'segmentsConfig > segmentPushType', it is actually reflected, and I tested with this setting.

tableConfig:

{
  "tableName": "transcript_trans",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "schemaName": "transcript_trans",
    "replication": 1,
    "timeColumnName": "timestampInEpoch", "timeType": "MILLISECONDS",
    "retentionTimeUnit": "HOURS", "retentionTimeValue": 1
  },
  "tenants": { "broker":"DefaultTenant", "server":"DefaultTenant" },
  "tableIndexConfig": {
    "loadMode": "MMAP"
  },
  "ingestionConfig": {
    "batchIngestionConfig": {
      "segmentPushType": "APPEND"
    },
    "filterConfig": {
      "filterFunction": "Groovy({(score1 as float) >= 4 && (score1 as float) < 6}, score1)"
    },
    "transformConfigs": [
      {
        "columnName": "fullName",
        "transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
      },
      {
        "columnName": "scoreSum",
        "transformFunction": "Groovy({score1+score2}, score1, score2)"
      },
      {
        "columnName": "datetime",
        "transformFunction": "toDateTime(timestampInEpoch, 'yyyy-MM-dd HH:mm:ss')"
      }
    ]
  },
  "metadata": {}
}

-- pinot controller ui, tableConfig

  "ingestionConfig": {
    "batchIngestionConfig": {},
    "filterConfig": {
      "filterFunction": "Groovy({(score1 as float) >= 4 && (score1 as float) < 6}, score1)"
    },
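A note on the config question above: later in this thread, the key written inside batchIngestionConfig is segmentIngestionType rather than segmentPushType. Assuming that is the key the controller expects (nobody in the thread confirms this), an unrecognized key may simply be dropped on save, which would be consistent with the empty batchIngestionConfig shown in the controller UI. The Python sketch below only builds such a config as a dict. build_offline_table_config is a hypothetical helper, not a Pinot API, and the table and column names are copied from the message above.

```python
import json


def build_offline_table_config(table_name, push_type="APPEND"):
    """Build an offline table config dict (hypothetical helper, not a Pinot API)."""
    if push_type not in ("APPEND", "REFRESH"):
        raise ValueError(f"unknown push type: {push_type}")
    return {
        "tableName": table_name,
        "tableType": "OFFLINE",
        "segmentsConfig": {
            "schemaName": table_name,
            "replication": 1,
            "timeColumnName": "timestampInEpoch",
            "timeType": "MILLISECONDS",
            "retentionTimeUnit": "HOURS",
            "retentionTimeValue": 1,
        },
        "ingestionConfig": {
            "batchIngestionConfig": {
                # Assumption: key name taken from a later message in this
                # thread, not verified against the Pinot source.
                "segmentIngestionType": push_type,
            }
        },
    }


config = build_offline_table_config("transcript_trans")
print(json.dumps(config, indent=2))
```

If the assumption holds, the controller UI should show the value inside batchIngestionConfig after saving instead of an empty object.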

Mayank

04/20/2022, 1:23 AM

In terms of Pinot’s behavior, you are right: the only difference is that the retention manager will delete segments for an append table. Also, the time boundary applies to append tables. Refresh is more for snapshot-type use cases where you want to rewrite the entire data in Pinot with each push.

Mayank

04/20/2022, 1:24 AM

In your case, is it a real-time table, or a hybrid table with a time column and incremental push to Pinot? If yes, that indicates you need an append table.

Mayank

04/20/2022, 1:24 AM

But if it is offline only and each time you want to overwrite all data in Pinot, then it is a refresh use case.
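The behavior described in the last few messages can be sketched as a toy model. This is not Pinot code. It assumes that a pushed segment whose name already exists replaces the old one for both push types (which would explain why both test tables "replaced" segments for the same time range), and it takes from the thread that retention only deletes segments of APPEND tables. Segment, OfflineTable, push and run_retention are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    end_time_ms: int
    rows: int


class OfflineTable:
    """Toy model of an offline table; illustrative only."""

    def __init__(self, push_type, retention_ms):
        self.push_type = push_type
        self.retention_ms = retention_ms
        self.segments = {}  # segment name -> Segment

    def push(self, segment):
        # Assumption: a segment with an existing name is replaced,
        # regardless of whether the table is APPEND or REFRESH.
        self.segments[segment.name] = segment

    def run_retention(self, now_ms):
        # Per the thread: retention only deletes segments of APPEND tables.
        if self.push_type != "APPEND":
            return []
        expired = [
            name
            for name, seg in self.segments.items()
            if now_ms - seg.end_time_ms > self.retention_ms
        ]
        for name in expired:
            del self.segments[name]
        return expired


HOUR_MS = 3_600_000
for push_type in ("APPEND", "REFRESH"):
    table = OfflineTable(push_type, retention_ms=HOUR_MS)
    table.push(Segment("transcript_0", end_time_ms=0, rows=5))
    table.push(Segment("transcript_0", end_time_ms=0, rows=5))  # same name again
    deleted = table.run_retention(now_ms=2 * HOUR_MS)
    print(push_type, "deleted:", deleted, "remaining:", len(table.segments))
```

In this model both tables hold a single segment after the second push, and only the APPEND table loses it once the retention period has passed.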

coco

04/25/2022, 2:14 AM

Is this the time boundary you mentioned? https://docs.pinot.apache.org/basics/components/broker https://github.com/apache/pinot/blob/cf8b84e8b0d6ab62374048de586ce7da21132906/pino[…]ache/pinot/broker/routing/timeboundary/TimeBoundaryManager.java

To test this, I created a hybrid table. The offline table is of type REFRESH.

"ingestionConfig": { "batchIngestionConfig": { "segmentIngestionType": "REFRESH" } },
"segmentsConfig": { "schemaName": "transcript", "segmentPushType": "REFRESH",

I tested both ingestionConfig and segmentsConfig.

studentID,firstName,lastName,gender,subject,score,timestampInEpoch
101,Bob,King,Female,Maths,3.8,1570863600000
102,Bob,King,Female,English,3.5,1571036400000
103,Bob,King,Male,Maths,3.2,1571900400000
104,Bob,King,Male,Maths,3.2,1571900400000
105,Bob,King,Male,Physics,3.6,1572418800000

I injected this data into the offline table. The row with studentID 105 is not returned by queries, and the time boundary seems to have worked.

Mayank

04/25/2022, 2:15 AM

Yes

coco

04/25/2022, 2:20 AM

The row with studentID 105 is not returned by queries, and the time boundary seems to have worked.

Mayank

04/25/2022, 2:23 AM

Just to confirm: are you saying everything works as expected, or do you have a question?

coco

04/25/2022, 2:29 AM

From your previous explanation, I thought that time boundaries only happen with append tables. Isn't that right? https://apache-pinot.slack.com/archives/CDRCA57FC/p1650417794266129?thread_ts=1650342514.345859&cid=CDRCA57FC

Mayank

04/25/2022, 1:00 PM

I see. REFRESH is typically meant to be used with offline-only tables, hence no time boundary. For your setup, are you using it as hybrid? If so, what does it mean to refresh all of the offline data with every push when real-time is appending? Note there is no refresh for real-time (it will always only append consumed data).
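For the five-row test earlier in the thread, the missing row 105 is what a time boundary would produce under two assumptions the thread does not spell out: that the broker sets the boundary to the latest offline end time minus one day, and that the offline side of a hybrid table only serves rows with a timestamp at or below that boundary. A minimal Python sketch under those assumptions, with hypothetical function names:

```python
DAY_MS = 24 * 60 * 60 * 1000


def time_boundary(offline_end_times_ms, offset_ms=DAY_MS):
    """Boundary = latest offline end time minus an offset (assumed one day)."""
    return max(offline_end_times_ms) - offset_ms


def offline_rows_served(rows, boundary_ms):
    """Rows the offline side would serve, assuming a `time <= boundary` filter."""
    return [row for row in rows if row[1] <= boundary_ms]


# (studentID, timestampInEpoch) pairs from the CSV posted in the thread.
rows = [
    (101, 1570863600000),
    (102, 1571036400000),
    (103, 1571900400000),
    (104, 1571900400000),
    (105, 1572418800000),
]

boundary = time_boundary([ts for _, ts in rows])
served_ids = [student_id for student_id, _ in offline_rows_served(rows, boundary)]
print(boundary, served_ids)
```

Under these assumptions, row 105 falls after the boundary, so it would have to come from the real-time side. If the real-time table never consumed that row, queries would not return it.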
