# general
b
quick question about the table retention: We are observing a behavior where we have 5 days retention on a table, but when we query, we're getting some records which are older than 5 days too. Could this be happening because there are segments spanning the boundary of the 5 days? Do we check the retention in the query path to make sure each record returned is within the retention window?
m
Retention is not checked in the query path. Also, retention is a periodic background task and not guaranteed to remove old data instantly. The recommendation is to add an explicit time filter in the query.
k
What's the table config?
You might be missing the pushType in the segments config.
b
```json
"segmentsConfig": {
  "schemaName": "eventView",
  "timeType": "MILLISECONDS",
  "segmentPushType": "APPEND",
  "timeColumnName": "event_time_millis",
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "5",
  "replication": "1",
  "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
  "replicasPerPartition": "1"
}
```
pushType is set to `APPEND`. We don't have `segmentPushFrequency` set to anything though.
How does this and the segment threshold time work together? Let's say I've `segmentPushFrequency` set to `HOUR` but `realtime.segment.flush.threshold.time` set to `2d`, what's the behavior? Will segments be rolled every hour, or will it wait till 2d?
n
It will be 2d. In realtime tables, `segmentPushFrequency` is not used for segment completion.
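For reference, in realtime tables the flush threshold that drives segment completion lives under `streamConfigs` inside `tableIndexConfig`; a minimal sketch showing only the property in question (the other required stream properties are omitted):

```json
"tableIndexConfig": {
  "streamConfigs": {
    "streamType": "kafka",
    "realtime.segment.flush.threshold.time": "2d"
  }
}
```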
m
I think the question is about time-boundary.
Maybe not, but good to check how the time boundary behaves in this case.
b
Yeah. This is REALTIME tables only, and I'm looking for an answer to my original question 🙂
k
Can you check the controller logs for messages around retention?
```java
LOGGER.info("Start managing retention for table: {}", tableNameWithType);
```
b
```
2020/09/19 09:38:18.598 INFO [RetentionManager] [pool-6-thread-6] Start managing retention for table: eventView_REALTIME
2020/09/19 15:38:22.541 INFO [RetentionManager] [pool-6-thread-2] Start managing retention for table: eventView_REALTIME
2020/09/19 21:38:26.703 INFO [RetentionManager] [pool-6-thread-4] Start managing retention for table: eventView_REALTIME
2020/09/20 03:38:30.431 INFO [RetentionManager] [pool-6-thread-1] Start managing retention for table: eventView_REALTIME
2020/09/20 09:38:34.414 INFO [RetentionManager] [pool-6-thread-4] Start managing retention for table: eventView_REALTIME
2020/09/20 15:38:38.372 INFO [RetentionManager] [pool-6-thread-4] Start managing retention for table: eventView_REALTIME
2020/09/20 21:38:42.254 INFO [RetentionManager] [pool-6-thread-5] Start managing retention for table: eventView_REALTIME
2020/09/21 03:38:45.769 INFO [RetentionManager] [pool-6-thread-4] Start managing retention for table: eventView_REALTIME
2020/09/21 09:38:49.292 INFO [RetentionManager] [pool-6-thread-3] Start managing retention for table: eventView_REALTIME
2020/09/21 15:38:52.937 INFO [RetentionManager] [pool-6-thread-3] Start managing retention for table: eventView_REALTIME
```
These are the last few matching logs for that table.
So it seems like retention is getting triggered every 6h.
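That 6h cadence lines up with the controller's retention-manager run frequency; a sketch of the controller property governing it (the property name is assumed from the controller defaults of that Pinot era):

```properties
# Controller config sketch: how often the RetentionManager runs.
# 21600 seconds = 6 hours, matching the log cadence above.
controller.retention.frequencyInSeconds=21600
```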
m
Yes, that is expected. You should add an explicit time filter in the query.
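A minimal sketch of such a filter, using the `eventView` table and `event_time_millis` column from the config above; the cutoff literal is a placeholder for "now minus 5 days" in epoch millis, computed by the client (or by a time function if your Pinot version provides one):

```sql
-- Only return rows inside the 5-day retention window, regardless of whether
-- older segments have already been purged by the RetentionManager.
SELECT *
FROM eventView
WHERE event_time_millis >= 1600560000000  -- placeholder: now() - 5 days, epoch millis
LIMIT 10
```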
b
okay
k
Mayank is right about adding an explicit filter... I thought your issue was that no data was getting deleted even when it's well past 5 days?
b
it's getting deleted but a bit late.
It makes sense because the retention isn't checked in the query path
k
cool, thanks for confirming that.
b
oh wait.. I notice something weird. The last 5 days of data is getting purged as per the retention, but there is data from a couple of days from a few months ago. See the image below. Do you know when that could happen? This is all in the REALTIME table only.
m
Is this in the ideal state, or in the deepstore?
Retention should have removed data from the IS (ideal state) for sure, unless somehow those segments contain data in the last 5 days (e.g. if incoming events had bad timestamps from the future).
b
What do you mean by ideal state? I think all those segments are present on the servers locally, but I can double check.
I was also suspecting if it's due to bad timestamps.. will investigate further to see
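One quick way to test that suspicion is an aggregate over the time column (a sketch; the literal is a placeholder for the current time in epoch millis, supplied by the client):

```sql
-- If this returns a non-zero count, some ingested rows carry future timestamps,
-- which keeps their segment's end time inside the retention window.
SELECT COUNT(*), MAX(event_time_millis)
FROM eventView
WHERE event_time_millis > 1600700000000  -- placeholder: "now" in epoch millis
```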
m
Segments may be on the server disk, but that does not imply they are part of Pinot (as in, the server is hosting them). Segments being hosted by a server are present in the IDEAL-STATE.
You can use the UI to check the ideal-state.
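If the UI is not handy, the controller REST API exposes the same views (a sketch; the host and port are placeholders, and the endpoints are assumed to be available in your Pinot version):

```bash
# Dump the ideal state and, for comparison, the external view of the table.
curl "http://<controller-host>:9000/tables/eventView/idealstate"
curl "http://<controller-host>:9000/tables/eventView/externalview"
```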
b
oh okay checking..
you're right. they're not present in the ideal state
m
ok
m
So the only possibility of an old segment not getting deleted is if it has any future-timestamp events? Because we are also facing a similar issue where retention is not working. We see controller logs which mention the retention manager kicked in every 6 hrs, but we still see old data while querying the table.