# general
b
quick question about the table retention: We are observing a behavior where we have 5 days retention on a table, but when we query, we're getting some records which are older than 5 days too. Could this be happening because there are segments spanning the boundary of the 5 days? Do we check the retention in the query path to make sure each record returned is within the retention window?
m
Retention is not checked in the query path. Also, retention is a periodic background task and not guaranteed to remove old data instantly. The recommendation is to add an explicit time filter in the query.
k
What's the table config?
You might be missing the pushType in the segments config.
b
```json
"segmentsConfig": {
  "schemaName": "eventView",
  "timeType": "MILLISECONDS",
  "segmentPushType": "APPEND",
  "timeColumnName": "event_time_millis",
  "retentionTimeUnit": "DAYS",
  "retentionTimeValue": "5",
  "replication": "1",
  "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy",
  "replicasPerPartition": "1"
}
```
pushType is set to `APPEND`. We don't have `segmentPushFrequency` set to anything though.
How does this and the segment threshold time work together? Let's say I've `segmentPushFrequency` set to `HOUR` but `realtime.segment.flush.threshold.time` set to `2d`, what's the behavior? Will segments be rolled every hour, or will it wait till 2d?
n
It will be 2d. In realtime tables, `segmentPushFrequency` is not used for segment completion.
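For reference, in realtime tables the flush threshold that drives segment completion lives under `streamConfigs` inside `tableIndexConfig`; a minimal sketch showing only the property in question (the other required stream properties are omitted):

```json
"tableIndexConfig": {
  "streamConfigs": {
    "streamType": "kafka",
    "realtime.segment.flush.threshold.time": "2d"
  }
}
```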
m
I think the question is about time-boundary.
Maybe not, but good to check how the time boundary behaves in this case.
b
Yeah. This is REALTIME tables only, and I'm looking for an answer to my original question 🙂
k
Can you check the controller logs for messages around retention?
```java
LOGGER.info("Start managing retention for table: {}", tableNameWithType);
```
b
```
2020/09/19 09:38:18.598 INFO [RetentionManager] [pool-6-thread-6] Start managing retention for table: eventView_REALTIME
2020/09/19 15:38:22.541 INFO [RetentionManager] [pool-6-thread-2] Start managing retention for table: eventView_REALTIME
2020/09/19 21:38:26.703 INFO [RetentionManager] [pool-6-thread-4] Start managing retention for table: eventView_REALTIME
2020/09/20 03:38:30.431 INFO [RetentionManager] [pool-6-thread-1] Start managing retention for table: eventView_REALTIME
2020/09/20 09:38:34.414 INFO [RetentionManager] [pool-6-thread-4] Start managing retention for table: eventView_REALTIME
2020/09/20 15:38:38.372 INFO [RetentionManager] [pool-6-thread-4] Start managing retention for table: eventView_REALTIME
2020/09/20 21:38:42.254 INFO [RetentionManager] [pool-6-thread-5] Start managing retention for table: eventView_REALTIME
2020/09/21 03:38:45.769 INFO [RetentionManager] [pool-6-thread-4] Start managing retention for table: eventView_REALTIME
2020/09/21 09:38:49.292 INFO [RetentionManager] [pool-6-thread-3] Start managing retention for table: eventView_REALTIME
2020/09/21 15:38:52.937 INFO [RetentionManager] [pool-6-thread-3] Start managing retention for table: eventView_REALTIME
```
These are the last few matching logs for that table.
So it seems like retention is getting triggered every 6h.
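That 6h cadence lines up with the controller's retention-manager run frequency; a sketch of the controller property governing it (the property name is assumed from the controller defaults of that Pinot era):

```properties
# Controller config sketch: how often the RetentionManager runs.
# 21600 seconds = 6 hours, matching the log cadence above.
controller.retention.frequencyInSeconds=21600
```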
m
Yes, that is expected. You should add an explicit time filter in the query.
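A minimal sketch of such a filter, using the `eventView` table and `event_time_millis` column from the config above; the cutoff literal is a placeholder for "now minus 5 days" in epoch millis, computed by the client (or by a time function if your Pinot version provides one):

```sql
-- Only return rows inside the 5-day retention window, regardless of whether
-- older segments have already been purged by the RetentionManager.
SELECT *
FROM eventView
WHERE event_time_millis >= 1600560000000  -- placeholder: now() - 5 days, epoch millis
LIMIT 10
```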
b
okay
k
Mayank is right about adding an explicit filter... I thought your issue was that no data was getting deleted even when it's well past 5 days?
b
it's getting deleted but a bit late.
It makes sense because the retention isn't checked in the query path
k
cool, thanks for confirming that.
b
oh wait.. I notice something weird. The last 5 days of data is getting purged as per the retention, but there is data from a couple of days from a few months ago. See the image below. Do you know when that could happen? This is all in the REALTIME table only.
m
Is this in the ideal state, or in the deepstore?
Retention should have removed data from the IS (ideal state) for sure, unless somehow those segments contain data in the last 5 days (e.g. if incoming events had bad timestamps from the future).
b
What do you mean by ideal state? I think all those segments are present on the servers locally, but I can double check.
I was also suspecting if it's due to bad timestamps.. will investigate further to see
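One quick way to test that suspicion is an aggregate over the time column (a sketch; the literal is a placeholder for the current time in epoch millis, supplied by the client):

```sql
-- If this returns a non-zero count, some ingested rows carry future timestamps,
-- which keeps their segment's end time inside the retention window.
SELECT COUNT(*), MAX(event_time_millis)
FROM eventView
WHERE event_time_millis > 1600700000000  -- placeholder: "now" in epoch millis
```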
m
Segments may be on the server disk, but that does not imply they are part of Pinot (as in, the server is hosting them). Segments being hosted by a server are present in the IDEAL-STATE.
You can use the UI to check the ideal-state.
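If the UI is not handy, the controller REST API exposes the same views (a sketch; the host and port are placeholders, and the endpoints are assumed to be available in your Pinot version):

```bash
# Dump the ideal state and, for comparison, the external view of the table.
curl "http://<controller-host>:9000/tables/eventView/idealstate"
curl "http://<controller-host>:9000/tables/eventView/externalview"
```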
b
oh okay checking..
you're right. they're not present in the ideal state
m
ok
m
So the only possibility of an old segment not getting deleted is if it has any future-timestamp events? Because we are also facing a similar issue where retention is not working. We see controller logs which mention the retention manager kicked in every 6 hrs, but we still see old data while querying the table.