# getting-started
l
hey friends, long time no chat! we currently have a realtime table with search terms for products, and we keep the data for 7 days. we have a project to increase the retention from 7 days to 30 days, but the further back we query this data, the longer pinot takes to return results, so we were thinking about strategies to make this better. one thought we had: we currently store at hourly resolution, but we really don't need to keep the data that granular, daily would be enough. so we were wondering if we could leverage the RealtimeToOfflineSegmentsTask or the MergeRollupTask. basically this table would become a hybrid table, and we were wondering if you all have any recommendations on how to achieve this with these 2 tools, or if there's a better way. also, is there a way to tell these tasks to move data only from a certain period? for example, we'd like the data from the oldest day in the realtime table to be moved to the offline servers, but also rolled up from hourly to daily. Thoughts, prayers, concerns?
🍷 1
r
Sounds like all you need is the RealtimeToOfflineSegmentsTask

https://youtu.be/V_KNUUrS6DA

m
l
yeahh but so how do I roll up during the realtime-to-offline move? on the realtime table I have hourly time resolution, and I'd want it to be daily once it's moved to the offline servers
I know we have the rollup option, but how does it know it has to roll up to a day?
👀 1
r
Looks like the amount of data to be processed is determined by the `bucketTimePeriod` setting. For a day, you could set it to `1d`
l
so let me explain this better. our records right now are stored hourly:
```
timestamp   product_id shop_id view_count
1674745200           1        1         2
1674748800           1        1         3
1674752400           2        1         4
```
these are all hourly records, so ideally I'd like the timestamps to be truncated to the day after the rollup. so if the above is today at 12, it would become:
```
timestamp   product_id shop_id view_count
1674691200           1        1         5
1674691200           2        1         4
```
does that make sense?
will the RealtimeToOfflineSegmentsTask do something like that?
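For illustration, here's a minimal sketch (plain Python, not Pinot code) of the daily rollup being described: truncate each epoch-second timestamp to the start of its UTC day, group by the remaining dimensions, and sum `view_count`. It reproduces the before/after tables above.

```python
# Hedged sketch of the desired hourly -> daily rollup, assuming
# epoch-second timestamps and a "sum" aggregation on view_count.
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60

# the hourly rows from the example: (timestamp, product_id, shop_id, view_count)
rows = [
    (1674745200, 1, 1, 2),
    (1674748800, 1, 1, 3),
    (1674752400, 2, 1, 4),
]

rolled = defaultdict(int)
for ts, product_id, shop_id, view_count in rows:
    day_start = ts - ts % SECONDS_PER_DAY  # truncate to start of UTC day
    rolled[(day_start, product_id, shop_id)] += view_count

for (ts, product_id, shop_id), view_count in sorted(rolled.items()):
    print(ts, product_id, shop_id, view_count)
# → 1674691200 1 1 5
# → 1674691200 2 1 4
```

In RealtimeToOfflineSegmentsTask terms, the truncation corresponds to `roundBucketTimePeriod` and the summing to `mergeType: rollup` with a sum aggregation on the metric column.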
👀 1
r
yes, I believe so. I'll confirm on my end and will get back to you.
And I believe your configuration should look something like...
```json
"task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "bufferTimePeriod": "2d",
        "roundBucketTimePeriod": "1d",
        "mergeType": "rollup",
        "view_count.aggregationType": "sum",
        "maxNumRecordsPerSegment": "100000"
      }
    }
  }
```
I'm seeing potential issues (seconds being treated as ms) in the RealtimeToOfflineSegmentsTaskGenerator that are preventing the task from being generated on my end. With a couple of hacks to this generator I was able to get the task running but it doesn't appear to have uploaded the replacement segment correctly.
m
did it upload anything or what happened?
r
The minion logs say that the replacement segment was created. Not yet sure why it wasn’t uploaded
Would it help to share my minion logs?
Here's a snippet from my minion logs where I see the task running.
Logs, tables, schema, data, and steps to repro (using 0.12.0)
m
thanks, I’ll take a look
👍 1
r
Looks like the seconds vs ms confusion was caused by the fact that the sample data was in seconds but my `timeType` is set to ms
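As a side note, this kind of unit mismatch is easy to spot: an epoch-seconds value that gets misread as epoch-milliseconds collapses to a date in January 1970. A quick sketch:

```python
# Sketch of the seconds-vs-ms confusion: the same number interpreted as
# epoch seconds lands in 2023, but read as epoch milliseconds it lands
# back in January 1970 (value / 1000 ≈ 19 days after the epoch).
from datetime import datetime, timezone

ts = 1674745200  # one of the hourly timestamps from the example above

as_seconds = datetime.fromtimestamp(ts, tz=timezone.utc)
as_millis = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)

print(as_seconds.year)  # → 2023
print(as_millis.year)   # → 1970
```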
l
like in your schema?
r
yes, and in the schema too.
but now my minion logs show a segment being generated with 2 docs
```
Collected stats for 2 documents
Created dictionary for LONG column: shop_id with cardinality: 1, range: 1 to 1
Created dictionary for LONG column: product_id with cardinality: 2, range: 1 to 2
Created dictionary for LONG column: view_count with cardinality: 2, range: 4 to 5
Created dictionary for LONG column: timestamp with cardinality: 1, range: 1674604800000 to 1674604800000
```
this looks good to me. is this the behavior you were expecting?
if the timestamp looks off it's because I used timestamps from 1/25:
```json
{"timestamp": 1674658800000,"product_id": 1,"shop_id": 1,"view_count": 2}
{"timestamp": 1674662400000,"product_id": 1,"shop_id": 1,"view_count": 3}
{"timestamp": 1674666000000,"product_id": 2,"shop_id": 1,"view_count": 4}
```
@Mark Needham The task is working correctly. I just had to fix the replication in the offline table to match my available servers to allow the segment upload to complete. But the query results now include duplicate results from the same day
The query results included data from both realtime and offline tables because the `ingestionConfig` was missing from the offline table. Now when I query for data before the time boundary (i.e. latest timestamp in offline table - 1 hr) I only get data from the offline table.
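To make the routing concrete, here's a hedged sketch of the time-boundary rule as described above (an assumption about the broker's behavior, not Pinot internals): the boundary is the latest offline timestamp minus one hour, rows at or before the boundary are served by the offline table, and everything newer comes from realtime.

```python
# Hedged sketch of hybrid-table query routing at the time boundary,
# assuming boundary = (max offline end time) - 1 hour, in epoch millis.
ONE_HOUR_MS = 60 * 60 * 1000

def time_boundary(max_offline_end_time_ms: int) -> int:
    """Timestamps at or below this value are answered by the offline table."""
    return max_offline_end_time_ms - ONE_HOUR_MS

def route(row_ts_ms: int, boundary_ms: int) -> str:
    """Pick which side of the hybrid table serves a row's timestamp."""
    return "OFFLINE" if row_ts_ms <= boundary_ms else "REALTIME"

# latest timestamp in the offline table, from the minion log above
boundary = time_boundary(1674604800000)
print(route(1674601200000, boundary))  # → OFFLINE
print(route(1674604800000, boundary))  # → REALTIME
```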
p
it would be a great feature if we had the option to round and take the first/last value in the time bucket, in addition to min/max/sum.