# getting-started
l
hey friends, long time no chat! we currently have a realtime table with search terms for products, and we keep the data for 7 days. we have a project to increase the retention from 7 days to 30 days, but the further back we query this data, the longer pinot takes to return results, so we were thinking about strategies to make this better. one thought we had: we currently store at hourly resolution, but we really don't need to keep the data that granular, daily would be enough. so we were wondering if we could leverage the RealtimeToOfflineSegmentsTask or the MergeRollupTask. basically this table would become a hybrid table, and we were wondering if you all have any recommendations on how to achieve this with these 2 tools, or if there's a better way. also, is there a way to tell these tasks to move data only from a certain period? for example, we'd like the data from the oldest day in the realtime table to be moved to the offline servers, but also rolled up from hourly to daily. Thoughts, prayers, concerns?
🍷 1
r
Sounds like all you need is the RealtimeToOfflineSegmentsTask

https://youtu.be/V_KNUUrS6DA

m
l
yeahh but so how do I roll up during the realtime-to-offline move? on the realtime table I have hourly time resolution, and I'd want it to be daily once it's moved to the offline servers
I know we have the rollup option, but how does it know it has to roll up to a day?
👀 1
r
Looks like the amount of data to be processed is determined by the `bucketTimePeriod` setting. For a day, you could set it to `1d`
l
so let me explain this better. our records right now are stored hourly:
```
timestamp   product_id shop_id view_count
1674745200           1        1         2
1674748800           1        1         3
1674752400           2        1         4
```
these are all hourly records, so ideally I'd like the timestamps to be truncated to the day after the rollup. so if the above is today at 12, it would become:
```
timestamp   product_id shop_id view_count
1674691200           1        1         5
1674691200           2        1         4
```
does that make sense?
will the RealtimeToOfflineSegmentsTask do something like that?
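For illustration, here's a minimal sketch (plain Python, not Pinot code) of the daily rollup being described: truncate each epoch-second timestamp to the start of its UTC day, group by the remaining dimensions, and sum `view_count`. It reproduces the before/after tables above.

```python
# Hedged sketch of the desired hourly -> daily rollup, assuming
# epoch-second timestamps and a "sum" aggregation on view_count.
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60

# the hourly rows from the example: (timestamp, product_id, shop_id, view_count)
rows = [
    (1674745200, 1, 1, 2),
    (1674748800, 1, 1, 3),
    (1674752400, 2, 1, 4),
]

rolled = defaultdict(int)
for ts, product_id, shop_id, view_count in rows:
    day_start = ts - ts % SECONDS_PER_DAY  # truncate to start of UTC day
    rolled[(day_start, product_id, shop_id)] += view_count

for (ts, product_id, shop_id), view_count in sorted(rolled.items()):
    print(ts, product_id, shop_id, view_count)
# → 1674691200 1 1 5
# → 1674691200 2 1 4
```

In RealtimeToOfflineSegmentsTask terms, the truncation corresponds to `roundBucketTimePeriod` and the summing to `mergeType: rollup` with a sum aggregation on the metric column.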
👀 1
r
yes, I believe so. I'll confirm on my end and will get back to you.
And I believe your configuration should look something like...
```json
"task": {
    "taskTypeConfigsMap": {
      "RealtimeToOfflineSegmentsTask": {
        "bucketTimePeriod": "1d",
        "bufferTimePeriod": "2d",
        "roundBucketTimePeriod": "1d",
        "mergeType": "rollup",
        "view_count.aggregationType": "sum",
        "maxNumRecordsPerSegment": "100000"
      }
    }
  }
```
I'm seeing potential issues (seconds being treated as ms) in the RealtimeToOfflineSegmentsTaskGenerator that are preventing the task from being generated on my end. With a couple of hacks to this generator I was able to get the task running but it doesn't appear to have uploaded the replacement segment correctly.
m
did it upload anything or what happened?
r
The minion logs say that the replacement segment was created. Not yet sure why it wasn’t uploaded
Would it help to share my minion logs?
Here's a snippet from my minion logs where I see the task running.
Logs, tables, schema, data, and steps to repro (using 0.12.0)
m
thanks, I’ll take a look
👍 1
r
Looks like the seconds vs ms confusion was caused by the fact that the sample data was in seconds but my `timeType` is set to ms
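As a side note, this kind of unit mismatch is easy to spot: an epoch-seconds value that gets misread as epoch-milliseconds collapses to a date in January 1970. A quick sketch:

```python
# Sketch of the seconds-vs-ms confusion: the same number interpreted as
# epoch seconds lands in 2023, but read as epoch milliseconds it lands
# back in January 1970 (value / 1000 ≈ 19 days after the epoch).
from datetime import datetime, timezone

ts = 1674745200  # one of the hourly timestamps from the example above

as_seconds = datetime.fromtimestamp(ts, tz=timezone.utc)
as_millis = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)

print(as_seconds.year)  # → 2023
print(as_millis.year)   # → 1970
```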
l
like in your schema?
r
yes, and in the schema too.
but now my minion logs show a segment being generated with 2 docs
```
Collected stats for 2 documents
Created dictionary for LONG column: shop_id with cardinality: 1, range: 1 to 1
Created dictionary for LONG column: product_id with cardinality: 2, range: 1 to 2
Created dictionary for LONG column: view_count with cardinality: 2, range: 4 to 5
Created dictionary for LONG column: timestamp with cardinality: 1, range: 1674604800000 to 1674604800000
```
this looks good to me. is this the behavior you were expecting?
if the timestamp looks off it's because I used timestamps from 1/25:
```json
{"timestamp": 1674658800000,"product_id": 1,"shop_id": 1,"view_count": 2}
{"timestamp": 1674662400000,"product_id": 1,"shop_id": 1,"view_count": 3}
{"timestamp": 1674666000000,"product_id": 2,"shop_id": 1,"view_count": 4}
```
@Mark Needham The task is working correctly. I just had to fix the replication in the offline table to match my available servers to allow the segment upload to complete. But the query results now include duplicate results from the same day
The query results included data from both realtime and offline tables because the `ingestionConfig` was missing from the offline table. Now when I query for data before the time boundary (i.e. latest timestamp in offline table - 1 hr) I only get data from the offline table.
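To make the routing concrete, here's a hedged sketch of the time-boundary rule as described above (an assumption about the broker's behavior, not Pinot internals): the boundary is the latest offline timestamp minus one hour, rows at or before the boundary are served by the offline table, and everything newer comes from realtime.

```python
# Hedged sketch of hybrid-table query routing at the time boundary,
# assuming boundary = (max offline end time) - 1 hour, in epoch millis.
ONE_HOUR_MS = 60 * 60 * 1000

def time_boundary(max_offline_end_time_ms: int) -> int:
    """Timestamps at or below this value are answered by the offline table."""
    return max_offline_end_time_ms - ONE_HOUR_MS

def route(row_ts_ms: int, boundary_ms: int) -> str:
    """Pick which side of the hybrid table serves a row's timestamp."""
    return "OFFLINE" if row_ts_ms <= boundary_ms else "REALTIME"

# latest timestamp in the offline table, from the minion log above
boundary = time_boundary(1674604800000)
print(route(1674601200000, boundary))  # → OFFLINE
print(route(1674604800000, boundary))  # → REALTIME
```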
p
it would be a great feature if we had the option to round and take the first/last value in the time bucket, in addition to min/max/sum.