# general
  • Young Seok (Tony) Kim
    07/18/2022, 11:14 PM
    [Question] Hi, I’ve configured Apache Pinot with the deep store connected to Google Cloud Storage. Does this mean that some cold (less frequently used) segments will be persisted in GCS while hot segments are served as a sort of “cache” from the Pinot servers? I’m curious whether:
    • all the segments are distributed across the Pinot servers, OR
    • only frequently used segments are cached on the Pinot servers while unused segments are stored only in the deep store (such as GCS / S3 / Azure Data Lake Storage / HDFS).
    I’m asking because, if we keep adding more and more data, I’m concerned that the number of nodes would always have to increase.
  • Eaugene Thomas
    07/19/2022, 6:05 AM
    Hi, I was going through https://docs.pinot.apache.org/operators/operating-pinot/tuning/routing#partitioning and had a doubt. From my understanding, Pinot segments are partitioned based on timestamp, but the doc above mentions that segments can also be partitioned based on a particular dimension. I'm not clear on the difference between timestamp-based partitioning across segments and dimension-based partitioning across segments. Can anyone explain that more clearly? Thanks in advance!
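    For reference, the dimension-based partitioning that doc describes is an explicit table-config setting, separate from the time-column boundaries segments naturally get. A minimal sketch of such a config, using a hypothetical memberId column and an illustrative partition count, roughly following the linked tuning page:
    "tableIndexConfig": {
      "segmentPartitionConfig": {
        "columnPartitionMap": {
          "memberId": {
            "functionName": "Murmur",
            "numPartitions": 4
          }
        }
      }
    },
    "routing": {
      "segmentPrunerTypes": ["partition"]
    }
    With this, the broker can prune segments whose memberId partition cannot match the query filter, on top of the usual time-based pruning.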
  • Yarden Rokach
    07/19/2022, 11:27 AM
    Another Pinot meetup is coming on July 30 in Bangalore 🇮🇳. More info in the #C03N1JNHXLY channel 🔥
    🔥 5
  • Yarden Rokach
    07/20/2022, 3:36 PM
    This is a conference alert 📣 Speaking opportunities are available in the #C03N1JNHXLY channel. I'm here for any questions you have ❤️ Have a lovely day, everyone!
    🚀 1
  • Priyank Bagrecha
    07/20/2022, 8:30 PM
    this link seems to be broken now. it was working until yesterday.
  • Priyank Bagrecha
    07/20/2022, 8:35 PM
    https://docs.pinot.apache.org/v/release-0.9.0/users/tutorials/ingest-parquet-files-from-s3-using-spark works instead
    👀 1
  • Young Seok (Tony) Kim
    07/20/2022, 10:55 PM
    Hi, this might be related to the above issue, but it seems https://docs.pinot.apache.org/ is entirely unavailable. Is it just me?
  • Xiang Fu
    07/20/2022, 10:56 PM
    Yes, there is a DNS issue with the apache.org domain that we are fixing
  • Xiang Fu
    07/20/2022, 10:56 PM
    please use https://apache-pinot.gitbook.io/latest for now
    👍 3
  • Young Seok (Tony) Kim
    07/20/2022, 10:57 PM
    Thanks for providing an alternative! 🙂
  • Sudharsan Kannan
    07/21/2022, 3:58 AM
    Team, https://docs.pinot.apache.org/ is not accessible
  • Mohit S
    07/22/2022, 11:05 AM
    Hey Everyone! Just getting started with Pinot. Is there any example of how to use a custom decoder during stream ingestion? I am following this example: https://docs.pinot.apache.org/basics/getting-started/pushing-your-streaming-data-to-pinot. My data is in a custom binary format. It looks like I have to implement `org.apache.pinot.spi.stream.StreamMessageDecoder`. Any reference code example would be helpful.
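    For reference, a minimal sketch of what such a decoder could look like, assuming the stream payload is a raw byte[] with a hypothetical fixed layout (a 4-byte int id followed by an 8-byte long ts); the exact StreamMessageDecoder method signatures may differ between Pinot versions:
    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.Set;
    import org.apache.pinot.spi.data.readers.GenericRow;
    import org.apache.pinot.spi.stream.StreamMessageDecoder;

    // Decodes a custom fixed binary layout into a Pinot GenericRow.
    public class MyBinaryMessageDecoder implements StreamMessageDecoder<byte[]> {

      @Override
      public void init(Map<String, String> props, Set<String> fieldsToRead, String topicName)
          throws Exception {
        // Decoder properties from the table's streamConfig (e.g. stream.kafka.decoder.prop.*)
        // arrive in props; nothing is needed for this fixed layout.
      }

      @Override
      public GenericRow decode(byte[] payload, GenericRow destination) {
        return decode(payload, 0, payload.length, destination);
      }

      @Override
      public GenericRow decode(byte[] payload, int offset, int length, GenericRow destination) {
        ByteBuffer buffer = ByteBuffer.wrap(payload, offset, length);
        destination.putValue("id", buffer.getInt());   // column names here are made up
        destination.putValue("ts", buffer.getLong());
        return destination;
      }
    }
    The decoder class is then referenced from the table's streamConfig, e.g. via stream.kafka.decoder.class.name for a Kafka stream.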
  • Tim Berglund
    07/22/2022, 9:40 PM
    Hey, folks! StarTree just opened up a new Slack workspace today. Here’s the blog post I wrote outlining what we did and why we did it.
  • Tim Berglund
    07/22/2022, 9:40 PM
    Basically, we need a place to talk about StarTree-ish things in addition to Pinot-ish things, and it’s not okay for us, a vendor, to do that in an Apache-branded Slack workspace like this one. We’ll all still be here helping folks new to the Apache Pinot™ community learn and solve problems, but we’ll be intentional about keeping StarTree-specific conversations out of this workspace and in the StarTree one.
  • Tim Berglund
    07/22/2022, 9:40 PM
    That said, ours is still a non-commercial, community space. Please head over and join if it sounds interesting. It will look and feel a lot like this one in terms of the general lack of product pitches, other than our shared enthusiasm for Apache Pinot and the growing category of analytical queries that run really really fast. 🙂
  • Tim Berglund
    07/22/2022, 9:40 PM
    Just go to https://stree.ai/slack to join. StarTree people will occasionally mention this if it’s a more appropriate venue to address a question; otherwise, we won’t be doing a lot more promoting of it here.
  • Tim Berglund
    07/22/2022, 9:40 PM
    Lemme know if you have questions!
    dancingcharmander 6
    🍷 16
    🆒 3
    🔥 8
  • Sukesh Boggavarapu
    07/26/2022, 9:42 PM
    I have a hybrid table. What tasks should I create in order to have both daily and hourly rollups?
  • Sukesh Boggavarapu
    07/26/2022, 9:43 PM
    A `RealtimeToOfflineSegmentsTask` will generate segments from my realtime table and create offline segments.
  • Sukesh Boggavarapu
    07/26/2022, 9:43 PM
    "RealtimeToOfflineSegmentsTask": {
            "bucketTimePeriod": "1h",
            "bufferTimePeriod": "2h",
            "roundBucketTimePeriod": "1m",
            "mergeType": "rollup",
            "revenue.aggregationType": "sum",
            "maxNumRecordsPerSegment": "100000"
          }
  • Sukesh Boggavarapu
    07/26/2022, 9:44 PM
    So that configuration will create an hourly rollup that gets added to the offline table of my hybrid table.
  • Sukesh Boggavarapu
    07/26/2022, 9:44 PM
    Can I also do a daily rollup here in `RealtimeToOfflineSegmentsTask`?
  • Sukesh Boggavarapu
    07/26/2022, 9:45 PM
    Or should I create a `MergeRollupTask` in the offline table?
  • Sukesh Boggavarapu
    07/26/2022, 9:45 PM
    "MergeRollupTask": {
            "1hour.mergeType": "rollup",
            "1hour.bucketTimePeriod": "1h",
            "1hour.bufferTimePeriod": "3h",
            "1hour.maxNumRecordsPerSegment": "1000000",
            "1hour.maxNumRecordsPerTask": "5000000",
            "1hour.maxNumParallelBuckets": "5",
            "1day.mergeType": "rollup",
            "1day.bucketTimePeriod": "1d",
            "1day.bufferTimePeriod": "1d",
            "1day.roundBucketTimePeriod": "1d",
            "1day.maxNumRecordsPerSegment": "1000000",
            "1day.maxNumRecordsPerTask": "5000000",
            "metricColA.aggregationType": "sum",
            "metricColB.aggregationType": "max"
          }
  • Sukesh Boggavarapu
    07/26/2022, 9:46 PM
    What does that do? It creates both hourly and daily segments for the same table?
  • Sukesh Boggavarapu
    07/26/2022, 9:46 PM
    So, in total, would I need both `RealtimeToOfflineSegmentsTask` and `MergeRollupTask`?
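    For reference, in a hybrid table the two task configs normally live in different table configs, both under the "task" / "taskTypeConfigsMap" section. A rough sketch, reusing illustrative values from the snippets above: the realtime table config carries
    "task": {
      "taskTypeConfigsMap": {
        "RealtimeToOfflineSegmentsTask": {
          "bucketTimePeriod": "1h",
          "bufferTimePeriod": "2h",
          "mergeType": "rollup",
          "revenue.aggregationType": "sum"
        }
      }
    }
    while the offline table config carries the MergeRollupTask section, e.g.
    "task": {
      "taskTypeConfigsMap": {
        "MergeRollupTask": {
          "1day.mergeType": "rollup",
          "1day.bucketTimePeriod": "1d",
          "1day.bufferTimePeriod": "1d",
          "revenue.aggregationType": "sum"
        }
      }
    }
    so hourly plus daily rollups on the offline side are typically handled by MergeRollupTask there, rather than by a second RealtimeToOfflineSegmentsTask.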
  • Sukesh Boggavarapu
    07/26/2022, 9:46 PM
    Thanks
  • Mugdha Goel
    07/27/2022, 3:18 PM
    Hello, I am using a GCS bucket as my deep store, and I also have a `RealtimeToOfflineSegmentsTask` set up to convert realtime segments to offline segments. I would like to store only the offline segments in GCS and not the realtime segments, because reading realtime segments from GCS is causing an issue for some of my tables. Where could I find the configuration for storing only offline segments in GCS?
  • Sukesh Boggavarapu
    07/28/2022, 12:27 AM
    Hi, how do we replace offline segments created by the `RealtimeToOfflineSegmentsTask` / `MergeRollupTask` if we ever want to?
  • Sukesh Boggavarapu
    07/28/2022, 12:28 AM
    Like, if we want to replace the segments from 30 days ago in the offline tables, how do we go about doing it? I am not sure whether the segment names created by the `RealtimeToOfflineSegmentsTask` / `MergeRollupTask` would match the segment names created by the offline batch ingestion job.