Apache Pinot #general

Alex

02/17/2020, 7:51 PM

@User what do you mean?

Kishore G

02/17/2020, 7:54 PM

I meant we don't allow generating a star-tree index on the fly. There is no reason for this restriction other than that star-tree index generation is a bit resource intensive and might impact query performance while the index is being generated.

Alex

02/17/2020, 7:56 PM

yep, that is what I thought. I'm trying to figure out the automation that will be required to do those updates. Currently I'm thinking that if we need to update the star-tree, we will run a segment generation job (Spark or Flink based) and upload the new segments into the Pinot cluster. Does that sound like an OK idea?

Xiang Fu

02/17/2020, 8:00 PM

yes^^

Alex

02/17/2020, 8:02 PM

great! So, which is the better idea: use some external dataset (Hadoop) to do it, or use the existing Pinot segment files?

Kishore G

02/17/2020, 8:02 PM

There is something called Minion

Kishore G

02/17/2020, 8:03 PM

It can be used for these tasks. It’s used for running GDPR-related tasks, optimizing segments, etc.

Kishore G

02/17/2020, 8:03 PM

Take a look at that. You can add a star-tree index generation task to it

Alex

02/17/2020, 8:14 PM

I thought minions are deprecated, is it a good idea to use them?

Kishore G

02/17/2020, 9:19 PM

Yes. We will add more and more tasks to minion

Mayank

02/17/2020, 9:21 PM

Curious where you got the idea that minions are deprecated. If it’s a documentation issue, we should fix that.

Alex

02/17/2020, 9:32 PM

nope, I think somebody mentioned it at some point

Alex

02/17/2020, 9:48 PM

ok, if we use minions, what is the right way to run them on Kubernetes? Just have a CronJob that launches a specific task? Any good examples on GitHub we can check?

Kishore G

02/17/2020, 9:50 PM

Minion uses Helix Task Framework. @User can you share the design doc and some examples for Minion

Alex

02/17/2020, 9:51 PM

so, minions always run?

Mayank

02/17/2020, 9:51 PM

Yes

Alex

02/17/2020, 9:54 PM

what is the trigger to execute the work?

Mayank

02/17/2020, 9:54 PM

IIRC, the tasks are defined using Helix task framework

Kishore G

02/17/2020, 9:55 PM

the trigger can be manual or scheduled

Kishore G

02/17/2020, 9:55 PM

or it can also be based on another resource

Kishore G

02/17/2020, 9:55 PM

e.g., you can configure it to run some task whenever a new segment is uploaded

Alex

02/17/2020, 9:56 PM

oh, that is good to have

Kishore G

02/17/2020, 9:56 PM

or you can run something every day at 12:00 midnight

Kishore G

02/17/2020, 9:56 PM

or you can even write your own task generator
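To make the idea of a custom task generator concrete, here is a minimal sketch in Python. It is not Pinot's actual Java API: `SegmentInfo`, `Task`, `generate_star_tree_tasks`, and the task type string are all hypothetical names chosen for illustration. The sketch assumes a generator is handed a table's segment metadata on each scheduled run and returns the work that still needs doing.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SegmentInfo:
    """Hypothetical segment metadata; not a Pinot class."""
    name: str
    has_star_tree: bool


@dataclass(frozen=True)
class Task:
    """Hypothetical unit of work handed to a worker."""
    task_type: str
    config: Dict[str, str]


def generate_star_tree_tasks(
    table: str, segments: List[SegmentInfo], max_tasks: int = 10
) -> List[Task]:
    """Emit one task per segment that lacks a star-tree index, capped per run."""
    tasks: List[Task] = []
    for seg in segments:
        if seg.has_star_tree:
            continue  # nothing to do for this segment
        if len(tasks) >= max_tasks:
            break  # leave the remainder for the next scheduled run
        tasks.append(
            Task(
                task_type="StarTreeIndexGenTask",  # hypothetical task type name
                config={"tableName": table, "segmentName": seg.name},
            )
        )
    return tasks
```

Capping the number of tasks per run is one way to limit how much index-building load is scheduled at once, which matters given that star-tree generation is resource intensive.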

Kishore G

02/17/2020, 9:57 PM

the good thing about minion framework is it abstracts out all the common things needed to perform some action on a segment

Kishore G

02/17/2020, 9:58 PM

it can download segments, upload segments, do the bookkeeping, etc., and it also provides Pinot segment readers
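The download, process, upload, and bookkeeping steps described above can be sketched roughly as follows. This is an illustration of the pattern only, not Minion's implementation: `run_segment_task`, its callable parameters, and the status strings are assumptions made for the example.

```python
from typing import Callable, Dict


def run_segment_task(
    segment_name: str,
    download: Callable[[str], bytes],
    convert: Callable[[bytes], bytes],
    upload: Callable[[str, bytes], None],
    status: Dict[str, str],
) -> None:
    """Download a segment, transform it, upload the result, and record the outcome."""
    status[segment_name] = "IN_PROGRESS"
    try:
        original = download(segment_name)  # fetch the existing segment
        rebuilt = convert(original)        # task-specific work, e.g. adding an index
        upload(segment_name, rebuilt)      # replace the segment with the new version
    except Exception:
        status[segment_name] = "FAILED"
        raise
    status[segment_name] = "COMPLETED"
```

With this shape, only `convert` changes from task to task; the surrounding steps are shared, which is the kind of abstraction being described.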

Alex

02/17/2020, 10:00 PM

nice. will need to dig into this. On kube it feels like a waste to run infra which just sits idle when there are no tasks. Would be great to provision pods on demand. Is it possible today?

Kishore G

02/17/2020, 10:08 PM

no, but that will be a great enhancement and Pinot has the primitives to achieve that

👍 1

Kishore G

02/17/2020, 10:29 PM

feel free to start an issue around this, I will add my thoughts and provide some pointers

Ting Chen

02/18/2020, 8:12 PM

Does Pinot stop early once enough results have been collected? We have queries of the form "SELECT * FROM table WHERE userID='H' AND sourceEventTimestamp>=t1 AND sourceEventTimestamp<=t2 ORDER BY sourceEventTimestamp DESC LIMIT 500". The table has been sorted by sourceEventTimestamp and userID has an inverted index. I notice that the query is not very selective (meaning many rows pass the condition), so the first 500 results should be collected relatively quickly. But the execution times are long, i.e., > 10s.
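For reference, the kind of early stop being asked about can be sketched as follows. This illustrates the idea only and says nothing about what Pinot's query engine actually does. It assumes the rows are held in one list sorted ascending by `sourceEventTimestamp`; the function name `latest_matching` is made up for the example.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def latest_matching(
    rows: List[Row], user_id: str, t1: int, t2: int, limit: int
) -> Tuple[List[Row], int]:
    """Scan rows (sorted ascending by timestamp) from newest to oldest.

    Returns the matching rows in descending timestamp order and the number
    of rows examined, stopping as soon as `limit` matches are found.
    """
    results: List[Row] = []
    scanned = 0
    for row in reversed(rows):
        scanned += 1
        ts = row["sourceEventTimestamp"]
        if ts > t2:
            continue  # newer than the requested range, keep going back
        if ts < t1:
            break  # older than the range; nothing further can match
        if row["userID"] == user_id:
            results.append(row)
            if len(results) == limit:
                break  # early stop: enough results collected
    return results, scanned
```

Because the scan order matches the ORDER BY direction, the first `limit` matches are already the final answer, so the number of rows examined depends on how dense the matches are rather than on the table size.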