I got some data ingested and am using a star tree index and Apache Pinot #general

Aaron Wishnick

05/14/2021, 5:28 PM

I got some data ingested and am using a star tree index and I'm running a query like `select foo, percentiletdigest(bar, 0.5) from mytable group by foo`. I've got `foo` in my `dimensionsSplitOrder` and I've got `PERCENTILE_TDIGEST__bar` as well as `AVG__bar` in my `functionColumnPairs`. My query takes about 700 ms but if I switch it to `avg(bar)` it takes 15 ms. Is it expected that the t-digest would be that much slower? Anything I can do to speed it up?
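For readers following along, the star-tree setup described in this message might look roughly like the sketch below. The field names (`starTreeIndexConfigs`, `dimensionsSplitOrder`, `skipStarNodeCreationForDimensions`, `functionColumnPairs`, `maxLeafRecords`) follow the Pinot star-tree documentation. Only `foo`, `bar`, and the two function-column pairs come from the thread. The remaining dimensions are not listed, and the `maxLeafRecords` value shown is the documented default, which is an assumption about this table.

```python
import json

# Sketch of one entry in tableIndexConfig.starTreeIndexConfigs.
# Only "foo", "bar" and the two function-column pairs come from the thread.
star_tree_index_config = {
    # The real table has more dimensions here; they are not named in the thread.
    "dimensionsSplitOrder": ["foo"],
    "skipStarNodeCreationForDimensions": [],
    # Pre-aggregated metrics, written as <FUNCTION>__<column>.
    "functionColumnPairs": ["PERCENTILE_TDIGEST__bar", "AVG__bar"],
    # Assumption: the documented default, since the thread does not state a value.
    "maxLeafRecords": 10000,
}

table_index_config = {"starTreeIndexConfigs": [star_tree_index_config]}
print(json.dumps(table_index_config, indent=2))
```

A query can only be served from the star-tree when its aggregation appears in `functionColumnPairs` and its filter and group-by columns appear in `dimensionsSplitOrder`, which is why both pairs are listed.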

Xiang Fu

05/14/2021, 5:31 PM

@User does pinot support percentile tdigest in startree?

Xiang Fu

05/14/2021, 5:32 PM

in response stats, do you see same number of docs scanned for both queries?

Jackie

05/14/2021, 5:36 PM

Yes, startree supports TDigest. See https://docs.pinot.apache.org/basics/indexing/star-tree-index for more details

Jackie

05/14/2021, 5:41 PM

Is the query constantly taking 700ms?

Aaron Wishnick

05/14/2021, 5:42 PM

For avg and percentiletdigest, numDocsScanned is 969792.

Aaron Wishnick

05/14/2021, 5:42 PM

Yeah, consistently in that range. It just took 1057 ms when I ran it

Mayank

05/14/2021, 7:35 PM

Yeah tdigest aggregation over 1M docs might take that long

Aaron Wishnick

05/14/2021, 7:36 PM

What does `numDocsScanned` mean in the context of a star tree index?

Mayank

05/14/2021, 7:36 PM

Do you have query latency with just tdigest?

Aaron Wishnick

05/14/2021, 7:36 PM

What do you mean?

Mayank

05/14/2021, 7:37 PM

Query with percentile tdigest but without avg

Aaron Wishnick

05/14/2021, 7:37 PM

Oh sorry, that's what I meant

Mayank

05/14/2021, 7:37 PM

Oh ok

Aaron Wishnick

05/14/2021, 7:37 PM

`select foo, percentiletdigest(bar, 0.5) from mytable group by foo` is slow, `select foo, avg(bar) from mytable group by foo` is fast

Mayank

05/14/2021, 7:37 PM

Docs scanned should mean the same

Mayank

05/14/2021, 7:38 PM

Split order helps with filtering

Mayank

05/14/2021, 7:39 PM

@User does it help with group by or just filtering?

Aaron Wishnick

05/14/2021, 7:40 PM

If I have 969792 numDocsScanned and 8950109972 totalDocs, what does numDocsScanned mean? Is that the number of star tree nodes or something?
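To put the two numbers in this question side by side, here is a small arithmetic sketch. One reading, not confirmed in the thread, is that with a star-tree the engine counts the pre-aggregated star-tree records it reads rather than raw rows, which would explain why `numDocsScanned` is so much smaller than `totalDocs`.

```python
# Figures quoted in the thread.
num_docs_scanned = 969_792
total_docs = 8_950_109_972

# Share of the table's documents that the query reports as scanned.
fraction_scanned = num_docs_scanned / total_docs

# How many raw documents stand behind each scanned document, on average.
reduction_factor = total_docs / num_docs_scanned

print(f"fraction of totalDocs scanned: {fraction_scanned:.4%}")
print(f"average raw docs per scanned doc: {reduction_factor:,.0f}")
```

Under that reading, each scanned record stands in for several thousand raw rows on average.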

Jackie

05/14/2021, 7:45 PM

@User Most of the time, just filtering

Jackie

05/14/2021, 7:47 PM

@User Do you need 0.5 percentile or 50 percentile? The aggregation cost of `percentiletdigest` is expected to be much higher than `avg`
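The percentile question matters because, as far as the Pinot documentation describes it, the second argument of `percentiletdigest` is on a 0 to 100 scale. That means `0.5` asks for the 0.5th percentile and the median would be `50`. A minimal sketch that builds both query strings follows. The helper name `percentile_query` is hypothetical, and the table and column names are the ones from the thread.

```python
def percentile_query(percentile: float) -> str:
    """Build the thread's group-by query for a percentile on a 0-100 scale."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    # The 'g' format keeps 50 as "50" and 0.5 as "0.5".
    return (
        f"select foo, percentiletdigest(bar, {percentile:g}) "
        "from mytable group by foo"
    )


original_query = percentile_query(0.5)  # what the thread ran: 0.5th percentile
median_query = percentile_query(50)     # the median, if that was the intent

print(original_query)
print(median_query)
```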

Aaron Wishnick

05/14/2021, 8:04 PM

Eh I don't actually care about which percentile just yet -- just the performance

Aaron Wishnick

05/14/2021, 8:05 PM

Is there anything I can do to speed it up? A lot of my users here prefer quantiles, I think performance there will really matter

Aaron Wishnick

05/14/2021, 8:05 PM

The avg performance is... awesome

Mayank

05/14/2021, 8:05 PM

Your query does not have filters

Mayank

05/14/2021, 8:05 PM

Will it be the case always?

Aaron Wishnick

05/14/2021, 8:06 PM

Could be

Aaron Wishnick

05/14/2021, 8:06 PM

Right now I only have a small subset of the data, but yeah people might be filtering by date at the very least

Aaron Wishnick

05/14/2021, 8:06 PM

Do you expect filters to help a lot?

Mayank

05/14/2021, 8:06 PM

It will cut down numDocsScanned right

Aaron Wishnick

05/14/2021, 8:07 PM

Right

Aaron Wishnick

05/14/2021, 8:07 PM

I'd expect people to be scanning a similar number of documents if not an order of magnitude more

Mayank

05/14/2021, 8:08 PM

@User Any ideas on using pre-aggregates within star tree here?

Mayank

05/14/2021, 8:09 PM

Also, @User, in production will you have the same cluster size as right now? Because if you have more servers, you'll get better perf

Jackie

05/14/2021, 8:11 PM

If `foo` is the first dimension in the split order, then it will always use the pre-aggregated docs

Jackie

05/14/2021, 8:12 PM

@User What's the cardinality of `foo`? How many segments do you have right now?

Aaron Wishnick

05/14/2021, 8:22 PM

Foo's cardinality is about 6

Aaron Wishnick

05/14/2021, 8:22 PM

462 segments

Aaron Wishnick

05/14/2021, 8:22 PM

5 servers

Aaron Wishnick

05/14/2021, 8:33 PM

Foo is third in dimensionsSplitOrder, there are 7 fields total in there
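Combining the figures reported so far gives a rough per-segment picture. This is a back-of-the-envelope sketch that assumes documents are spread evenly across segments and segments evenly across servers, which the thread does not confirm.

```python
# Figures quoted in the thread.
num_docs_scanned = 969_792
num_segments = 462
num_servers = 5
foo_cardinality = 6  # "about 6" in the thread

# Average documents scanned per segment and per server (even spread assumed).
docs_per_segment = num_docs_scanned / num_segments
docs_per_server = num_docs_scanned / num_servers

# If every segment were fully pre-aggregated on foo alone, a group-by on foo
# with no filter would need at most one record per foo value per segment.
ideal_docs_per_segment = foo_cardinality
ideal_total_docs = ideal_docs_per_segment * num_segments

print(f"scanned per segment: {docs_per_segment:,.0f}")
print(f"scanned per server:  {docs_per_server:,.0f}")
print(f"fully pre-aggregated lower bound: {ideal_total_docs:,} docs in total")
```

The gap between roughly two thousand records per segment and a handful is consistent with leaf nodes that were not split further, although the thread does not verify that this is the cause.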

Jackie

05/14/2021, 9:05 PM

In that case, to further optimize the performance, you may reduce the `maxLeafRecords` threshold, although this will increase the size of the star-tree

Mayank

05/14/2021, 9:06 PM

Just to call out, a lot of the latency inherently comes from the TDigest library.

Mayank

05/14/2021, 9:07 PM

It is pretty good in providing accuracy in limited storage, but there's a latency cost.

Aaron Wishnick

05/14/2021, 9:09 PM

Is q-digest any better? My understanding was that t-digest is faster and more accurate

Aaron Wishnick

05/14/2021, 9:12 PM

Do you have any approximate guidelines around how much faster performance will be and how much more space the star tree will take up as maxLeafRecords is decreased?

Mayank

05/14/2021, 9:14 PM

Yes, t-digest is definitely better than others. But it may not give you 10ms latency if you are aggregating 1M records.

Aaron Wishnick

05/14/2021, 9:16 PM

How can I get to, say, 200ms?

Mayank

05/14/2021, 9:17 PM

Tuning star tree (Jackie?), index size, server cores/jvm/params, etc

Jackie

05/14/2021, 9:18 PM

For star-tree, you can trade extra space for performance by reducing the `maxLeafRecords`

Jackie

05/14/2021, 9:19 PM

Reducing that to 1 will give you fully pre-cubed data
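A minimal sketch of the suggested change, under some assumptions: the baseline config is a guess at this table's setup (only `foo`, `bar`, and the function-column pairs come from the thread), and the helper `with_max_leaf_records` is hypothetical, not a Pinot API. It only shows which field to lower.

```python
import copy
import json

# Assumed baseline; the thread does not show the real config.
baseline_config = {
    "dimensionsSplitOrder": ["foo"],  # the real table lists 7 dimensions, foo third
    "skipStarNodeCreationForDimensions": [],
    "functionColumnPairs": ["PERCENTILE_TDIGEST__bar", "AVG__bar"],
    "maxLeafRecords": 10000,  # assumed documented default
}


def with_max_leaf_records(config: dict, max_leaf_records: int) -> dict:
    """Return a copy of a star-tree config with a different leaf threshold."""
    if max_leaf_records < 1:
        raise ValueError("maxLeafRecords must be at least 1")
    updated = copy.deepcopy(config)
    updated["maxLeafRecords"] = max_leaf_records
    return updated


# maxLeafRecords = 1 keeps splitting until no leaf holds more than one record,
# which is the "fully pre-cubed" case mentioned above.
fully_cubed_config = with_max_leaf_records(baseline_config, 1)
print(json.dumps({"starTreeIndexConfigs": [fully_cubed_config]}, indent=2))
```

The thread gives no numbers for how much the star-tree grows as the threshold drops, so intermediate values are worth measuring on a sample of segments before committing to 1.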
